SonOfLilit / kleenexp

modern regular expression syntax everywhere with a painless upgrade path
MIT License

Feature/add inline flags #34

Open Yehuda-blip opened 1 year ago

Yehuda-blip commented 1 year ago

These three flags are mutually exclusive in regex: [ASCII, UNICODE, LOCALE]. In practice, this means the inline flags [a, u, L] cannot be combined at the same nesting level. When one of these flags is nested deeper than another, it overrides the outer flag for the substring it affects.

There are two problems with this implementation.

First, Python regex allows a flag group with an empty body, (?i:), while this PR does not allow [ignore_case] or [ignore_case ''] (the latter is easy to fix with half a line in compiler.py line:255). However, I don't hate this and think it's probably the better behavior.

The bigger problem is that (?u:(?a:somestring)) is allowed in regex, with ASCII overriding the UNICODE flag (the same goes for all combinations of the incompatible flags [ASCII, UNICODE, LOCALE]). Here, [unicode [ascii_only 'somestring']] is not a valid expression, because the nesting depth of an expression is not kept in the parsing flow. Note, however, that this is a generalization of the first problem: the concat operator makes it possible to write [unicode [ascii_only 'somestring']'anything'], so the expression is invalid only when the outer flag is not operating on anything of its own. Still, with [unicode [ascii_only 'somestring']] it is less obvious what the problem is when translating from regex to kleenexp.

The four options I can think of are:

a. Add a nesting depth value to nodes when parsing. I don't want to do this, because I don't understand enough of the parsing and it seems like a major change.
b. Add this to a collection of 'problematic behaviors' somewhere.
c. Remove the unicode and locale_dependent flags completely from kleenexp. This is kind of an overreaction in my opinion.
d. Ignore it, which is what I'm doing now and is probably an inferior solution to b.
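For reference, the regex-side behavior described above can be checked with a short script. This is only a sketch of what Python's re module does, assuming Python 3.7 or later (where a and u are accepted inside scoped flag groups). It does not exercise kleenexp itself, and the helper name compiles is made up for illustration. The L flag is left out because re only accepts it in bytes patterns.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Same nesting level: 'a' and 'u' in a single flag group are rejected.
same_level_ok = compiles(r"(?au:\w)")

# Different nesting levels: accepted, and the inner flag wins for its substring.
nested = re.compile(r"(?u:\w(?a:\w))")
# Outer UNICODE \w takes the non-ASCII letter, inner ASCII \w takes 'a'.
unicode_then_ascii = nested.fullmatch("\u00e9a") is not None
# Inner ASCII \w refuses the non-ASCII letter.
ascii_rejects_non_ascii = nested.fullmatch("a\u00e9") is None

# Outer flag wrapping nothing but the inner group: the case from this PR.
wrapped_only_ok = compiles(r"(?u:(?a:somestring))")

# A flag group with an empty body is also accepted by re.
empty_body_ok = compiles(r"(?i:)")
```

The last two checks correspond to the two problems listed above: both patterns are valid in re, while their direct kleenexp translations are rejected by this PR.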