These three flags are mutually exclusive in regex: [ASCII, UNICODE, LOCALE]. Roughly, this means the inline flags [a, u, L] cannot be combined in the same flag group at the same nesting level. When one of these flags is nested deeper than another, it overrides the outer flag for the substring it affects.
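For reference, here is a small sketch of that behavior in Python's standard `re` module (not kleenexp itself). The `\w+` pattern and the test character are my own choices for illustration; they are not taken from this PR.

```python
import re

# Same nesting level: 'a' and 'u' cannot share one inline flag group.
try:
    re.compile(r"(?au:\w+)")
    same_level_error = None
except re.error as exc:
    same_level_error = str(exc)

# Nested: the inner flag wins for the substring it wraps.
e_acute = "\u00e9"  # a non-ASCII word character, chosen for illustration
inner_ascii = re.fullmatch(r"(?u:(?a:\w+))", e_acute)    # inner ASCII: \w rejects it
inner_unicode = re.fullmatch(r"(?a:(?u:\w+))", e_acute)  # inner UNICODE: \w accepts it

print("same-level error:", same_level_error)
print("inner ASCII matched:", inner_ascii is not None)
print("inner UNICODE matched:", inner_unicode is not None)
```

LOCALE is left out of the sketch because `re` only accepts the `L` flag with bytes patterns.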
There are two problems with this implementation:
Python regex allows a flag group with an empty body - (?i:) - while this PR allows neither [ignore_case] nor [ignore_case ''] (the latter is easy to fix with half a line in compiler.py line:255). However, I don't hate this and think it's probably the better behavior.
The bigger problem is that (?u:(?a:somestring)) is allowed in regex, with ASCII overriding the UNICODE flag (the same holds for every nested combination of the incompatible flags [ASCII, UNICODE, LOCALE]).
In this PR, the equivalent - [unicode [ascii_only 'somestring']] - is not a valid expression, because the nesting depth of an expression is not kept in the parsing flow. Note, however, that this is a kind of generalization of the first problem: the concat operator makes it possible to write [unicode [ascii_only 'somestring']'anything'], so the expression is invalid only when the outer flag is not operating on anything of its own. Still, in the [unicode [ascii_only 'somestring']] case it is much less obvious what the problem is when translating from regex to kleenexp.
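To make the regex side of both problems concrete, this sketch checks that Python's `re` accepts the patterns quoted above. The third pattern is my own assumed regex counterpart of the concatenated kleenexp form; it does not appear elsewhere in this PR.

```python
import re

# Problem 1: a scoped flag group with an empty body compiles fine.
empty_group = re.compile(r"(?i:)")

# Problem 2: an outer UNICODE group whose whole body is an inner ASCII group.
nested = re.compile(r"(?u:(?a:somestring))")

# Assumed counterpart of the concat form, where the outer group
# also has a body of its own.
nested_concat = re.compile(r"(?u:(?a:somestring)anything)")

print(empty_group.fullmatch("") is not None)
print(nested.fullmatch("somestring") is not None)
print(nested_concat.fullmatch("somestringanything") is not None)
```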
The four options I can think of for this are:
a. Add a nesting depth value to nodes when parsing - I don't want to do this, because I don't understand the parsing well enough and it seems like a major change.
b. Add this to a collection of 'problematic behaviors' somewhere.
c. Remove the unicode and locale_dependent flags completely from kleenexp - kind of an overreaction in my opinion.
d. Ignore it, which is what I'm doing now and is probably inferior to option b.