Open padde opened 3 years ago
I had to deal with a similar issue very recently. FWIW this is a behaviour in a lot of regex engines - at least in PCRE2, PCRE, JavaScript, Go (RE2), and Rust's regex
Well, the engine works just fine, but the regexp is set up wrong AFAICS. The lib itself is adding the surrounding `^` and `$` to implement implicit anchoring. However, the implementation is too naive and disregards the fact that there are lower-precedence operators. As can be seen with my prefix/suffix examples, the anchoring sometimes does not even work correctly, because the library-added anchors are gobbled up by the alternation. Changing the precedence might not be the fix after all. Maybe the implicit anchoring needs to be implemented differently altogether.
Ah. Yes, I agree. The behaviour of the implicitly added anchors in this case is indeed incorrect
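For readers who want to see the precedence effect in isolation, here is a minimal sketch using Python's `re` module as a stand-in. It is not `xmerl_regexp`, and the helper names `naive_anchor` and `matches` are hypothetical. It relies only on the fact that alternation binds more loosely than the anchors, which also holds in Python's `re`.

```python
import re

def naive_anchor(pattern: str) -> str:
    """Hypothetical helper: add ^...$ by plain string concatenation."""
    return "^" + pattern + "$"

def matches(pattern: str, text: str) -> bool:
    """True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

naive = naive_anchor("a|b")   # "^a|b$", which parses as (^a)|(b$)
grouped = "^(?:a|b)$"         # the alternation is grouped before anchoring

for text in ["a", "b", "ax", "xb", "xax"]:
    # "ax" and "xb" match the naive pattern but not the grouped one
    print(text, matches(naive, text), matches(grouped, text))
```

With the naive pattern, a string that merely starts with `a` or merely ends with `b` is accepted, so the anchors no longer enforce a whole-string match.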
Hi, I'm not able to look at this problem just now due to other work, so I put it in "Help Wanted". BR Lars
@padde, I do happen to know a bit about this problem. And as much as I'd love to help on the "help wanted" tag, I just don't have the time right now. You (or anyone else) could have a look at https://github.com/zadean/xs_regex for a start? It might have a bit more than is needed since it has to handle XQuery/XPath regex stuff, but could still be a starting point for seeing what needs to be done here?
Describe the bug
The OR-operator `|` behaves unexpectedly when used without an enclosing capturing group. The anchoring operators `^` and `$` are automatically added by `xmerl_regexp` to implement the XML Schema requirement that regular expressions match the entire attribute/text content of a tag. Due to operator precedence, the regexp `a|b` is internally set up as `^a|b$`, which is equivalent to `(^a)|(b$)`. However, what should result is a regexp equivalent to `^(a|b)$`.

To Reproduce
Expected behavior
Lines 2 and 3 from the above example should match, but not lines 4 and 5. Lines 7-10 do match as expected.

Affected versions
Tested in 22, but from skimming over the `xmerl_regexp` code it seems the bug has been there for a long time.
Additional context
Our current workaround is to modify the `xmerl_regexp:setup/1` function so that it adds a capture group around any regexp. This is of course not a proper fix, because it increases the number of capture groups and thus breaks numeric references to capture groups like `\1` when using replacement functions like `gsub`, unless the offset is introduced there as well.

When compiling the regexp without optimization using `xmerl_regexp:parse` directly, line 3 seems to match as expected. The failure of line 3 may thus not be attributable solely to wrong operator precedence; it might be a separate bug in the code that interprets compiled regular expressions (NFA form). However, this code path exhibits the same issues with regard to operator precedence, as the examples from lines 4 and 5 also match here but should not.

Thinking about a solution, please consider that simply adding `^` and `$` around the regexp might not be a good way to implement this after all. Maybe the anchoring should be performed at a later stage, e.g. by adding it directly to the syntax tree and/or by checking by other means that the whole regexp matched.
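The trade-off between the workaround and the alternatives can be sketched outside of `xmerl_regexp`. The following uses Python's `re` module purely as an illustration. The helper names are hypothetical, `xmerl_regexp` has its own syntax and API, and whether it supports a non-capturing group is an assumption not checked here. The sketch shows that an extra capturing group shifts `\1`, whereas grouping without capturing, or checking that the match spans the whole input, leaves the numbering alone.

```python
import re

def anchor_with_capture(pattern: str) -> str:
    # Mirrors the workaround: adds one extra capturing group.
    return "^(" + pattern + ")$"

def anchor_non_capturing(pattern: str) -> str:
    # Groups for precedence only, so existing group numbers are unchanged.
    return "^(?:" + pattern + ")$"

user_pattern = r"(\w+)-x|y"   # the user expects \1 to be the \w+ part

# With the extra capturing group, \1 now refers to the whole match.
shifted = re.sub(anchor_with_capture(user_pattern), r"<\1>", "foo-x")

# With a non-capturing group, \1 still refers to the user's own group.
unshifted = re.sub(anchor_non_capturing(user_pattern), r"<\1>", "foo-x")

# Alternative: leave the pattern untouched and require the match
# to cover the entire input ("checking by other means").
whole = re.fullmatch(user_pattern, "foo-x") is not None
partial = re.fullmatch(user_pattern, "foo-xy") is not None

print(shifted, unshifted, whole, partial)
```

In a hand-written engine, the counterpart of `re.fullmatch` would be to compare the end position of the match against the input length, or to attach the anchors to the parsed syntax tree rather than to the pattern string.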