Unexpected behaviour of alternation operator `|` in xmerl_regexp

padde commented 3 years ago

Describe the bug The OR-operator | behaves unexpectedly when used without an enclosing capturing group.

The anchoring operators ^ and $ are automatically added by xmerl_regexp to implement the XML schema requirement that regular expressions match the entire attribute/text content of a tag. Due to operator precedence the regexp a|b is internally setup as ^a|b$, which is equivalent to (^a)|(b$). However what should result is a regexp equivalent to ^(a|b)$.

To Reproduce

1> {ok, RE1} = xmerl_regexp:setup("a|b").
{ok,{comp_regexp,{{{c_state,1,none,none,none,none,none,
                            false,[]},
                   {c_state,2,{5},97,1,99,1,false,[{bos,3}]},
                   {c_state,3,{4},96,1,98,1,false,[]},
                   {c_state,4,none,none,none,none,none,true,[]},
                   {c_state,5,none,none,none,none,none,false,[{eos,4}]}},
                  2,1}}}

2> xmerl_regexp:first_match("a", RE1).
{match,1,1}
3> xmerl_regexp:first_match("b", RE1).
nomatch
4> xmerl_regexp:first_match("a-", RE1).
{match,1,1}
5> xmerl_regexp:first_match("-b", RE1).
{match,2,1}

6> {ok, RE2} = xmerl_regexp:setup("(a|b)").
{ok,{comp_regexp,{{{c_state,1,none,none,none,none,none,
                            false,[]},
                   {c_state,2,none,none,none,none,none,false,[{bos,5}]},
                   {c_state,3,none,none,none,none,none,true,[]},
                   {c_state,4,none,none,none,none,none,false,[{eos,3}]},
                   {c_state,5,{4,4},96,1,99,1,false,[]}},
                  2,1}}}

7> xmerl_regexp:first_match("a", RE2).
{match,1,1}
8> xmerl_regexp:first_match("b", RE2).
{match,1,1}
9> xmerl_regexp:first_match("a-", RE2).
nomatch
10> xmerl_regexp:first_match("-b", RE2).
nomatch

Expected behavior Lines 2 and 3 from the above example should match, but not lines 4 and 5. Lines 7-10 do match as expected.

Affected versions Tested in 22, but from skimming over the xmerl_regexp code it seems the bug has been there for a long time.

Additional context Our current workaround is to modify the xmerl_regexp:setup/1 function so that it adds a capture group around any regexp. This is of course not a proper fix, because it will increase the number of capture groups and thus break numeric references to capture groups like \1 when using the replacement functions like gsub, or the offset would have to be introduced there as well.

--- a/src/xmerl_regexp.erl
+++ b/src/xmerl_regexp.erl
@@ -46,7 +46,7 @@
 -include("xmerl_internal.hrl").

 setup(RE0) ->
-    RE = setup(RE0, [$^]),
+    RE = setup(RE0, [$(, $^]),
     Pid = spawn(?MODULE,compile_proc,[self(),RE]),
     receive
        {ok,Result} ->
@@ -184,7 +184,7 @@ setup([$\\,$C|S],Acc) -> setup(S,"]9-0z-aZ-A_:."++[183]++"-^[" ++Acc);
 %% fixme setup(["\\p{Cn}" ++ S) -> {{char_class,"\s\t\n\r"},S};
 %% fixme setup(["\\P{Cn}" ++ S) -> {{comp_class,"\s\t\n\r"},S};
 setup([A|S], Acc) -> setup(S, [A|Acc]);
-setup([],Acc) ->  reverse([$$|Acc]).
+setup([],Acc) ->  reverse([$$, $)|Acc]).

 %% sh_to_awk(ShellRegExp)
 %%  Convert a sh style regexp into a full AWK one. The main difficulty is

When compiling the regexp without optimization using xmerl_regexp:parse directly, line 3 seems to match as expected. The failure of line 3 may thus not be attributed solely to wrong operator precedence - it might be a separate bug in the code that interprets compiled regular expressions (NFA form). However, this code path exhibits the same issues with regard to operator precedence, as the examples from lines 4 and 5 also match here but should not.

Thinking about a solution, please consider that simply adding ^ and $ around the regexp might not be a good way to implement this after all. Maybe the anchoring should be performed in a later stage, e.g. adding it directly to the syntax tree and/or checking that the whole regexp matched by other means.

michalmuskala commented 3 years ago

I had to deal with a similar issue very recently. FWIW this is a behaviour in a lot of regex engines - at least in PCRE2, PCRE, JavaScript, Go (RE2), and Rust's regex

padde commented 3 years ago

Well the engine works just fine, but the regexp is set up wrong AFAICS. The lib itself is adding the surrounding ^$ but to implement implicit anchoring. However the implementation is too naive and disregards the fact that there are lower precedence operators. As can be seen with my prefix/suffix examples the anchoring does not even work correctly sometimes, because the library-added anchors are gobbled up. Changing the precedence might not be the fix after all. Maybe the implicit anchoring needs to be implemented differently altogether.

michalmuskala commented 3 years ago

Ah. Yes, I agree. The behaviour of the implicitly added anchors in this case is indeed incorrect

lthor commented 3 years ago

Hi, I'm not able to look at this problem just now due to other work, so I put it in "Help Wanted". BR Lars

zadean commented 3 years ago

@padde, I do happen to know a bit about this problem. And as much as I'd love to help on the "help wanted" tag, I just don't have the time right now. You (or anyone else) could have a look at https://github.com/zadean/xs_regex for a start? It might have a bit more than is needed since it has to handle XQuery/XPath regex stuff, but could still be a starting point for seeing what needs to be done here?

erlang / otp

Unexpected behaviour of alternation operator `|` in xmerl_regexp #4591