Genivia / RE-flex

A high-performance C++ regex library and lexical analyzer generator with Unicode support. Extends Flex++ with Unicode support, indent/dedent anchors, lazy quantifiers, functions for lex and syntax error reporting and more. Seamlessly integrates with Bison and other parsers.
https://www.genivia.com/doc/reflex/html
BSD 3-Clause "New" or "Revised" License
504 stars 85 forks source link

Bug: Pattern init fail in release V3.3.1 #176

Closed LankWang17 closed 1 year ago

LankWang17 commented 1 year ago

Such a pattern string will cause Reflex to hung. Pattern pattern( "Head\b|" "\[System\.Net\.WebRequest\]::" , "i");

If changed "Head\b|" to "Head|", Reflex works well. it seems strange.

LankWang17 commented 1 year ago

Sorry, the bug string is: Pattern pattern( "Head\\b|" "\\[System\\.Net\\.WebRequest\\]::" , "i");

The html editor makes "\\" to "\".

genivia-inc commented 1 year ago

Tried the pattern on MacOS with reflex 3.3.1 and I didn't see anything unusual. Not hanging in the pattern construction. Did you use the pattern to match anything? What OS are you using and what compiler?

Note that the default POSIX matcher does not support \b in the middle of a regex pattern. Only at the begin and/or end of a (sub) pattern. Use a Perl matcher to match \b anywhere. It's OK in your case, since it's at the end of the sub pattern, just before the |. (Just to let you know.)

LankWang17 commented 1 year ago

My env: Windows 10, VS2017, Win32-Debug mode. No defined HAVE_AVX2 and HAVE_AVX512BW.

LankWang17 commented 1 year ago

I used reflex project to test. It hung also. Just like this. 1678924644350

genivia-inc commented 1 year ago

This helps to I can test with MSVC++ in x86 mode tomorrow. Strange if there is somehow a difference with other OS. Or perhaps it's a 32 bit build problem.

LankWang17 commented 1 year ago

image

LankWang17 commented 1 year ago

OK. Thank you.

genivia-inc commented 1 year ago

I see the same thing. It is the combination of pattern option "i" with a word boundary marker \b. This exponentially grows the DFA to 2,622,046 nodes and 3,146,356 edges (!) I will further investigate the cause of this. This may take a bit of time to work on. A POSIX DFA can exponentially grow in the worst case for certain regex patterns that are unlikely to be used in practice and that should not happen in this case.

genivia-inc commented 1 year ago

This works OK as a temporary workaround, placing a \b after the second pattern:

Pattern pattern(
"Head\\b|"
"\\[System\\.Net\\.WebRequest\\]::\\b"
, "i");
LankWang17 commented 1 year ago

This works OK as a temporary workaround, placing a \b after the second pattern:

Pattern pattern(
"Head\\b|"
"\\[System\\.Net\\.WebRequest\\]::\\b"
, "i");

OK, I'll try it.

genivia-inc commented 1 year ago

I found a subtle bug in the DFA constructor. The bug is that a lot of unused DFA fragments get generated. This specific regex pattern triggers this bug. There is nothing intrinsically wrong with the pattern. It should not explode the DFA in size. Will fix it and release an update after testing.

The bug got introduced in 3.7.0 as an optimization to speed up DFA construction, but the optimization did not work correctly with option "I" for case-insensitive matching in combination with other patterns.

genivia-inc commented 1 year ago

Fixed in release 3.3.2.

genivia-inc commented 1 year ago

Thanks for reporting this performance issue. It is very useful to receive feedback. We test the software a lot. Nevertheless, it can happen that something does not work perfectly.