regex length limit exceeded

rwst commented 4 years ago

This works fine with grep:

> wc uniprot-ids tt
 181256818  181256818 1875485054 uniprot-ids
    623606     623606    4736122 tt
 181880424  181880424 1880221176 total
> ugrep -vf uniprot-ids tt>ttt
ugrep: error: error in regex at position 16777215
|A0A2A2FGH5|A0A2A2FIR1|A0A2A2FKU0|A0A2A2FKV7|A0A2H9LNT8|B6YTZ0|A2SSR6|A2STZ5|B6
               \___exceeds length limit

Or maybe there is a better way to find all lines of one file that are not in the other?

genivia-inc commented 4 years ago

EDIT May 10, 2020:

I've relaxed the limits on the size and complexity of the regex patterns since release v2.1.0.

END EDIT

Use option -P, which has no limits.

The current limit on the regex string length is 16MB to produce a DFA or when the generated DFA becomes way too complex (exponential). These limits are enforced now, but may certainly increase in the future.

rwst commented 4 years ago

But using -P is not the same, I get no hits, contrary to what grep -vf gives me. The line:

ugrep -Pvf uniprot-ids tt>ttt1

genivia-inc commented 4 years ago

Option -P uses Perl matching with PCRE2 like grep does, which produces the same matches (at least for strings). I suppose your version isn't built with PCRE2 or Boost?

rwst commented 4 years ago

Wrong. config.log:

configure:6914: checking whether the Boost::Regex library is available
configure:6937: g++ -std=gnu++11 -c  -I/home/ralf/sage/local/include -I/home/ralf/sage/local/include  conftest.cpp >&5
configure:6937: $? = 0
configure:6951: result: yes

rwst commented 4 years ago

Tests are running fine:

*** MULTI-THREADED TESTS ***

cd tests; ./verify.sh
ugrep 2.0.5 x86_64-pc-linux-gnu +avx +boost_regex +zlib +bzip2 +lzma
Have libpcre2? no
Have libboost_regex? yes
Have libz? yes
Have libbz2? yes
Have liblzma? yes
...............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
ALL TESTS PASSED

genivia-inc commented 4 years ago

I've started a subproject to improve the RE/flex regex engine, which I expect will increase the permitted length of the regex pattern string up to 4GB. This falls well within your requirements. This work coincides with further improving the speed of DFA construction, especially for constructing DFAs from (lots of) plain strings rather than regex patterns. By comparison, GNU grep uses a different approach for option -F (or when patterns are "plain strings" to match) to construct a finite state machine, since a set of strings can be very easily and quickly converted into a tree-shaped DFA. We should do something similar.

On the other hand, ugrep is faster to match patterns in the input with the constructed DFAs using my new "predict match" algorithm that I've developed recently and extensively tested for performance compared to hyperscan's vectorization of Bitap. But I have not yet published my work on this effort. Hyperscan's idea is interesting (call it "Bitap buckets", which it is), but actually is slower as they claim depending on the length of patterns matched and their number. It's also not correct in their publications to claim that Bitap cannot handle multiple strings or regex. But all of that is besides the point here.

The details of getting all things perfectly right and with the highest performance possible is what primarily consumed my time. Less urgent usability improvements will follow soon.

genivia-inc commented 4 years ago

Note from the test results that your build does not use PCRE2 but Boost.Regex for option -P. Boost.Regex has some limitations, but 2GB pattern length should be OK. It should throw an exception when these limits are exceeded. Boost.Regex exceptions are shown by ugrep. Not sure what is happening here.

rwst commented 4 years ago

There was a linking problem with a stray old Boost. Now I indeed get an exception so the limit clearly kicks in. As -x in grep also uses too much mem I get false negatives from incomplete matches, so I have switched to sorting both files and using comm.

genivia-inc commented 4 years ago

Using comm to shorten the pattern file to remove copies of lines makes sense. Also, I recommend to install PCRE2 to use -P. See the README on how to rebuild ugrep.

Right now, without -P, the DFA construction cost and size has limits just as grep does (your observation that -x causes the grep limit to kick in makes therefore sense). Soon we will have improved DFA construction speeds for patterns with lota of plain strings.

genivia-inc commented 4 years ago

Progress on relaxing pattern length limits and speeding up matching with large pattern files in the latest ugrep v2.1.1 release.

Genivia / ugrep

regex length limit exceeded #26