Closed rwst closed 4 years ago
EDIT May 10, 2020:
I've relaxed the limits on the size and complexity of the regex patterns since release v2.1.0.
END EDIT
Use option -P
, which has no limits.
The current limit on the regex string length is 16MB to produce a DFA or when the generated DFA becomes way too complex (exponential). These limits are enforced now, but may certainly increase in the future.
But using -P
is not the same, I get no hits, contrary to what grep -vf
gives me. The line:
ugrep -Pvf uniprot-ids tt>ttt1
Option -P
uses Perl matching with PCRE2 like grep does, which produces the same matches (at least for strings). I suppose your version isn't built with PCRE2 or Boost?
Wrong. config.log:
configure:6914: checking whether the Boost::Regex library is available
configure:6937: g++ -std=gnu++11 -c -I/home/ralf/sage/local/include -I/home/ralf/sage/local/include conftest.cpp >&5
configure:6937: $? = 0
configure:6951: result: yes
Tests are running fine:
*** MULTI-THREADED TESTS ***
cd tests; ./verify.sh
ugrep 2.0.5 x86_64-pc-linux-gnu +avx +boost_regex +zlib +bzip2 +lzma
Have libpcre2? no
Have libboost_regex? yes
Have libz? yes
Have libbz2? yes
Have liblzma? yes
...............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
ALL TESTS PASSED
I've started a subproject to improve the RE/flex regex engine, which I expect will increase the permitted length of the regex pattern string up to 4GB. This falls well within your requirements. This work coincides with further improving the speed of DFA construction, especially for constructing DFAs from (lots of) plain strings rather than regex patterns. By comparison, GNU grep uses a different approach for option -F
(or when patterns are "plain strings" to match) to construct a finite state machine, since a set of strings can be very easily and quickly converted into a tree-shaped DFA. We should do something similar.
On the other hand, ugrep is faster to match patterns in the input with the constructed DFAs using my new "predict match" algorithm that I've developed recently and extensively tested for performance compared to hyperscan's vectorization of Bitap. But I have not yet published my work on this effort. Hyperscan's idea is interesting (call it "Bitap buckets", which it is), but actually is slower as they claim depending on the length of patterns matched and their number. It's also not correct in their publications to claim that Bitap cannot handle multiple strings or regex. But all of that is besides the point here.
The details of getting all things perfectly right and with the highest performance possible is what primarily consumed my time. Less urgent usability improvements will follow soon.
Note from the test results that your build does not use PCRE2 but Boost.Regex for option -P
. Boost.Regex has some limitations, but 2GB pattern length should be OK. It should throw an exception when these limits are exceeded. Boost.Regex exceptions are shown by ugrep. Not sure what is happening here.
There was a linking problem with a stray old Boost. Now I indeed get an exception so the limit clearly kicks in. As -x
in grep also uses too much mem I get false negatives from incomplete matches, so I have switched to sorting both files and using comm
.
Using comm
to shorten the pattern file to remove copies of lines makes sense. Also, I recommend to install PCRE2 to use -P
. See the README on how to rebuild ugrep.
Right now, without -P
, the DFA construction cost and size has limits just as grep does (your observation that -x
causes the grep limit to kick in makes therefore sense). Soon we will have improved DFA construction speeds for patterns with lota of plain strings.
Progress on relaxing pattern length limits and speeding up matching with large pattern files in the latest ugrep v2.1.1 release.
This works fine with grep:
Or maybe there is a better way to find all lines of one file that are not in the other?