axiak / pyre2

Python wrapper for RE2
BSD 3-Clause "New" or "Revised" License
295 stars 39 forks source link

pyre2 performance 3 times slower #12

Open andresriancho opened 12 years ago

andresriancho commented 12 years ago

I've been playing around with pyre2 for a while and implemented it into w3af [0]. During this experiment I run the performance tests that come with pyre2 and was amazed with its speed! Then I created a new branch [1] and compared the results of running some of our unit-tests which are very dependent on regular expressions; where I was very surprised to see these results:

With pyre2 [1]:

dz0@dz0-laptop:~/workspace/pyre2$ time nosetests --with-doctest plugins/grep/
.....re2/dfa.cc:447: DFA out of memory: prog size 95874 mem 2412486
................
----------------------------------------------------------------------
Ran 21 tests in 29.475s

OK

real    0m31.980s
user    0m26.150s
sys 0m0.570s

With regular "import re" [2]:

dz0@dz0-laptop:~/workspace/w3af$ time nosetests --with-doctest plugins/grep/
.....................
----------------------------------------------------------------------
Ran 21 tests in 8.093s

OK

real    0m9.971s
user    0m4.550s
sys 0m0.590s

To sum up, pyre2 uses ~30seg to run these tests, while re uses ~10seg.

I've been reading the documentation and it seems that this might be because of unicode strings being passed to pyre2; could you confirm? If so... when are you planning to fix this issue? Is it even fixable? With py3k where everything is a unicode string... this issue will become a huge impediment for future pyre2 users!

Let me know if there is something I can do to help with testing, I would love to have pyre2's performance enhancements in my project's code!

[0] http://w3af.org/ [1] https://w3af.svn.sourceforge.net/svnroot/w3af/branches/pyre2 [2] https://w3af.svn.sourceforge.net/svnroot/w3af/trunk

dupuy commented 11 years ago

The error message in your output: re2/dfa.cc:447: DFA out of memory: prog size 95874 mem 2412486 indicates that your regular expressions were so complex that the generated DFA ran out of memory (during compile or match I can't tell) so pyre2 fell back to using re - so no surprise that it is running slower (on that test, at least). It may also be falling back to re during compile stage on other tests if you are using advanced re features that are not supported by re2. You should at least enable fallback warnings with re.set_fallback_notification(re.FALLBACK_WARNING) when you are doing your comparisons, so you know what's happening, even if you want to fallback quietly in production.

When considering using pyre2 as a replacement for performance reasons you will probably find that it does not outperform re in every case. For very advanced regexps that would cause fallbacks, or when running on short strings with simple regular expressions where the extra overhead of utf-8 conversion outweighs match-time improvements, it would not be surprising for pyre2 to run slower. Keep in mind that while the worst case runtime for pyre2 when given arbitrary (bounded length) text for matching against a complex regular expression is likely to be substantially less than with the alternatives - but for average or typical case runtime, that advantage may be much reduced.

axiak commented 9 years ago

Try running with more memory