Forever-Young / mrab-regex-hg

Automatically exported from code.google.com/p/mrab-regex-hg
0 stars 0 forks source link

regex.match() hangs #88

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
The following call doesn't terminate after 45 minutes:

>>> regex.match(r'.*a.*ba.*aa', "ababba")

The same for some other cases with ".*" subpattern repeated several times, when 
the regex does not match the string.

The pattern and string used above are very short, so I suspect this is hanging 
rather than long computation. 

By the way, what is the computational complexity of regex.match() in the worst 
case? Is it polynomial or exponential? how it compares to standard 're'? Thx.

What version of the product are you using? On what operating system?

0.1.20130120
Linux, Python 2.7.2

Original issue reported on code.google.com by mwojn...@gmail.com on 24 Jan 2013 at 12:20

GoogleCodeExporter commented 9 years ago
I think it's probably catastrophic backtracking. I'll need to discover why the 
existing preventative measures aren't managing to stop it in this case (it's 
not possible to stop it altogether).

I believe that catastrophic backtracking is exponential in behaviour.

Original comment by re...@mrabarnett.plus.com on 24 Jan 2013 at 2:27

GoogleCodeExporter commented 9 years ago
In standard 're', the same call does terminate and takes only 4 us - so if this 
is an issue of long computation, it means that 'regex' module is in some cases 
less efficient than 're', right? which in turn means that 'regex' won't work as 
a general replacement for 're', because some existing code may break.

I'm not an expert in parsers implementation, but this problem of backtracking 
reminds me of a similar issue that occurs in PEG parsers 
(http://en.wikipedia.org/wiki/Parsing_expression_grammar), where basic 
implementation of backtracking has exponential worst-case complexity, but when 
coupled with a technique of 'memoization' the complexity stays always *linear* 
even for difficult cases, only at the cost of higher memory consumption (but 
still at most linear in the total size of input). See:
http://en.wikipedia.org/wiki/Memoization
http://en.wikipedia.org/wiki/Parsing_expression_grammar#Implementing_parsers_fro
m_parsing_expression_grammars

I think PEGs are more general than regex'es, so the solution should be 
applicable to regexes, too, although I can't say anything more specific.

Original comment by mwojn...@gmail.com on 24 Jan 2013 at 11:30

GoogleCodeExporter commented 9 years ago
Sorry for doubting you, it _is_ a bug. :-(

It'll be fixed in the next release.

Original comment by re...@mrabarnett.plus.com on 24 Jan 2013 at 2:03

GoogleCodeExporter commented 9 years ago
Fixed in regex 0.1.20130124.

Original comment by re...@mrabarnett.plus.com on 24 Jan 2013 at 8:31

GoogleCodeExporter commented 9 years ago
I've updated the package, but the bug is still present. In ipython:

>>> import regex
>>> regex
<module 'regex' from 
'/usr/local/lib/python2.7/dist-packages/regex-0.1.20130124-py2.7-linux-i686.egg/
regex.pyc'>

(version 0.1.20130124 - newest one)

>>> regex.match(r'.*a.*ba.*aa', "ababba")
-- doesn't terminate after 2 hours

Original comment by mwojn...@gmail.com on 25 Jan 2013 at 5:07

GoogleCodeExporter commented 9 years ago
It got unfixed during the other work. I'm surprised you let it run for 2 hours!

Fixed in regex 0.1.20130125.

Original comment by re...@mrabarnett.plus.com on 25 Jan 2013 at 8:52

GoogleCodeExporter commented 9 years ago
Thanks, works for me now.

Original comment by mwojn...@gmail.com on 26 Jan 2013 at 12:10