Forever-Young / mrab-regex-hg

Automatically exported from code.google.com/p/mrab-regex-hg
0 stars 0 forks source link

Certain regexes extremely slow compared to re module #89

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
For certain regexes, on certain datasets, running that regex appears to take an 
amount of time that increases exponentially with the length of input, in 
comparison to the 're' module, which completes the same task in much more 
constant time.

An example of a regex causing this is:
(?:\b|3)([0-9]{3})

Some example text that causes slow searching is a single line of text that 
contains repeating '11 ', e.g.:
11 11 11 11 11

This occurs with a lot of other data and variations on the regex, this is just 
a simple case that illustrates it.

The time taken to perform this regex search based on the number of characters 
in the text is show below:

regex module
10000 chars: 0.17s
20000 chars: 0.33s
30000 chars: 0.60s
40000 chars: 0.96s
50000 chars: 1.44s
100000 chars: 5.40s
200000 chars: 21.10s
300000 chars: 47.30s

re module
10000 chars: 0.027s
20000 chars: 0.027s
30000 chars: 0.027s
40000 chars: 0.028s
50000 chars: 0.030s
100000 chars: 0.034s
200000 chars: 0.041s
300000 chars: 0.051s

This was tested using regex 2013-02-23, python 2.6, running on CentOS 6.3.

Some sample code which demonstrates this:

import regex as re

compiled_re = re.compile(r'(?:\b|3)([0-9]{3})')
text = '11 ' * 33333
compiled_re.search(text)

Original issue reported on code.google.com by sui...@gmail.com on 11 Mar 2013 at 12:11

GoogleCodeExporter commented 9 years ago
Fixed in regex 2013-03-11.

It was an optimisation that turned out to cause a performance hit.

Original comment by re...@mrabarnett.plus.com on 11 Mar 2013 at 2:41