For certain regexes, on certain datasets, running that regex appears to take an
amount of time that increases exponentially with the length of input, in
comparison to the 're' module, which completes the same task in much more
constant time.
An example of a regex causing this is:
(?:\b|3)([0-9]{3})
Some example text that causes slow searching is a single line of text that
contains repeating '11 ', e.g.:
11 11 11 11 11
This occurs with a lot of other data and variations on the regex, this is just
a simple case that illustrates it.
The time taken to perform this regex search based on the number of characters
in the text is show below:
regex module
10000 chars: 0.17s
20000 chars: 0.33s
30000 chars: 0.60s
40000 chars: 0.96s
50000 chars: 1.44s
100000 chars: 5.40s
200000 chars: 21.10s
300000 chars: 47.30s
re module
10000 chars: 0.027s
20000 chars: 0.027s
30000 chars: 0.027s
40000 chars: 0.028s
50000 chars: 0.030s
100000 chars: 0.034s
200000 chars: 0.041s
300000 chars: 0.051s
This was tested using regex 2013-02-23, python 2.6, running on CentOS 6.3.
Some sample code which demonstrates this:
import regex as re
compiled_re = re.compile(r'(?:\b|3)([0-9]{3})')
text = '11 ' * 33333
compiled_re.search(text)
Original issue reported on code.google.com by sui...@gmail.com on 11 Mar 2013 at 12:11
Original issue reported on code.google.com by
sui...@gmail.com
on 11 Mar 2013 at 12:11