aiidateam / qe-tools

A set of useful tools for Quantum ESPRESSO
MIT License
29 stars 14 forks source link

Exponentially-slow regex #15

Open giovannipizzi opened 6 years ago

giovannipizzi commented 6 years ago

As discussed in e.g. these links [1,2] there are some regex that, with the default python (or perl, ...) implementation, can become exponentially slow with some inputs.

To reproduce, one can e.g. run the following, where cell_parameters_block_re is the one defined in qeinputparser.py:

cell_string = """CELL_PARAMETERS angstrom
1.000000000 0.00000000 0.000000000
0.000000000 1.00000000 0.000000000
0.000000000 0.00000000 1.000000000"""
print cell_parameters_block_re.match(cell_string) # does not end after >60seconds
print cell_parameters_block_re.match(cell_string + "\n") # ends instantaneously

Already using a Thompson NFA implementation improves the situation: e.g. I tried this from this BitBucket repo with pip install regex and did a import regex as re as a drop-in replacement for re when generating cell_parameters_block_re can alleviate the problem, bringing the time down to ~3sec. This can be done, adding a dependency, still it's not perfect and probably our regex can be improved.

At the moment, I suggest to always append a newline at the end of the input for safety, and if we see that this is not enough, we open this issue again and replace re with regex (and/or improve the regex).

[1] https://stackoverflow.com/questions/25982466/why-does-this-take-so-long-to-match-is-it-a-bug [2] https://swtch.com/~rsc/regexp/regexp1.html

Mentioning @lekah

giovannipizzi commented 6 years ago

To test (fast version, with newline):

# No newline in this string
cell_string = """CELL_PARAMETERS angstrom
1.000000000 0.00000000 0.000000000
0.000000000 1.00000000 0.000000000
0.000000000 0.00000000 1.000000000"""

from qe_tools.parsers.qeinputparser import parse_cell_parameters
print(parse_cell_parameters(cell_string + '\n'))
giovannipizzi commented 6 years ago

At the moment, we added a 'workaround' that solves the problem in a known case. Please report here any further occurrences of the problem - if we have no news in the next few months we can close this issue.