byexamples / byexample

Write snippets of code in C++, Python, Ruby, and others as documentation and execute them as regression tests.
https://byexamples.github.io/byexample/
GNU General Public License v3.0
54 stars 8 forks source link

Use a Thompson NFA instead of the Python default re (regex) engine if it is possible #16

Open eldipa opened 6 years ago

eldipa commented 6 years ago

As most of the languages, interpreters, libs, and most of anything out there, Python uses a classical engine that can have an exponential performance in the worst case (known as 'pathological cases').

A non-deterministic finite automata (NFA) can be used to represent any regular expression (without backrefereces). The use of this to implement a regular expression engine is known as the Thompson NFA algorithm.

It is used in the well known grep and awk implementations and it has a linear performance.

The idea is to replace the implementation of re.compile and the subsequent calls to the re module and objects by a NFA implementation. However, as said before, the NFA will be possible only if the regex has no backreferences. Byexample uses the backreferences when a given captured tag with label foo appears twice or more times in the expected string. In those cases the byexample must fall back to the re module.

The Thompson NFA must not introduce any new mandatory dependency but it can require an optional one.

eldipa commented 6 years ago

Interesting reading: https://swtch.com/~rsc/regexp/regexp1.html

eldipa commented 6 years ago

I couldn't find a 3rd party lib for implementing that:

jilljenn commented 6 years ago

Can you offer https://xysun.github.io/posts/regex-parsing-thompsons-algorithm.html to join this discussion? 😄

eldipa commented 6 years ago

It's good to see that the spirit of Thompson is still alive!

Yes, indeed, if a NFA is needed I will ask him to give me a hand on this.

For now, it is not super urgent: byexample only uses some variants of the .* regex and with this, I implemented a quite simple linear algorithm which is fair enough.

But if in the future we want to add more complex regexs like \d{2,5} at least now I know where I can start :D

Nice!