Open eldipa opened 6 years ago
Interesting reading: https://swtch.com/~rsc/regexp/regexp1.html
I couldn't find a 3rd party lib for implementing that:
re
modulere
supportsCan you offer https://xysun.github.io/posts/regex-parsing-thompsons-algorithm.html to join this discussion? 😄
It's good to see that the spirit of Thompson is still alive!
Yes, indeed, if a NFA is needed I will ask him to give me a hand on this.
For now, it is not super urgent: byexample
only uses some variants of the .*
regex and with this, I implemented a quite simple linear algorithm which is fair enough.
But if in the future we want to add more complex regexs like \d{2,5}
at least now I know where I can start :D
Nice!
As most of the languages, interpreters, libs, and most of anything out there, Python uses a classical engine that can have an exponential performance in the worst case (known as 'pathological cases').
A non-deterministic finite automata (NFA) can be used to represent any regular expression (without backrefereces). The use of this to implement a regular expression engine is known as the Thompson NFA algorithm.
It is used in the well known grep and awk implementations and it has a linear performance.
The idea is to replace the implementation of
re.compile
and the subsequent calls to there
module and objects by a NFA implementation. However, as said before, the NFA will be possible only if the regex has no backreferences. Byexample uses the backreferences when a given captured tag with label foo appears twice or more times in the expected string. In those cases the byexample must fall back to there
module.The Thompson NFA must not introduce any new mandatory dependency but it can require an optional one.