AnyDSL / thorin2

The Higher ORder INtermediate representation - next gen
https://anydsl.github.io/thorin2/
MIT License
46 stars 9 forks source link

Implement codegen from regex DFA. #254

Closed fodinabor closed 10 months ago

fodinabor commented 10 months ago

For this, LowerRegex first creates a NFA from the Regex, then converts this to an minimized DFA. Finally, this DFA is emitted as Thorin code again (see dfa2matcher).

This allows us to match non-deterministic regex such as (a|b)+a correctly as well, which was a pretty strong downside of the old impl. Note, we still don't support capture groups and the like, and also, we lose support for not although that might be reimplementable.. Also, we lost the showcase of using the annexes as implementation.

The regex->min dfa infra follows the algorithms from https://doi.org/10.1007/978-3-642-17540-4 and hopcroft from https://en.wikipedia.org/wiki/DFA_minimization.

A super quick&dirty benchmark seems favorable :) (note how the old impl was slower than manual and the new one is faster.. :))

old: engine average[us] min[us] max[us] deviation[%]
thorin 1069 1055 1096 3
ctre 4640 4606 4696 1
pcre2 3723 3603 3870 7
manual 843 821 878 6
pcre2_jit 1233 1210 1269 4
std 9692 9607 9819 2
new: engine average[us] min[us] max[us] deviation[%]
ctre 4663 4656 4709 1
pcre2 3769 3645 3848 5
manual 865 849 904 6
thorin 738 725 762 5
std 10227 10171 11174 9
pcre2_jit 1224 1211 1258 3