RunDevelopment / refa

A library for finite automata and regular expressions in the context of JS RegExp
MIT License
20 stars 3 forks source link

Anchor (^$) assertion support #11

Open albert-sun opened 3 years ago

albert-sun commented 3 years ago

I'm using your package as part of research and want to collaborate on adding anchor assertion support. From a group of ~835k regexes, approximately 53% were supported for conversion without anchor support and approximately 93% were supported for conversion when excluding those with anchors. Any thoughts on how I would go about implementing anchor (^$) support?

Cheers and great work on the package.

RunDevelopment commented 3 years ago

I'm currently looking into implementing support for lookarounds via RE AST transformations (#9). When finished, refa will have full support for all JS assertions (^$\b\B) as well as limited support for general lookarounds.

The main problem is that there is no NFA that can express the regex /^foo$/, so I'll use a little trick. The idea is that I reserve one special character to represent the "outside" of the string - let's call this character %. With this trick, I can rewrite the regex: /^foo$/ -> /(?<=%)foo(?=%)/. Since we might end up with lookarounds that go beyond the consumed characters of the string, the transformation will have two option: Either it will make those lookarounds a part of the pattern (/(?<=%)foo(?=%)/ -> /%foo%/) or it will remove them (/(?<=%)foo(?=%)/ -> /foo/).

Since these two transformation steps are independent of each other, you will also be able to just remove all assertions that go beyond. That alone might be enough for your use case.


Also, may I ask what regex set you use? I currently test refa using Prism's 2.5k regexes but it would be nice to have more to test with.