fancy-regex / fancy-regex

Rust library for regular expressions using "fancy" features like look-around and backreferences
MIT License
424 stars 37 forks source link

Support look-behind assertions without constant size #74

Open intgr opened 3 years ago

intgr commented 3 years ago

I've been playing around with Wikipedia's RegExTypoFix project: https://en.wikipedia.org/wiki/Wikipedia:AutoWikiBrowser/Typos

While most of the rules compile with fancy-regex, there are a few hundred that fail with error "Look-behind assertion without constant size", you can try it out at https://github.com/intgr/topy-rs

Not sure how difficult the implementation would be.

intgr commented 3 years ago

A few example rules that fail:

robinst commented 3 years ago

Hey! Hm that's an interesting use case :).

So the current implementation of how to match look-behinds is to go back the exact number of characters in the string and then match the expression in the look-behind from there. For that to work, the expression must be a constant size (so not contain ?, *, + or even {n,m}. Note e.g. oniguruma has the same limitation.

There is a good explanation of what various engines support here: https://www.regular-expressions.info/lookaround.html#limitbehind

And the Wikipedia page also says that depending on which tool you use some rules might be skipped: https://en.wikipedia.org/wiki/Wikipedia:AutoWikiBrowser/Typos#Usage

The most generic implementation that would allow all kinds of expressions is to actually match the look-behind backwards (in reverse). It's not planned to implement that at the moment.

intgr commented 1 year ago

The new regex-automata crate exposes many lower-level internals of the regex crate. Related blog article: https://blog.burntsushi.net/regex-internals/

It seems they expose the reverse matching capability in some engines as well. Possibly this can be used to add support for simpler look-behind assertions without the constant length limitation?