Open intgr opened 3 years ago
A few example rules that fail:
\b(?<![A-Z][a-z]{0,99}\s{1,9})soley\b
\b([oO])uts(ed|ing(?<!could\s+outsing))\b
\b[dD]omini?ci?an(s)?\b(?<!Dominicans?)
\b([iI])dae(s)?\b(?<!\b\p{Lu}\.\s+idae)(?!'')
(?<=\b[lL]a\s{1,9})\b([rR])ebelion\b
(?=ac)(?<=\b(?:[A-Z][a-z]*|[a-z]+))act?iy\b
(?=fuly)(?<=\b(?:[A-Z][a-z]*|[a-z]+))fuly\b
(?=ivly)(?<=\b(?:[A-Z][a-z]*|[a-z]+))ivly\b
\b([pP])retect([a-z]*)\b(?<!tect(?:al|o|um))
\b(?<=\s)(?<!\|\s{0,9})([gG]pu)\b(?![^\s\.]*\.\w)
Hey! Hm that's an interesting use case :).
So the current implementation matches a look-behind by stepping back an exact number of characters in the string and then matching the look-behind expression forward from there. For that to work, the expression must have a constant size, so it cannot contain `?`, `*`, `+`, or even `{n,m}`. Note that e.g. oniguruma has the same limitation.
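To illustrate the step-back approach described above, here is a minimal std-only sketch. It is not fancy-regex's actual code: `lookbehind_fixed`, its `width` parameter, and the `inner` closure are hypothetical names made up for this example, with `inner` standing in for the compiled look-behind expression.

```rust
/// Toy model of a constant-size look-behind (hypothetical helper, not the
/// real fancy-regex internals). `pos` must be a char boundary in `haystack`.
fn lookbehind_fixed(
    haystack: &str,
    pos: usize,
    width: usize,
    inner: impl Fn(&str) -> bool,
) -> bool {
    // Step back exactly `width` characters from `pos`.
    let mut start = pos;
    for _ in 0..width {
        match haystack[..start].chars().next_back() {
            Some(c) => start -= c.len_utf8(),
            // Fewer than `width` characters precede `pos`: no match.
            None => return false,
        }
    }
    // Match the look-behind expression forward over the fixed window.
    inner(&haystack[start..pos])
}
```

With a 3-character window, `lookbehind_fixed("la rebelion", 3, 3, |s| s == "la ")` succeeds, but the same call on `"la   rebelion"` at position 5 sees only the three spaces. That is why a rule like `(?<=\b[lL]a\s{1,9})` has no single `width` that works.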
There is a good explanation of what various engines support here: https://www.regular-expressions.info/lookaround.html#limitbehind
And the Wikipedia page also says that depending on which tool you use some rules might be skipped: https://en.wikipedia.org/wiki/Wikipedia:AutoWikiBrowser/Typos#Usage
The most generic implementation that would allow all kinds of expressions is to actually match the look-behind backwards (in reverse). It's not planned to implement that at the moment.
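As a rough sketch of what matching in reverse means, the function below hand-codes the look-behind `\b[lL]a\s{1,9}` from the example rules by walking backwards from the current position. This is only an assumption-laden illustration: `lookbehind_la_reverse` is a hypothetical name, `\b` is approximated as "alphanumeric or underscore", and a real engine would compile an arbitrary reversed expression rather than hard-code one.

```rust
/// Hand-written reverse match for the look-behind `\b[lL]a\s{1,9}`
/// (hypothetical example). `pos` must be a char boundary in `haystack`.
fn lookbehind_la_reverse(haystack: &str, pos: usize) -> bool {
    // Iterate over the characters before `pos`, last one first.
    let mut rev = haystack[..pos].chars().rev().peekable();

    // `\s{1,9}` read backwards: consume between 1 and 9 whitespace chars.
    let mut spaces = 0;
    while spaces < 9 && rev.peek().map_or(false, |c| c.is_whitespace()) {
        rev.next();
        spaces += 1;
    }
    if spaces == 0 {
        return false;
    }

    // `[lL]a` read backwards: first 'a', then 'l' or 'L'.
    if rev.next() != Some('a') {
        return false;
    }
    match rev.next() {
        Some('l') | Some('L') => {}
        _ => return false,
    }

    // `\b`: the preceding char is the start of input or a non-word char.
    match rev.next() {
        None => true,
        Some(c) => !(c.is_alphanumeric() || c == '_'),
    }
}
```

Because the scan starts at the match position and moves left, the number of whitespace characters does not need to be known in advance: both `"la rebelion"` (position 3) and `"La   rebelion"` (position 5) match, while `"hola rebelion"` (position 5) fails the word-boundary check.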
The new `regex-automata` crate exposes many lower-level internals of the regex crate. Related blog article: https://blog.burntsushi.net/regex-internals/
It seems they expose the reverse matching capability in some engines as well. Possibly this can be used to add support for simpler look-behind assertions without the constant length limitation?
I've been playing around with Wikipedia's RegExTypoFix project: https://en.wikipedia.org/wiki/Wikipedia:AutoWikiBrowser/Typos
While most of the rules compile with fancy-regex, a few hundred fail with the error "Look-behind assertion without constant size". You can try it out at https://github.com/intgr/topy-rs
Not sure how difficult the implementation would be.