Closed slevithan closed 2 days ago
Hey, thanks for bringing this up, I am totally up to collaborate! I also found regex
early today on Twitter.
regex
seems to be a lot more thoughtful and robust in many ways, while my lib is mostly trying to get Shiki (TextMate grammars) working with JavaScript.
I tried to replace oniguruma-to-js
with regex
in Shiki, but it seems not producing expected result. Maybe I used it wrong?
For oniguruma-to-js
, I wanted it to be fast and lightweight as much as possible, as my goal is to correctly transpile the regexes in Shiki's TextMate grammars in order to replace Onigumra's WASM entirely.
Cool! I'm also totally up to collaborate if/where it makes sense! regex
is totally aligned on the goals of being fast and lightweight.
I'll say up front though that it's possible regex
isn't the right fit since its goal is modernizing JS regexes (although I'd likely be happy to adopt features that don't interfere with this goal while making it more broadly useful, in libraries like this one). One way it achieves this goal is by being a strict superset of JS regex syntax, with support for all ES2025 regex features. But Oniguruma isn't a strict superset of JS (lots of big and small differences). So regex
could be very useful for quickly improving beyond the current version of oniguruma-to-js
, but it might end up not being the right foundation if the goal is to strive for full Oniguruma compatibility (at least where that matters to get to 100% support for Shiki's TextMate grammars, since this goal is actually impossible if outputting native JS regexes).
I tried to replace
oniguruma-to-js
withregex
in Shiki, but it seems not producing expected result. Maybe I used it wrong?
You used it right! You're getting a bunch of "Invalid escape" and "Invalid character in character class" errors because regex
uses JS's flag v
syntax, and flag v
has particular/strict escaping rules within character classes. regex
always uses either flag v
or u
(depending on whether your environment natively supports flag v
), but it uses flag v
escaping rules in both cases, since v
is a best practice for all new JS regexes, so by using v
syntax, regex
simplifies down to the one most-modern parsing rules for regex syntax, rather than maintaining native JS's 4 (!) different parsing rules depending on flags and features used.
Extending from the setup in your linked diff, you can change regex
to not use flag v
's escaping rules like this:
const re = regex({
flags: 'dg',
subclass: true,
unicodeSetsPlugin: null,
disable: {
x: true,
n: true,
v: true,
},
})({ raw: [p] })
This adds disable: {v: true}
to switch to flag u
, and prevents applying flag v
's escaping rules via unicodeSetsPlugin: null
. These changes make it work for some but not all of the currently failing regexes in the diff you linked to. Some still throw because they escape characters that flag u
(with its strict errors for unreserved escapes) doesn't allow you to escape. Allowing those would require using "Unicode-unaware" mode (no flag u
or v
), which is not great because it would drop support for \p{...}
among other regex features and safeguards. BUT, support for needless escapes can easily be modded in with a simple plugin, like this:
regex({
flags: 'dg',
subclass: true,
plugins: [str => str.replace(/(\\\\)|\\(['"`#])/g, '$1$2')],
unicodeSetsPlugin: null,
disable: {
x: true,
n: true,
v: true,
},
})({ raw: [p] })
The plugin used above un-escapes the characters '
, "
, `
, and #
. You could easily add more, but these were the needlessly escaped characters I noticed in the diff you linked to that would throw with flag u
. It handles them whether they are inside or outside of character classes, and it correctly accounts for escaped backslashes that might precede these characters. I tested this with a bunch (but not all) of the regexes in your linked diff, and it worked with all those I tested it with.
Edit: I've built on the above code in a comment on the relevant Shiki PR.
Aside: Note that oniguruma-to-js
's current strategy of neutering possessive quantifiers and atomic groups (by turning them into greedy quantifiers and noncapturing groups) is quite dangerous. In addition to how in some cases it might change which strings are matched (which you already acknowledge within comments), in other cases it might pass tests but lead to ReDoS with certain inputs, which could hang people's servers, etc. In fact, this is quite likely, since protecting against this outcome with certain inputs is probably why the regex author used an atomic group or possessive quantifier in the first place.
I would therefore advise against this. IMO it's much better to just not support such regexes than to make them dangerous. But using regex
allows you to support them since it has perfect/complete support for possessive quantifiers and atomic groups, which is nontrivial to do correctly with native JS regexes (since to emulate them you have to add capturing groups along with other tricks, so then you need to also rewrite following numbered backreferences in the pattern and use a RegExp
subclass to not let these "emulation groups" alter backreference values in replacement strings/functions, etc.).
Yeah, I agree. Would regex
support a processor version/utility (return source string instead of RegExp
instance) where I can then do further process and contract the RegExp on my own (as you mention in https://github.com/shikijs/shiki/pull/762#issuecomment-2323038293 about (?i:
etc, so I can workaround it)
Would
regex
support a processor version/utility (return source string instead ofRegExp
instance) where I can then do further process and contract the RegExp on my own
That's a great idea, and yes. Supporting an option to return a string instead of a regex is totally doable. (Tentative option name returnString
or processOnly
, but I'd love better suggestions.)
I'm traveling at the moment but will try to add such a feature in the next few days. There are a few things I need to review as part of adding this to ensure it works well.
Question: Do TextMate grammars ever do search and replace with backreferences in a user-provided replacement string? If not (and regexes are only used to search), this would be much easier. If yes, regex
's automatic rewriting of numbered backreferences could create an issue (solvable but with more complexity).
Do you aim to have a single exported API for regex
? It's a great way to limit the public interface for sure. But from my point of view, I care more about tree shaking and composability. Exposing features via options through a single entry point makes everything non-treeshakable, meaning that even if a user doesn't need all the features it provides, they still need to pay the cost of the full package - do we consider expose some functions, like function convertRegex(input: string): string
?
It's it's a bit awkward to use regex({})({ raw: [pattern] })
then a function version createRegex(pattern, {})
- as in our cases, we don't need the string template functionality either.
Question: Do TextMate grammars ever do search and replace with backreferences
I am not very sure if they would have backreferences or not. But for Shiki it only needs regex.match()
and get the indices of each group.
Thanks a lot for all the information and the willingness to collaborate, even though my usage can be a bit tricky. Enjoy your travel! We can talk more about it after that.
Good feedback. I've just published a new version of regex
that includes the new function rewrite(str, options)
. It returns an object with two properties (expression
, flags
), and if you passed those properties to RegExp
you'd get the same result as using the regex
tag. rewrite
doesn't give you context-aware interpolation and isn't a raw string template, but your use case doesn't need those things anyway. It supports all of regex
's options except subclass
, so you could even use plugins with it if you wanted.
I'd recommend something like this:
import { rewrite } from 'regex'
str = rewrite(str, {
flags: 'dg',
unicodeSetsPlugin: null,
disable: {
n: true,
v: true,
x: true,
},
}).expression
This leaves the pattern alone apart from transforming atomic groups, possessive quantifiers, subroutines, and subroutine definition groups.
Differences from Oniguruma (all are edge cases):
?+
, *+
, and ++
are possessive, {N}+
is not (same for {N,}+
and {N,N}+
). 😢 Instead, the +
confusingly quantifies the {N}
quantifier, so there are two greedy quantifiers in a row ({2}+
means repeat 2, 4, 6, 8, ... times). This is a quirk of Oniguruma that doesn't follow the behavior and prior art of possessive {N}+
in Perl, PCRE, Java, Python, C++, etc.
regex
plugin) to support the Oniguruma logic for {N}+
, etc.\g<name>
don't restore capturing groups upon exiting the subroutine (unlike subroutines in PCRE and Perl). This means that subsequent \k<name>
backrefereces to such groups (or to named groups defined within the group that the subroutine references) will work like PCRE and not like Oniguruma.
regex
doesn't support the following subroutine features (they will throw if encountered):\g<N>
.If you wanted, you could disable subroutine and subroutine definition group processing via option disable: {subroutines: true}
. That would also prevent the errors when encountering unsupported subroutine features.
Clear and concise description of the problem
Great project! Idea: Building on
regex
(which has some overlap; it's a library for extending native JS syntax and emitting native JS regexes) would give you free support for several complex features that are hard to reimplement robustly. Additionally,regex
has an existing and easy to use plugin system that would allow extending it with additional Oniguruma features.Suggested solution
Extended Oniguruma syntax and flags supported by
regex
:x
. On by default, but can be disabled via an option. The handling is more robust/correct than oniguruma-to-js's current handling since whitespace and comments still separate tokens (e.g. with\0 1
), etc. See the collapsed details under "Show more details" here, but note that this is modeled on PCRE/Perl's flagxx
and doesn't work the same as Oniguruma within character classes.\g<name>
. This is the same syntax as Oniguruma, but note that it's modelled on PCRE/Perl subroutines which work a little differently (better) than Oniguruma for edge cases related to things like backreferences to groups defined within the subroutine. But for probably 95% of use, it works the same.Extended flags and syntax supported by
regex
but not by Oniguruma:n
("named capture only" mode). On by default, but can be disabled via an option. This makes(...)
noncapturing, and additionally makes numbered backreferences to named groups an error (which Oniguruma does even without flagn
).Existing oniguruma-to-js features not supported by
regex
:v
and emitting them as nested classes, they would correctly error when they're on a range boundary, and correctly be treated as a set if on a set subtraction/intersection boundary.\h
. Again trivial to add via a plugin, with support both inside and outside character classes, and correctly handling them as an error when on a character class range or set operation boundary.regex
also already supports this via interpolation ofRegExp
instances that have different flags than the outer regex.If you think building on
regex
could be a good fit, it would also be possible to extend theregex
library with new options that would make its features work more similarly to Oniguruma/Onigmo.Alternative
No response
Additional context
No response
Validations