antfu / oniguruma-to-js

Covert Oniguruma-flavor Regexp to JavaScript native RegExp.
MIT License
60 stars 1 forks source link

Thoughts on building on top of the `regex` library? #1

Closed slevithan closed 2 days ago

slevithan commented 2 weeks ago

Clear and concise description of the problem

Great project! Idea: Building on regex (which has some overlap; it's a library for extending native JS syntax and emitting native JS regexes) would give you free support for several complex features that are hard to reimplement robustly. Additionally, regex has an existing and easy to use plugin system that would allow extending it with additional Oniguruma features.

Suggested solution

Extended Oniguruma syntax and flags supported by regex:

Extended flags and syntax supported by regex but not by Oniguruma:

Existing oniguruma-to-js features not supported by regex:

If you think building on regex could be a good fit, it would also be possible to extend the regex library with new options that would make its features work more similarly to Oniguruma/Onigmo.

Alternative

No response

Additional context

No response

Validations

antfu commented 2 weeks ago

Hey, thanks for bringing this up, I am totally up to collaborate! I also found regex early today on Twitter.

regex seems to be a lot more thoughtful and robust in many ways, while my lib is mostly trying to get Shiki (TextMate grammars) working with JavaScript.

I tried to replace oniguruma-to-js with regex in Shiki, but it seems not producing expected result. Maybe I used it wrong?

https://github.com/shikijs/shiki/pull/762/files?short_path=245c1a2#diff-245c1a2a1d3ff31538c571eb7f586a6838013a7193aed1bd4c7c468ef45a0f83

For oniguruma-to-js, I wanted it to be fast and lightweight as much as possible, as my goal is to correctly transpile the regexes in Shiki's TextMate grammars in order to replace Onigumra's WASM entirely.

slevithan commented 2 weeks ago

Cool! I'm also totally up to collaborate if/where it makes sense! regex is totally aligned on the goals of being fast and lightweight.

I'll say up front though that it's possible regex isn't the right fit since its goal is modernizing JS regexes (although I'd likely be happy to adopt features that don't interfere with this goal while making it more broadly useful, in libraries like this one). One way it achieves this goal is by being a strict superset of JS regex syntax, with support for all ES2025 regex features. But Oniguruma isn't a strict superset of JS (lots of big and small differences). So regex could be very useful for quickly improving beyond the current version of oniguruma-to-js, but it might end up not being the right foundation if the goal is to strive for full Oniguruma compatibility (at least where that matters to get to 100% support for Shiki's TextMate grammars, since this goal is actually impossible if outputting native JS regexes).

I tried to replace oniguruma-to-js with regex in Shiki, but it seems not producing expected result. Maybe I used it wrong?

You used it right! You're getting a bunch of "Invalid escape" and "Invalid character in character class" errors because regex uses JS's flag v syntax, and flag v has particular/strict escaping rules within character classes. regex always uses either flag v or u (depending on whether your environment natively supports flag v), but it uses flag v escaping rules in both cases, since v is a best practice for all new JS regexes, so by using v syntax, regex simplifies down to the one most-modern parsing rules for regex syntax, rather than maintaining native JS's 4 (!) different parsing rules depending on flags and features used.

Extending from the setup in your linked diff, you can change regex to not use flag v's escaping rules like this:

const re = regex({
  flags: 'dg',
  subclass: true,
  unicodeSetsPlugin: null,
  disable: {
    x: true,
    n: true,
    v: true,
  },
})({ raw: [p] })

This adds disable: {v: true} to switch to flag u, and prevents applying flag v's escaping rules via unicodeSetsPlugin: null. These changes make it work for some but not all of the currently failing regexes in the diff you linked to. Some still throw because they escape characters that flag u (with its strict errors for unreserved escapes) doesn't allow you to escape. Allowing those would require using "Unicode-unaware" mode (no flag u or v), which is not great because it would drop support for \p{...} among other regex features and safeguards. BUT, support for needless escapes can easily be modded in with a simple plugin, like this:

regex({
  flags: 'dg',
  subclass: true,
  plugins: [str => str.replace(/(\\\\)|\\(['"`#])/g, '$1$2')],
  unicodeSetsPlugin: null,
  disable: {
    x: true,
    n: true,
    v: true,
  },
})({ raw: [p] })

The plugin used above un-escapes the characters ', ", `, and #. You could easily add more, but these were the needlessly escaped characters I noticed in the diff you linked to that would throw with flag u. It handles them whether they are inside or outside of character classes, and it correctly accounts for escaped backslashes that might precede these characters. I tested this with a bunch (but not all) of the regexes in your linked diff, and it worked with all those I tested it with.

Edit: I've built on the above code in a comment on the relevant Shiki PR.

slevithan commented 2 weeks ago

Aside: Note that oniguruma-to-js's current strategy of neutering possessive quantifiers and atomic groups (by turning them into greedy quantifiers and noncapturing groups) is quite dangerous. In addition to how in some cases it might change which strings are matched (which you already acknowledge within comments), in other cases it might pass tests but lead to ReDoS with certain inputs, which could hang people's servers, etc. In fact, this is quite likely, since protecting against this outcome with certain inputs is probably why the regex author used an atomic group or possessive quantifier in the first place.

I would therefore advise against this. IMO it's much better to just not support such regexes than to make them dangerous. But using regex allows you to support them since it has perfect/complete support for possessive quantifiers and atomic groups, which is nontrivial to do correctly with native JS regexes (since to emulate them you have to add capturing groups along with other tricks, so then you need to also rewrite following numbered backreferences in the pattern and use a RegExp subclass to not let these "emulation groups" alter backreference values in replacement strings/functions, etc.).

antfu commented 1 week ago

Yeah, I agree. Would regex support a processor version/utility (return source string instead of RegExp instance) where I can then do further process and contract the RegExp on my own (as you mention in https://github.com/shikijs/shiki/pull/762#issuecomment-2323038293 about (?i: etc, so I can workaround it)

slevithan commented 1 week ago

Would regex support a processor version/utility (return source string instead of RegExp instance) where I can then do further process and contract the RegExp on my own

That's a great idea, and yes. Supporting an option to return a string instead of a regex is totally doable. (Tentative option name returnString or processOnly, but I'd love better suggestions.)

I'm traveling at the moment but will try to add such a feature in the next few days. There are a few things I need to review as part of adding this to ensure it works well.

Question: Do TextMate grammars ever do search and replace with backreferences in a user-provided replacement string? If not (and regexes are only used to search), this would be much easier. If yes, regex's automatic rewriting of numbered backreferences could create an issue (solvable but with more complexity).

antfu commented 1 week ago

Do you aim to have a single exported API for regex? It's a great way to limit the public interface for sure. But from my point of view, I care more about tree shaking and composability. Exposing features via options through a single entry point makes everything non-treeshakable, meaning that even if a user doesn't need all the features it provides, they still need to pay the cost of the full package - do we consider expose some functions, like function convertRegex(input: string): string?

It's it's a bit awkward to use regex({})({ raw: [pattern] }) then a function version createRegex(pattern, {}) - as in our cases, we don't need the string template functionality either.

Question: Do TextMate grammars ever do search and replace with backreferences

I am not very sure if they would have backreferences or not. But for Shiki it only needs regex.match() and get the indices of each group.


Thanks a lot for all the information and the willingness to collaborate, even though my usage can be a bit tricky. Enjoy your travel! We can talk more about it after that.

slevithan commented 1 week ago

Good feedback. I've just published a new version of regex that includes the new function rewrite(str, options). It returns an object with two properties (expression, flags), and if you passed those properties to RegExp you'd get the same result as using the regex tag. rewrite doesn't give you context-aware interpolation and isn't a raw string template, but your use case doesn't need those things anyway. It supports all of regex's options except subclass, so you could even use plugins with it if you wanted.

I'd recommend something like this:

import { rewrite } from 'regex'
str = rewrite(str, {
  flags: 'dg',
  unicodeSetsPlugin: null,
  disable: {
    n: true,
    v: true,
    x: true,
  },
}).expression

This leaves the pattern alone apart from transforming atomic groups, possessive quantifiers, subroutines, and subroutine definition groups.

Differences from Oniguruma (all are edge cases):

If you wanted, you could disable subroutine and subroutine definition group processing via option disable: {subroutines: true}. That would also prevent the errors when encountering unsupported subroutine features.