LeaVerou / brep

Write batch find & replace scripts that transform files with a simple human-readable syntax
12 stars 0 forks source link

Add `regex` as a flavor option for improved DX #17

Open slevithan opened 1 month ago

slevithan commented 1 month ago

Love the project. 😊

A significant downside of this project being written in JavaScript is that JS regexes, with any significant level of complexity, are awful with readability and maintainability compared to other major regex flavors (due to the lack of options for free spacing, comments, and any features for composing/reusing subpatterns). They're also especially vulnerable to catastrophic backtracking due to the lack of backtracking control syntax.

As a result, I think my recently released regex package could be a great fit to include with brep, either as an option or even as the default regex handler. It uses a strict superset of native JS regex syntax, so, apart from the differences that result from its implicit flags x and n, 100% of JS regexes (with flag v) work the same. If it wasn't used as the default, perhaps there could be a new option like regexp_flavor with values 'RegExp' (native) and 'regex'.

From the regex package's readme:

regex is a template tag that extends JavaScript regular expressions with features that make them more powerful and dramatically more readable. It returns native RegExp instances that equal or exceed native performance. It's also lightweight, supports all ES2024+ regex features, and can be used as a Babel plugin to avoid any runtime dependencies or added runtime cost.

Highlights include support for free spacing and comments, atomic groups via (?>…) that can help you avoid ReDoS, subroutines via \g<name> and definition groups via (?(DEFINE)…) that enable powerful composition, and context-aware interpolation of regexes, escaped strings, and partial patterns.

With the regex package, JavaScript steps up as one of the best regex flavors alongside PCRE and Perl, and maybe surpassing C++, Java, .NET, and Python.

What do you think?

Note: Since brep would need to call regex with dynamic input rather than using it with backticks as a template tag, that would work like this: regex({raw: [<pattern-string>]}) or regex(<flags-str>)({raw: [<pattern-string>]}).

LeaVerou commented 1 month ago

Oh I love this idea! Introducing a regex dependency could also simplify some of our existing regex handling code.

WRT the API, rather than adding a new property, I think this could be a value for the regexp property:

Downside is that regexp: 'regex' looks very clumsy lol.

LeaVerou commented 1 month ago

It uses a strict superset of native JS regex syntax, so, apart from the differences that result from its implicit flags x and n, 100% of JS regexes (with flag v) work the same. If it wasn't used as the default, perhaps there could be a new option like regexp_flavor with values 'RegExp' (native) and 'regex'.

Just read this more carefully. I think having it as a default could work (with regexp: "js" as an escape hatch, which would work as described above).

Though I just realized, you probably want to set the regexp flavor for the whole file, but not necessarily turn every replacement into a regex replacement. So perhaps there should also be a regexp_flavor property? 🤔 Or optionally an object literal: regexp: {flavor: "js", enabled: false}.


Important question: Does regex work in the browser via ESM? I see only a version using a global variable.

slevithan commented 1 month ago

Awesome!

Introducing a regex dependency could also simplify some of our existing regex handling code.

I'd be happy to help with this after the dependency is introduced!

Just read this more carefully. I think having it as a default could work (with regexp: "js" as an escape hatch, which would work as described above).

I think this is a great idea. People would need the escape hatch in only three cases:

Of course, instead of using the escape hatch, they could switch to named backreferences and the standard ways of escaping whitespace and '#' that work in every regex flavor that supports flag x.

Though I just realized, you probably want to set the regexp flavor for the whole file

Since you need the regexp: true line for every regex replacement anyway, I don't see the harm in always setting regexp: "js" instead in every case, if that's what you want. Mixing flavors seems plausible as a desired feature. I can see it being used when copying and pasting existing JS regexes into a brep file, if you don't want to think about any changes for that particular regex.

That said, your other ideas about a regexp_flavor option or an object literal seem good. I especially like regexp_flavor: "js" since that's more explicit and therefore possibly easier to understand than regexp: "js". And it allows easily introducing a mechanism for setting the default regexp_flavor value for the whole file, in a way that's easy to understand what's going on.


Edit: If there's a way to set the default regex flavor for the file, then if for whatever reason you end up deciding to make "js" the default flavor instead, it would still be easy to switch the file into the "regex" flavor (maybe using a value like "lib:regex" if that's less clumsy) in cases when developers want to use more complex regexes with free-spacing, subroutines, atomic groups, etc.

slevithan commented 1 month ago

Important question: Does regex work in the browser via ESM? I see only a version using a global variable.

Happy to add an ESM bundle in the next version. I expect to publish it before the end of this week.

I'm not sure of the best way to do this since I don't have experience running ESM in the browser, but I'm thinking to use esbuild to create a bundle dist/regex.min.mjs. Any suggestions are very welcome.

LeaVerou commented 1 month ago

I think this is a great idea. People would need the escape hatch in only three cases:

  • If they use numbered backreferences (which throw with regex due to the implicit flag n).
  • If they use literal whitespace (which is not matched due to the implicit flag x).
  • If they use a literal '#' outside of character classes (since this starts a comment due to the implicit flag x).

Of course, instead of using the escape hatch, they could switch to named backreferences and the standard ways of escaping whitespace and '#' that work in every regex flavor that supports flag x.

Oof, depending on numbered capturing groups is pretty common, many examples in the docs even use that. Whitespace and# is a bit less of an issue, but could still be surprising if default. Is there any way to control these three features separately? E.g. to get regex to support numbered capturing groups so one could still take advantage of whitespace and comments?

Since you need the regexp: true line for every regex replacement anyway, I don't see the harm in always setting regexp: "js" instead in every case, if that's what you want.

That seems pretty tedious and fragile. :/

Mixing flavors seems plausible as a desired feature. I can see it being used when copying and pasting existing JS regexes into a brep file, if you don't want to think about any changes for that particular regex.

Oh absolutely. But it should not get in the way of the simple case, which is setting the flavor for the whole file...

slevithan commented 1 month ago

But it should not get in the way of the simple case, which is setting the flavor for the whole file...

Absolutely agree.

Oof, depending on numbered capturing groups is pretty common, many examples in the docs even use that. Whitespace and # is a bit less of an issue, but could still be surprising if default. Is there any way to control these three features separately? E.g. to get regex to support numbered capturing groups so one could still take advantage of whitespace and comments?

Yeah, these are real concerns if it was made the default flavor. The easy way out is of course to make it an opt-in flavor (with native "js" as the default), which is not so bad if you can easily set a default flavor for the file. But in case making regex the default flavor is a salvageable idea, let's explore this...

Flag x

I intentionally hide the implicit flags x and n behind obscure options (which are documented as intended only for debugging), but it would be very easy for brep to turn them off. For flag x:

// whitespace and `#` treated literally
regex({flags: 'gm', __flagX: false})`...`

// or for dynamic values without using backticks:
regex({flags: 'gm', __flagX: false})({raw: [pattern]});

This is totally fine/safe to do if you wanted to make flag x an opt-in feature like the ignore_case option. But then, shielding it behind an option would make people at least somewhat less likely to use free-spacing.

Flag n

This can be turned off the same way:

// numbered backreferences allowed, and `(...)` captures instead of just grouping
regex({__flagN: false})`...`;

....but, turning flag n off is not safe unless you avoid atomic groups and subroutines. The reason is that emulating an atomic group while still emitting a native RegExp instance requires adding a capturing group to the regex. Using subroutines (via \g<name>) also adds anonymous captures if there are any backreferences in the regex. This is not a problem for the regex library itself since it rewrites backreferences to point them to the correct groups. However, if any captures are added to the output this way, then numbered backreferences that come after them will no longer point to the correct groups on match arrays or in replacement strings (via .replace/.replaceAll). So, to avoid the issue, I leaned into flags x and n always being on.

Based on this conversation, I am now planning to add an __extended_syntax: false (or similar) option in the soon-coming next release. If you wanted, you could then safely leave flag x enabled (giving you free-spacing and comments) while disabling flag n and extended syntax (allowing anonymous captures and numbered backreferences to work, and be safe to use).

Flag n as a best practice

Aside: Flag n being always on has several benefits in addition to allowing safe emulation of atomic groups and subroutines:

Path forward for brep with regex

In the regex library, I treat the implicit flags x and n as enforcing best practices. But I see why this would be problematic (especially flag n) for brep, unless you also adopted the opinionated stance of enforcing best practices.

If I were in your shoes as the author of brep, I think I might do the following:

An alternative option might be:

LeaVerou commented 1 month ago

Wow thanks for such a detailed writeup! I love it!

One thing I’ve been wondering about is framing it in terms of capabilities rather than flavors. E.g. toggling comments, better whitespace handling, etc rather than selecting a flavor, which seems a bit like exposing an implementation detail to users. Then brep could pick a suitable library (JS or regex or just use regex at all times with suitable flags) based on the capabilities needed. But this only works if the user needs are around capabilities — e.g. if a user need is to select a completely different regex syntax, that actually entails exposing a flavor option.

....but, turning flag n off is not safe unless you avoid atomic groups and subroutines.

Is this a fundamental restriction or an implementation bug? Btw I would have expected that turning flag n off also turns these off and either matches them literally (if a fundamental restriction) or throws (if an implementation bug, so that people don't depend on literal matching and you could support this in the future).

Thinking out loud, I wonder if one way to lift this restriction could be to rewrite all would-be numbered groups into groups with a given prefix (that doesn’t clash with other groups in the regex), and rewrite any backreferences accordingly. Then, things should Just Work™ within the regexp, though require special handling when the regex is used. That could be addressed by subclassing RegExp and returning an object that does this automatically, though I would understand if it’s too much work and not worth it.

slevithan commented 1 month ago

(Oops, accidentally closed the issue and posted an incomplete comment that I've since deleted.)

One thing I’ve been wondering about is framing it in terms of capabilities rather than flavors. E.g. toggling comments, better whitespace handling, etc rather than selecting a flavor, which seems a bit like exposing an implementation detail to users. Then brep could pick a suitable library (JS or regex or just use regex at all times with suitable flags) based on the capabilities needed. But this only works if the user needs are around capabilities — e.g. if a user need is to select a completely different regex syntax, that actually entails exposing a flavor option.

Treating the flavor as an implementation detail is almost always a detriment to a tool's advanced users. The reason is that regex flavors tend to compose innumerable small syntax and behavior differences, in addition to the more obvious large differences. And there is a ton of precedent in other tools for not treating the flavor as an implementation detail. As just one example, ripgrep uses Rust regexes by default, but allows switching to PCRE2 regexes via option -P, --pcre2, or --engine pcre2.

BUT, regex is a special case. Since it produces native RegExp instances and goes out of its way to support 100% of JS regex syntax in exactly the same way (including for in-progress TC39 regex proposals that only work in some browsers, like pattern modifiers and duplicate capture names), it is in fact safe to treat the use of regex as an implementation detail. However, I still wouldn't personally do so because having to opt into each additional feature would discourage their use, and it would additionally make it harder to support additional flavors in the future (which almost certainly would not be best treated as implementation details). For example, maybe you'll eventually want to support Rust regexes via the rregex WASM build of Rust's regex crate. But note: unlike regex, the rregex library is both very large and very slow (compared to native JS regexes).

....but, turning flag n off is not safe unless you avoid atomic groups and subroutines.

Is this a fundamental restriction or an implementation bug?

It's fundamental.

Btw I would have expected that turning flag n off also turns these off and either matches them literally (if a fundamental restriction) or throws (if an implementation bug, so that people don't depend on literal matching and you could support this in the future).

Automatically disabling support for extended syntax when flag n is disabled is a good idea--I'll have to think more about that. I hadn't considered that so far because, until now, the __flagN option was intended only for use in tests and debugging. This idea of using the __flagN and __flagX options in tooling like brep is making me reconsider these options as core features.

Minor note: Turning off support for extended syntax would always lead to throwing when the syntax is used (there is no option for matching it literally), since all of the new syntax is invalid in JS without explicit support for it.

Thinking out loud, I wonder if one way to lift this restriction could be to rewrite all would-be numbered groups into groups with a given prefix (that doesn’t clash with other groups in the regex), and rewrite any backreferences accordingly. Then, things should Just Work™ within the regexp, though require special handling when the regex is used. That could be addressed by subclassing RegExp and returning an object that does this automatically, though I would understand if it’s too much work and not worth it.

You've got all the right ideas here. 😊 Regarding rewriting captures and using unique prefixes, yeah, as you mentioned, unless you also subclassed RegExp and replaced the subclass's exec, @@replace, and @@species, you would pollute the groups object of matches, wouldn't be able to reference the rewritten capturing groups outside of the regex, and wouldn't be able to safely copy the generated regexes.

In fact, regex already does one better by renumbering any numbered backreferences in the pattern as needed, so no capture-name-prefixes are needed, and as a result there is no pollution of groups, no need to rewrite capturing groups, and no problems with using the RegExp constructor to copy a generated regex.

But, even with the renumbering of backreferences to make everything within the pattern work automatically, there is still the issue of numbered submatches on match arrays and numbered backreferences in .replace/.replaceAll. Since flag n is always on, this is not a meaningful issue today, since all captures have names, and people are expected to therefore always reference them by their names (and, within the pattern itself, they are prevented from referencing them by number).

This entire issue could be solved by, like you suggested, having regex return a RegExp subclass with all of the needed extensions. But the reason I haven't done so thus far has less to do with the amount of work involved, and more to do with this Stage 1 proposal for restricting subclassing support in built-in methods. If that proposal were accepted/implemented, it would break this functionality, and the regex library would end up in a bad place. 😖 So, at least for now, I would prefer not to go down this path, unless/until there is more clarity on whether the proposal will advance (which might be a long time coming).

That leaves us in the situation that it is safe to use flag x piecemeal, but not safe to enable extended syntax (specifically, atomic groups and subroutines) without flag n also being enabled.

slevithan commented 1 month ago

Aside: Thinking more about subclassing, one thing I can use it safely for (even if subclassing of RegExp is restricted in the future), is to make it so that if a regex is generated with flag n enabled, any attempt to reference its named captures by number in any way from outside of the regex (e.g. in replacement strings) will throw. Then, if the ability to do this is restricted by JS in the future, it would no longer throw in these edge cases, but everything else would work the same.

There would be a number of complexities and touchpoints to cover (though ES6 proxies should help), so I'll have to think about the tradeoffs. But I'll consider adding this in the next major version. And any feedback would of course be welcome.


PS: None of this complexity applies to brep. Since brep has control over the replacement function (without the need for subclassing or supporting every edge case of the JS language), it should be very simple if you wanted brep to throw when (1) using the regex flavor, (2) flag n is enabled, and (3) named captures are referenced by number in the replacement string. I'd be happy to contribute such a change.

slevithan commented 1 month ago

regex 3.1.0 is now live, and among other improvements, it includes two additions inspired by this convo:

slevithan commented 1 month ago

regex 4.0.0 is now live, and this release specifically focused on improving and simplifying regex's API for advanced options (which should make it easier for brep to adopt).

As a result, a lot of what I said above is now out of date. 😎

A few specific things:

Assuming you're still interested 😊 I'd be happy to help with any PRs, but it's probably best for you to decide up front about exactly how to include regex and expose it to users.

LeaVerou commented 1 month ago

This is amazing! My bandwidth is low right now because my PhD defense is coming up in a few days, but I'm definitely still interested. Need to mull over the API though to be novice-friendly without causing problems for more advanced users.