Add `regex` as a flavor option for improved DX

LeaVerou / brep

Write batch find & replace scripts that transform files with a simple human-readable syntax

12 stars 0 forks source link

Add `regex` as a flavor option for improved DX #17

Open slevithan opened 1 month ago

slevithan commented 1 month ago

Love the project. 😊

A significant downside of this project being written in JavaScript is that JS regexes, with any significant level of complexity, are awful with readability and maintainability compared to other major regex flavors (due to the lack of options for free spacing, comments, and any features for composing/reusing subpatterns). They're also especially vulnerable to catastrophic backtracking due to the lack of backtracking control syntax.

As a result, I think my recently released regex package could be a great fit to include with brep, either as an option or even as the default regex handler. It uses a strict superset of native JS regex syntax, so, apart from the differences that result from its implicit flags x and n, 100% of JS regexes (with flag v) work the same. If it wasn't used as the default, perhaps there could be a new option like regexp_flavor with values 'RegExp' (native) and 'regex'.

From the regex package's readme:

regex is a template tag that extends JavaScript regular expressions with features that make them more powerful and dramatically more readable. It returns native RegExp instances that equal or exceed native performance. It's also lightweight, supports all ES2024+ regex features, and can be used as a Babel plugin to avoid any runtime dependencies or added runtime cost.

Highlights include support for free spacing and comments, atomic groups via (?>…) that can help you avoid ReDoS, subroutines via \g<name> and definition groups via (?(DEFINE)…) that enable powerful composition, and context-aware interpolation of regexes, escaped strings, and partial patterns.

With the regex package, JavaScript steps up as one of the best regex flavors alongside PCRE and Perl, and maybe surpassing C++, Java, .NET, and Python.

What do you think?

Note: Since brep would need to call regex with dynamic input rather than using it with backticks as a template tag, that would work like this: regex({raw: [<pattern-string>]}) or regex(<flags-str>)({raw: [<pattern-string>]}).

LeaVerou commented 1 month ago

Oh I love this idea! Introducing a regex dependency could also simplify some of our existing regex handling code.

WRT the API, rather than adding a new property, I think this could be a value for the regexp property:

true = turn on regexes and inherit flavor from parent. If on root, turn on regexes and use native JS.
any string = turn on regexes and use that specific flavor. Throw if flavor not recognized.
false = turn off regex matching.

Downside is that regexp: 'regex' looks very clumsy lol.

LeaVerou commented 1 month ago

It uses a strict superset of native JS regex syntax, so, apart from the differences that result from its implicit flags x and n, 100% of JS regexes (with flag v) work the same. If it wasn't used as the default, perhaps there could be a new option like regexp_flavor with values 'RegExp' (native) and 'regex'.

Just read this more carefully. I think having it as a default could work (with regexp: "js" as an escape hatch, which would work as described above).

Though I just realized, you probably want to set the regexp flavor for the whole file, but not necessarily turn every replacement into a regex replacement. So perhaps there should also be a regexp_flavor property? 🤔 Or optionally an object literal: regexp: {flavor: "js", enabled: false}.

Important question: Does regex work in the browser via ESM? I see only a version using a global variable.

slevithan commented 1 month ago

Awesome!

Introducing a regex dependency could also simplify some of our existing regex handling code.

I'd be happy to help with this after the dependency is introduced!

Just read this more carefully. I think having it as a default could work (with regexp: "js" as an escape hatch, which would work as described above).

I think this is a great idea. People would need the escape hatch in only three cases:

If they use numbered backreferences (which throw with regex due to the implicit flag n).
If they use literal whitespace (which is not matched due to the implicit flag x).
If they use a literal '#' outside of character classes (since this starts a comment due to the implicit flag x).

Of course, instead of using the escape hatch, they could switch to named backreferences and the standard ways of escaping whitespace and '#' that work in every regex flavor that supports flag x.

Though I just realized, you probably want to set the regexp flavor for the whole file

Since you need the regexp: true line for every regex replacement anyway, I don't see the harm in always setting regexp: "js" instead in every case, if that's what you want. Mixing flavors seems plausible as a desired feature. I can see it being used when copying and pasting existing JS regexes into a brep file, if you don't want to think about any changes for that particular regex.

That said, your other ideas about a regexp_flavor option or an object literal seem good. I especially like regexp_flavor: "js" since that's more explicit and therefore possibly easier to understand than regexp: "js". And it allows easily introducing a mechanism for setting the default regexp_flavor value for the whole file, in a way that's easy to understand what's going on.

Edit: If there's a way to set the default regex flavor for the file, then if for whatever reason you end up deciding to make "js" the default flavor instead, it would still be easy to switch the file into the "regex" flavor (maybe using a value like "lib:regex" if that's less clumsy) in cases when developers want to use more complex regexes with free-spacing, subroutines, atomic groups, etc.

slevithan commented 1 month ago

Important question: Does regex work in the browser via ESM? I see only a version using a global variable.

Happy to add an ESM bundle in the next version. I expect to publish it before the end of this week.

I'm not sure of the best way to do this since I don't have experience running ESM in the browser, but I'm thinking to use esbuild to create a bundle dist/regex.min.mjs. Any suggestions are very welcome.

LeaVerou commented 1 month ago

I think this is a great idea. People would need the escape hatch in only three cases:

If they use numbered backreferences (which throw with regex due to the implicit flag n).

If they use literal whitespace (which is not matched due to the implicit flag x).

If they use a literal '#' outside of character classes (since this starts a comment due to the implicit flag x).

Of course, instead of using the escape hatch, they could switch to named backreferences and the standard ways of escaping whitespace and '#' that work in every regex flavor that supports flag x.

Oof, depending on numbered capturing groups is pretty common, many examples in the docs even use that. Whitespace and# is a bit less of an issue, but could still be surprising if default. Is there any way to control these three features separately? E.g. to get regex to support numbered capturing groups so one could still take advantage of whitespace and comments?

Since you need the regexp: true line for every regex replacement anyway, I don't see the harm in always setting regexp: "js" instead in every case, if that's what you want.

That seems pretty tedious and fragile. :/

Mixing flavors seems plausible as a desired feature. I can see it being used when copying and pasting existing JS regexes into a brep file, if you don't want to think about any changes for that particular regex.

Oh absolutely. But it should not get in the way of the simple case, which is setting the flavor for the whole file...

slevithan commented 1 month ago

But it should not get in the way of the simple case, which is setting the flavor for the whole file...

Absolutely agree.

Oof, depending on numbered capturing groups is pretty common, many examples in the docs even use that. Whitespace and # is a bit less of an issue, but could still be surprising if default. Is there any way to control these three features separately? E.g. to get regex to support numbered capturing groups so one could still take advantage of whitespace and comments?

Yeah, these are real concerns if it was made the default flavor. The easy way out is of course to make it an opt-in flavor (with native "js" as the default), which is not so bad if you can easily set a default flavor for the file. But in case making regex the default flavor is a salvageable idea, let's explore this...

Flag `x`

I intentionally hide the implicit flags x and n behind obscure options (which are documented as intended only for debugging), but it would be very easy for brep to turn them off. For flag x:

// whitespace and `#` treated literally
regex({flags: 'gm', __flagX: false})`...`

// or for dynamic values without using backticks:
regex({flags: 'gm', __flagX: false})({raw: [pattern]});

This is totally fine/safe to do if you wanted to make flag x an opt-in feature like the ignore_case option. But then, shielding it behind an option would make people at least somewhat less likely to use free-spacing.

Flag `n`

This can be turned off the same way:

// numbered backreferences allowed, and `(...)` captures instead of just grouping
regex({__flagN: false})`...`;

....but, turning flag n off is not safe unless you avoid atomic groups and subroutines. The reason is that emulating an atomic group while still emitting a native RegExp instance requires adding a capturing group to the regex. Using subroutines (via \g<name>) also adds anonymous captures if there are any backreferences in the regex. This is not a problem for the regex library itself since it rewrites backreferences to point them to the correct groups. However, if any captures are added to the output this way, then numbered backreferences that come after them will no longer point to the correct groups on match arrays or in replacement strings (via .replace/.replaceAll). So, to avoid the issue, I leaned into flags x and n always being on.

Based on this conversation, I am now planning to add an __extended_syntax: false (or similar) option in the soon-coming next release. If you wanted, you could then safely leave flag x enabled (giving you free-spacing and comments) while disabling flag n and extended syntax (allowing anonymous captures and numbered backreferences to work, and be safe to use).

Flag `n` as a best practice

Aside: Flag n being always on has several benefits in addition to allowing safe emulation of atomic groups and subroutines:

It encourages using named captures.
- Makes regexes more self-documenting.
- Makes code that uses the results more readable.
- Avoids mistakes when refactoring regexes.
Makes noncapturing groups more readable by allowing you to use (...) and avoid the syntactic clumsiness of (?:...).
It avoids the footgun of referring to named captures by number.
- Avoids mistakes when refactoring regexes.
- Makes regexes more portable since some flavors don't allow it (Ruby, and C++ with nosubs) and in other flavors the numbering of named captures is inconsistent (e.g. in JS it's left to right for all captures, but in .NET it's unnamed captures first followed by named captures).

Path forward for `brep` with `regex`

In the regex library, I treat the implicit flags x and n as enforcing best practices. But I see why this would be problematic (especially flag n) for brep, unless you also adopted the opinionated stance of enforcing best practices.

If I were in your shoes as the author of brep, I think I might do the following:

Make the native "js" flavor the default, to avoid developer surprise.
Highlight in the docs how you can easily opt into more readable multiline regexes by switching the file's regex flavor.
- A possible example of how to explain the regex flavor simply: "This option uses the regex package which supports all JavaScript regex syntax but extends it with several key features from PCRE that can dramatically improve the readability and performance of your regexes. These features include insignificant whitespace, comments, subroutines, definition groups, "named capture only" mode, and atomic groups."

An alternative option might be:

Disable flag n and extended syntax.
Enable flag x by default (like brep already does for flags gmsv).
- Potentially offer an option to disable flag x.
Always use regex (no flavor options, since regex already outputs native regexes and uses native syntax).

LeaVerou commented 1 month ago

Wow thanks for such a detailed writeup! I love it!

One thing I’ve been wondering about is framing it in terms of capabilities rather than flavors. E.g. toggling comments, better whitespace handling, etc rather than selecting a flavor, which seems a bit like exposing an implementation detail to users. Then brep could pick a suitable library (JS or regex or just use regex at all times with suitable flags) based on the capabilities needed. But this only works if the user needs are around capabilities — e.g. if a user need is to select a completely different regex syntax, that actually entails exposing a flavor option.

....but, turning flag n off is not safe unless you avoid atomic groups and subroutines.

Is this a fundamental restriction or an implementation bug? Btw I would have expected that turning flag n off also turns these off and either matches them literally (if a fundamental restriction) or throws (if an implementation bug, so that people don't depend on literal matching and you could support this in the future).

Thinking out loud, I wonder if one way to lift this restriction could be to rewrite all would-be numbered groups into groups with a given prefix (that doesn’t clash with other groups in the regex), and rewrite any backreferences accordingly. Then, things should Just Work™ within the regexp, though require special handling when the regex is used. That could be addressed by subclassing RegExp and returning an object that does this automatically, though I would understand if it’s too much work and not worth it.

slevithan commented 1 month ago

(Oops, accidentally closed the issue and posted an incomplete comment that I've since deleted.)

One thing I’ve been wondering about is framing it in terms of capabilities rather than flavors. E.g. toggling comments, better whitespace handling, etc rather than selecting a flavor, which seems a bit like exposing an implementation detail to users. Then brep could pick a suitable library (JS or regex or just use regex at all times with suitable flags) based on the capabilities needed. But this only works if the user needs are around capabilities — e.g. if a user need is to select a completely different regex syntax, that actually entails exposing a flavor option.

Treating the flavor as an implementation detail is almost always a detriment to a tool's advanced users. The reason is that regex flavors tend to compose innumerable small syntax and behavior differences, in addition to the more obvious large differences. And there is a ton of precedent in other tools for not treating the flavor as an implementation detail. As just one example, ripgrep uses Rust regexes by default, but allows switching to PCRE2 regexes via option -P, --pcre2, or --engine pcre2.

BUT, regex is a special case. Since it produces native RegExp instances and goes out of its way to support 100% of JS regex syntax in exactly the same way (including for in-progress TC39 regex proposals that only work in some browsers, like pattern modifiers and duplicate capture names), it is in fact safe to treat the use of regex as an implementation detail. However, I still wouldn't personally do so because having to opt into each additional feature would discourage their use, and it would additionally make it harder to support additional flavors in the future (which almost certainly would not be best treated as implementation details). For example, maybe you'll eventually want to support Rust regexes via the rregex WASM build of Rust's regex crate. But note: unlike regex, the rregex library is both very large and very slow (compared to native JS regexes).

....but, turning flag n off is not safe unless you avoid atomic groups and subroutines.

Is this a fundamental restriction or an implementation bug?

It's fundamental.

Btw I would have expected that turning flag n off also turns these off and either matches them literally (if a fundamental restriction) or throws (if an implementation bug, so that people don't depend on literal matching and you could support this in the future).

Automatically disabling support for extended syntax when flag n is disabled is a good idea--I'll have to think more about that. I hadn't considered that so far because, until now, the __flagN option was intended only for use in tests and debugging. This idea of using the __flagN and __flagX options in tooling like brep is making me reconsider these options as core features.

Minor note: Turning off support for extended syntax would always lead to throwing when the syntax is used (there is no option for matching it literally), since all of the new syntax is invalid in JS without explicit support for it.

Thinking out loud, I wonder if one way to lift this restriction could be to rewrite all would-be numbered groups into groups with a given prefix (that doesn’t clash with other groups in the regex), and rewrite any backreferences accordingly. Then, things should Just Work™ within the regexp, though require special handling when the regex is used. That could be addressed by subclassing RegExp and returning an object that does this automatically, though I would understand if it’s too much work and not worth it.

You've got all the right ideas here. 😊 Regarding rewriting captures and using unique prefixes, yeah, as you mentioned, unless you also subclassed RegExp and replaced the subclass's exec, @@replace, and @@species, you would pollute the groups object of matches, wouldn't be able to reference the rewritten capturing groups outside of the regex, and wouldn't be able to safely copy the generated regexes.

In fact, regex already does one better by renumbering any numbered backreferences in the pattern as needed, so no capture-name-prefixes are needed, and as a result there is no pollution of groups, no need to rewrite capturing groups, and no problems with using the RegExp constructor to copy a generated regex.

But, even with the renumbering of backreferences to make everything within the pattern work automatically, there is still the issue of numbered submatches on match arrays and numbered backreferences in .replace/.replaceAll. Since flag n is always on, this is not a meaningful issue today, since all captures have names, and people are expected to therefore always reference them by their names (and, within the pattern itself, they are prevented from referencing them by number).

This entire issue could be solved by, like you suggested, having regex return a RegExp subclass with all of the needed extensions. But the reason I haven't done so thus far has less to do with the amount of work involved, and more to do with this Stage 1 proposal for restricting subclassing support in built-in methods. If that proposal were accepted/implemented, it would break this functionality, and the regex library would end up in a bad place. 😖 So, at least for now, I would prefer not to go down this path, unless/until there is more clarity on whether the proposal will advance (which might be a long time coming).

That leaves us in the situation that it is safe to use flag x piecemeal, but not safe to enable extended syntax (specifically, atomic groups and subroutines) without flag n also being enabled.

slevithan commented 1 month ago

Aside: Thinking more about subclassing, one thing I can use it safely for (even if subclassing of RegExp is restricted in the future), is to make it so that if a regex is generated with flag n enabled, any attempt to reference its named captures by number in any way from outside of the regex (e.g. in replacement strings) will throw. Then, if the ability to do this is restricted by JS in the future, it would no longer throw in these edge cases, but everything else would work the same.

There would be a number of complexities and touchpoints to cover (though ES6 proxies should help), so I'll have to think about the tradeoffs. But I'll consider adding this in the next major version. And any feedback would of course be welcome.

PS: None of this complexity applies to brep. Since brep has control over the replacement function (without the need for subclassing or supporting every edge case of the JS language), it should be very simple if you wanted brep to throw when (1) using the regex flavor, (2) flag n is enabled, and (3) named captures are referenced by number in the replacement string. I'd be happy to contribute such a change.

slevithan commented 1 month ago

regex 3.1.0 is now live, and among other improvements, it includes two additions inspired by this convo:

Includes a browser-compatible ESM bundle (at dist/regex.mjs).
Includes new option __extendSyntax, which defaults to the value of __flagN (which means that setting __flagN: false also disables extended syntax).

slevithan commented 1 month ago

regex 4.0.0 is now live, and this release specifically focused on improving and simplifying regex's API for advanced options (which should make it easier for brep to adopt).

As a result, a lot of what I said above is now out of date. 😎

A few specific things:

The previously experimental/debugging options have been redesigned and renamed, and are now fully documented and supported features (see the readme's new Options section, and expand the collapsed details).
Syntax features can now be individually disabled, so e.g. you could use regex({disable: {atomic: true}})`…` to disable support for atomic groups but keep support for other extended syntax. This truly enables your idea of adopting capabilities rather than flavors.
There is a new option subclass: true that makes it perfectly safe to disable flag n and use numbered backreferences inside and outside of the regex (in all contexts and with all methods). So regex({subclass: true, disable: {n: true}})`…` invalidates basically everything I said on the subject before. With subclass enabled, there are no longer any gotchas to be aware of with any of regex's new flags or syntax (they can all be enabled or disabled as you please). I'd recommend that brep just always use subclass: true when operating on user-provided regexes.
- Implementing this turned out to be significantly easier than I described it.
- It subclasses RegExp in a way that uses one feature that could be deprecated in the future (based on the Stage 1 proposal I previously linked to), but if that ever happens (which I think is unlikely), now that I've gone through the exercise of implementing this I now understand how to make it work (with more code) even if I couldn't use the same features in the future.
- In any case, that proposal remains at Stage 1, hasn't moved forward in the 4 years it's been around, there are lots of concerns/complaints from developers about it, and JS is not a language with a history of removing features.

Assuming you're still interested 😊 I'd be happy to help with any PRs, but it's probably best for you to decide up front about exactly how to include regex and expose it to users.

LeaVerou commented 1 month ago

This is amazing! My bandwidth is low right now because my PhD defense is coming up in a few days, but I'm definitely still interested. Need to mull over the API though to be novice-friendly without causing problems for more advanced users.

LeaVerou / brep