fabian-hiller / valibot

The modular and type safe schema library for validating structural data 🤖
https://valibot.dev
MIT License
5.84k stars 179 forks source link

Improve DX with `babel-plugin-transform-regex` #704

Open slevithan opened 1 month ago

slevithan commented 1 month ago

Thoughts on adding babel-plugin-transform-regex to devDependencies?

This would allow changing e.g. the unreadable/unmaintainable IPV4_REGEX:

/^(?:(?:[1-9]|1\d|2[0-4])?\d|25[0-5])(?:\.(?:(?:[1-9]|1\d|2[0-4])?\d|25[0-5])){3}$/u

To the much nicer:

regex`^
  (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d )
  ( \. \g<byte> ){3}
$`

This would then get transpiled into a native regex (you can try it here), without the need for any runtime dependency, any added run-time cost for users, or any interpolation when constructing the regex (which could cause issues with tree-shaking).

This is just an example, and some of the other regexes in library/src/regex.ts could benefit more significantly.

(This is a follow up from #666, where interpolation was removed from regex changes due to tree-shaking concerns.)

fabian-hiller commented 1 month ago

Thanks for the idea! I am not sure how to proceed. I can see the advantages, but also some disadvantages.

slevithan commented 1 month ago

Those are all reasonable, but I'd make a few points in response.

Developers have to look at the plugin first before they can fix, improve or add a regex

IMO this is minor compared to what you're currently expecting of developers who want to fix or improve the regexes, given the current, more-or-less unmaintainable, giant regexes that offer no accompanying guidance or features that make them self-documenting. regex doesn't add anything magical or unusual. All of its (small number of) syntax extensions are available and work the same way in PCRE, for example. And developers don't have to use regex to add new regexes, unless they choose to do so because of the benefits it offers.

We have to trust that the plugin works, since our tests run before the compilation step

Good point, but it can be made better. I'm envisioning an intermediate step that does this transform before running tests, and before additional bundling/minification/etc.

More complex solutions are also possible, like importing regex so tests on pre-compilation code run fine, then stripping out the import during compilation. The plugin could be updated to offer an option to do this automatically. The plugin works by actually running regex with the given arguments and values to achieve the transformations, so the output of regex and its Babel plugin are always identical across the same versions.

Security concerns due to the previous point

Tests can be made to work as noted above. But unmaintainable regexes that few understand or can audit offer their own form of security concern, that tests are not a full solution for. The emoji regex fix in #666 offers an illustrative example. Valibot had tests for it, but didn't think to test that e.g. it should not match plain digits or Unicode format/modifier characters. This stemmed from how it wasn't easy to understand exactly what the regex did and did not match (although in this particular case it was also about understanding the complexity of emoji).

In any case, no worries at all if you don't think it's the right fit! Feel free to close this unless you're looking for more input.

slevithan commented 1 month ago

Another example for the fun of it...

Valibot's IP_REGEX (from here) uses the following monstrosity:

/^(?:(?:[1-9]|1\d|2[0-4])?\d|25[0-5])(?:\.(?:(?:[1-9]|1\d|2[0-4])?\d|25[0-5])){3}$|^(?:(?:[\da-f]{1,4}:){7}[\da-f]{1,4}|(?:[\da-f]{1,4}:){1,7}:|(?:[\da-f]{1,4}:){1,6}:[\da-f]{1,4}|(?:[\da-f]{1,4}:){1,5}(?::[\da-f]{1,4}){1,2}|(?:[\da-f]{1,4}:){1,4}(?::[\da-f]{1,4}){1,3}|(?:[\da-f]{1,4}:){1,3}(?::[\da-f]{1,4}){1,4}|(?:[\da-f]{1,4}:){1,2}(?::[\da-f]{1,4}){1,5}|[\da-f]{1,4}:(?::[\da-f]{1,4}){1,6}|:(?:(?::[\da-f]{1,4}){1,7}|:)|fe80:(?::[\da-f]{0,4}){0,4}%[\da-z]+|::(?:f{4}(?::0{1,4})?:)?(?:(?:25[0-5]|(?:2[0-4]|1?\d)?\d)\.){3}(?:25[0-5]|(?:2[0-4]|1?\d)?\d)|(?:[\da-f]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1?\d)?\d)\.){3}(?:25[0-5]|(?:2[0-4]|1?\d)?\d))$/iu;

Here is the same regex if we used the regex package or its Babel plugin to recreate it with a bit of composition:

regex('i')`
  ^( \g<ipv4> | \g<ipv6> )$

  # Define subpatterns for use above
  ( (?<ipv4> \g<byte> ( \. \g<byte>){3} )
    (?<ipv6>
      (\g<hex4>:){7}     \g<hex4>
    | (\g<hex4>:){1,7}  :
    | (\g<hex4>:){1,6}  :\g<hex4>
    | (\g<hex4>:){1,5} (:\g<hex4>){1,2}
    | (\g<hex4>:){1,4} (:\g<hex4>){1,3}
    | (\g<hex4>:){1,3} (:\g<hex4>){1,4}
    | (\g<hex4>:){1,2} (:\g<hex4>){1,5}
    |  \g<hex4>:       (:\g<hex4>){1,6}
    | : ( (:\g<hex4>){1,7} | : )
    # With zone identifier
    | fe80: (:[\da-f]{0,4}){0,4} % [\da-z]+
    # Mixed addresses
    | :: ( ffff ( :0{1,4} )? : )? \g<ipv4>
    | (\g<hex4>:){1,4} : \g<ipv4>
    )
    (?<hex4> [\da-f]{1,4} )
    (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d )
  ){0}
`;

Okay, that's still quite complicated, but the structure is much clearer. This can be edited by mortals, and those edits can be reviewed by mortals.

And what do you know... by making this readable, I can now spot bugs. First, it fails to match some valid mixed IPv6/IPv4 addresses. For example, it doesn't match any of the following valid addresses:

It also thinks the following address is valid, which it is not (which it self-acknowledges by not matching it when the zone identifier isn't present):

Good luck to anyone who wants to fix these bugs in the original version of the regex! Personally, I don't want to touch it. 😖 And I'm in the 99th percentile of developers comfortable with reading and editing complex regexes.

Also, since I can now read this regex, I can see that there are ways to simplify it. I'll avoid doing that here though since I'm already quite off topic.

(Edit: Collecting related existing issues/PRs here since I'd be happy to come back to fixing this regex in the future: #206, #369.)

fabian-hiller commented 1 month ago

Thanks for the detailed answers. I like the benefits and if you have the time, feel free to take care of the setup and create a PR. You can also create a draft PR to get early feedback on any questions you may have.

Would you be interested in becoming Valibot's Head of Regex? 😎 If you are interested, I would contact you to request a review whenever we add or change a regex.

slevithan commented 1 month ago

Would you be interested in becoming Valibot's Head of Regex? 😎 If you are interested, I would contact you to request a review whenever we add or change a regex.

Thanks for the offer and vote of confidence. I would probably say yes if I was an active Valibot user, but since I'm not, I'm not currently able to make the commitment. Feel free to contact me though if you have more specific regex questions.

I like the benefits and if you have the time, feel free to take care of the setup and create a PR. You can also create a draft PR to get early feedback on any questions you may have.

Cool. 😎 If you're okay with leaving this open for a while, I'll try to come back to it and figure out a reasonable setup, probably after first putting better testing in place for the Babel plugin.

fabian-hiller commented 1 month ago

Yes, no rush! Thanks for your contributions! 🙏