colinhacks / zod

TypeScript-first schema validation with static type inference
https://zod.dev
MIT License

Improve regex DX by adding babel-plugin-transform-regex as dev dependency #3716

Open slevithan opened 1 month ago

slevithan commented 1 month ago

What do you think about adding the regex package's Babel plugin to devDependencies? Since Zod uses a lot of complex regexes, this would allow writing them in a readable and maintainable way that gets transpiled away into native JS regex literals.

From regex's readme:

regex is a template tag that extends JavaScript regular expressions with features from other leading regex libraries that make regexes more powerful and dramatically more readable. It returns native RegExp instances that run with native performance, and can exceed the performance of regex literals you'd write yourself. It's also lightweight, has no dependencies, supports all ES2025 regex features, has built-in TypeScript declarations, and can be used as a Babel plugin to avoid any runtime dependencies or user runtime cost.

Highlights include support for free spacing and comments, atomic groups via (?>…) and possessive quantifiers (e.g. ++) that can help you avoid ReDoS, subroutines via \g<name> and subroutine definition groups via (?(DEFINE)…) that enable powerful subpattern composition, and context-aware interpolation of regexes, escaped strings, and partial patterns.

With the regex library, JavaScript steps up as one of the best regex flavors alongside PCRE and Perl, possibly surpassing C++, Java, .NET, Python, and Ruby.

Note that all of regex's syntax is a strict superset of JS, and its syntax extensions work identically in PCRE (the regex library used by PHP and many others), so there is nothing magical or surprising.

This would allow changing e.g. the unreadable/unmaintainable ipv4Regex from src/types.ts:

/^(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$/

To the much nicer:

regex`^
  (?<byte> 25[0-5] | 2[0-4]\d | 1\d\d | [1-9]?\d)
  (\. \g<byte>){3}
$`

That would then get transpiled into a native regex literal (you can try it here), without any added runtime dependency or run-time cost for users.
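For comparison, the same composition can be approximated by hand in plain TypeScript, with no library involved. This is only a sketch of what an expanded pattern could look like; the plugin's actual output may differ, and the names `ipv4Current`, `byte`, and `ipv4Composed` are made up for illustration. It spot-checks that the hand-expanded pattern agrees with the current ipv4Regex on a few sample inputs chosen for this sketch.

```typescript
// The current ipv4Regex, copied from the issue text above.
const ipv4Current =
  /^(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$/;

// Hypothetical hand expansion: the `byte` subpattern is written once as a
// string and reused, which is roughly what a subroutine call expands to.
const byte = String.raw`(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)`;
const ipv4Composed = new RegExp(String.raw`^${byte}(?:\.${byte}){3}$`);

// Sample inputs picked for this sketch (not taken from Zod's test suite).
const samples: string[] = [
  "0.0.0.0",
  "192.168.1.1",
  "255.255.255.255",
  "256.1.1.1",
  "1.2.3",
  "01.2.3.4",
  "1.2.3.4.5",
];

// Collect every sample on which the two patterns give different answers.
const disagreements = samples.filter(
  (s) => ipv4Current.test(s) !== ipv4Composed.test(s)
);

console.log(
  disagreements.length === 0
    ? "patterns agree on all samples"
    : `patterns disagree on: ${disagreements.join(", ")}`
);
```

Agreement on a handful of samples is not a proof of equivalence, but it is the kind of quick check that becomes easy once the subpattern has a name.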

Some of the other regexes in src/types.ts would benefit more significantly. To give one more example, here's the ipv6Regex that's currently used:

/^(([a-f0-9]{1,4}:){7}|::([a-f0-9]{1,4}:){0,6}|([a-f0-9]{1,4}:){1}:([a-f0-9]{1,4}:){0,5}|([a-f0-9]{1,4}:){2}:([a-f0-9]{1,4}:){0,4}|([a-f0-9]{1,4}:){3}:([a-f0-9]{1,4}:){0,3}|([a-f0-9]{1,4}:){4}:([a-f0-9]{1,4}:){0,2}|([a-f0-9]{1,4}:){5}:([a-f0-9]{1,4}:){0,1})([a-f0-9]{1,4}|(((25[0-5])|(2[0-4][0-9])|(1[0-9]{2})|([0-9]{1,2}))\.){3}((25[0-5])|(2[0-4][0-9])|(1[0-9]{2})|([0-9]{1,2})))$/

And here is the same regex refactored using regex, taking advantage of subroutines and a subroutine definition group:

regex`
  ^ \g<ipv6> $

  (?(DEFINE)
    (?<ipv6>
      ( \g<part>{7}
      | :: \g<part>{0,6}
      | \g<part>    : \g<part>{0,5}
      | \g<part>{2} : \g<part>{0,4}
      | \g<part>{3} : \g<part>{0,3}
      | \g<part>{4} : \g<part>{0,2}
      | \g<part>{5} : \g<part>?
      )
      (\g<segment> | \g<ipv4>)
    )
    (?<part>    \g<segment> :)
    (?<segment> [a-f\d]{1,4})
    (?<ipv4>    \g<byte> (\. \g<byte>){3})
    (?<byte>    25[0-5] | 2[0-4]\d | 1\d\d | [1-9]?\d)
  )
`

Written like this, mortals can understand it, spot bugs, and maintain it (e.g. if you wanted to add support for IPv6 zone identifiers), and other mortals can review those changes. The regex literal emitted for this by regex also runs faster, because it avoids all the unnecessary capturing groups in the original (by default, regex implicitly uses flag n or "named capture only" mode).
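To make the capturing-group point concrete, here is a sketch that expands the subroutines by hand into a native pattern using only non-capturing groups, then counts the capturing groups in both versions. The expansion is an assumption about what such output could look like, not the literal that regex actually emits, and `ipv6Expanded` and `captureGroupCount` are hypothetical names. The sample addresses avoid leading zeros in the IPv4 part, where the `byte` subpattern is stricter than the original's `[0-9]{1,2}`.

```typescript
// The current ipv6Regex, copied from the issue text above.
const ipv6Current =
  /^(([a-f0-9]{1,4}:){7}|::([a-f0-9]{1,4}:){0,6}|([a-f0-9]{1,4}:){1}:([a-f0-9]{1,4}:){0,5}|([a-f0-9]{1,4}:){2}:([a-f0-9]{1,4}:){0,4}|([a-f0-9]{1,4}:){3}:([a-f0-9]{1,4}:){0,3}|([a-f0-9]{1,4}:){4}:([a-f0-9]{1,4}:){0,2}|([a-f0-9]{1,4}:){5}:([a-f0-9]{1,4}:){0,1})([a-f0-9]{1,4}|(((25[0-5])|(2[0-4][0-9])|(1[0-9]{2})|([0-9]{1,2}))\.){3}((25[0-5])|(2[0-4][0-9])|(1[0-9]{2})|([0-9]{1,2})))$/;

// Hypothetical hand expansion of the named subpatterns from the refactored
// version, each wrapped in a non-capturing group where grouping is needed.
const byte = String.raw`(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)`;
const ipv4 = String.raw`${byte}(?:\.${byte}){3}`;
const segment = String.raw`[a-f\d]{1,4}`;
const part = `(?:${segment}:)`;

const prefix = [
  `${part}{7}`,
  `::${part}{0,6}`,
  `${part}:${part}{0,5}`,
  `${part}{2}:${part}{0,4}`,
  `${part}{3}:${part}{0,3}`,
  `${part}{4}:${part}{0,2}`,
  `${part}{5}:${part}?`,
].join("|");

const ipv6Expanded = new RegExp(`^(?:${prefix})(?:${segment}|${ipv4})$`);

// Counts capturing groups: adding an empty alternative makes the pattern
// match "", and the match array then has one slot per capturing group.
function captureGroupCount(re: RegExp): number {
  const match = new RegExp(`${re.source}|`).exec("");
  return match === null ? 0 : match.length - 1;
}

// Sample addresses picked for this sketch.
const samples: string[] = [
  "::1",
  "1:2:3:4:5:6:7:8",
  "fe80::1",
  "::ffff:192.168.1.1",
  "not-an-ip",
];
const disagreements = samples.filter(
  (s) => ipv6Current.test(s) !== ipv6Expanded.test(s)
);

console.log("capturing groups (current): ", captureGroupCount(ipv6Current));
console.log("capturing groups (expanded):", captureGroupCount(ipv6Expanded));
console.log("samples with different results:", disagreements.length);
```

Whether fewer capturing groups yields a measurable speedup depends on the engine and the input, so it is worth benchmarking before relying on it.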

To demonstrate that regex readability matters, after rewriting it like this I easily spotted several errors. For example, it doesn't match the following valid addresses:

Also, it thinks the following addresses are valid (they aren't since there should be 6 IPv6 segments rather than 7 in mixed addresses):
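Both kinds of failure can be reproduced with a short script that runs the current ipv6Regex against a few addresses. The addresses below are samples picked for this sketch and are assumptions, not a list taken from this issue; `ipv6Current` and `cases` are hypothetical names.

```typescript
// The current ipv6Regex, copied from the issue text above.
const ipv6Current =
  /^(([a-f0-9]{1,4}:){7}|::([a-f0-9]{1,4}:){0,6}|([a-f0-9]{1,4}:){1}:([a-f0-9]{1,4}:){0,5}|([a-f0-9]{1,4}:){2}:([a-f0-9]{1,4}:){0,4}|([a-f0-9]{1,4}:){3}:([a-f0-9]{1,4}:){0,3}|([a-f0-9]{1,4}:){4}:([a-f0-9]{1,4}:){0,2}|([a-f0-9]{1,4}:){5}:([a-f0-9]{1,4}:){0,1})([a-f0-9]{1,4}|(((25[0-5])|(2[0-4][0-9])|(1[0-9]{2})|([0-9]{1,2}))\.){3}((25[0-5])|(2[0-4][0-9])|(1[0-9]{2})|([0-9]{1,2})))$/;

// Illustrative cases chosen for this sketch, with the expected validity.
const cases: { address: string; valid: boolean }[] = [
  // Valid: the unspecified address. Every alternative requires a trailing
  // segment or IPv4 part, so a bare "::" cannot match.
  { address: "::", valid: true },
  // Valid: "::" standing in for a single zero group after six segments.
  // No alternative allows six leading parts followed by "::".
  { address: "1:2:3:4:5:6::8", valid: true },
  // Invalid: seven IPv6 segments plus an IPv4 tail is more than 128 bits,
  // but the first alternative followed by the IPv4 branch accepts it.
  { address: "1:2:3:4:5:6:7:1.2.3.4", valid: false },
];

// Keep only the cases where the regex result differs from the expectation.
const mismatches = cases.filter(
  (c) => ipv6Current.test(c.address) !== c.valid
);

for (const c of mismatches) {
  console.log(
    `${c.address}: expected ${c.valid ? "valid" : "invalid"}, ` +
      `regex says ${c.valid ? "invalid" : "valid"}`
  );
}
```

In the refactored form, each of these cases maps to a single visible line (the list of `\g<part>` alternatives or the final `\g<segment> | \g<ipv4>` choice), which is what makes them easy to spot.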

Good luck to anyone who wants to fix these bugs in the original version of the regex. Personally, I don't want to touch it. 😖 And I'm in the 99th percentile of developers comfortable with reading and editing complex regexes. Perhaps these issues could have been caught with more tests, but tests are no substitute for readability since being able to understand the regexes helps people know where the gaps might be that need testing, and it allows far more people to spot issues.

If you think this could be helpful, I'd be happy to submit a PR that adds the dev dependency and updates all of the regexes (at least those that would benefit) for readability. My recommendation would be to extract the regexes out of src/types.ts into src/regex.ts, which types.ts would import. Then the Babel plugin would run only on regex.ts.