Open slevithan opened 4 months ago
Thanks for the idea! I am not sure how to proceed. I can see the advantages, but also some disadvantages.
Those are all reasonable, but I'd make a few points in response.
Developers have to look at the plugin first before they can fix, improve or add a regex
IMO this is minor compared to what you're currently expecting of developers who want to fix or improve the regexes, given the current, more-or-less unmaintainable, giant regexes that offer no accompanying guidance or features that make them self-documenting. regex
doesn't add anything magical or unusual. All of its (small number of) syntax extensions are available and work the same way in PCRE, for example. And developers don't have to use regex
to add new regexes, unless they choose to do so because of the benefits it offers.
We have to trust that the plugin works, since our tests run before the compilation step
Good point, but it can be made better. I'm envisioning an intermediate step that does this transform before running tests, and before additional bundling/minification/etc.
More complex solutions are also possible, like importing regex
so tests on pre-compilation code run fine, then stripping out the import during compilation. The plugin could be updated to offer an option to do this automatically. The plugin works by actually running regex
with the given arguments and values to achieve the transformations, so the output of regex
and its Babel plugin are always identical across the same versions.
Security concerns due to the previous point
Tests can be made to work as noted above. But unmaintainable regexes that few understand or can audit offer their own form of security concern, that tests are not a full solution for. The emoji regex fix in #666 offers an illustrative example. Valibot had tests for it, but didn't think to test that e.g. it should not match plain digits or Unicode format/modifier characters. This stemmed from how it wasn't easy to understand exactly what the regex did and did not match (although in this particular case it was also about understanding the complexity of emoji).
In any case, no worries at all if you don't think it's the right fit! Feel free to close this unless you're looking for more input.
Another example for the fun of it...
Valibot's IP_REGEX
(from here) uses the following monstrosity:
/^(?:(?:[1-9]|1\d|2[0-4])?\d|25[0-5])(?:\.(?:(?:[1-9]|1\d|2[0-4])?\d|25[0-5])){3}$|^(?:(?:[\da-f]{1,4}:){7}[\da-f]{1,4}|(?:[\da-f]{1,4}:){1,7}:|(?:[\da-f]{1,4}:){1,6}:[\da-f]{1,4}|(?:[\da-f]{1,4}:){1,5}(?::[\da-f]{1,4}){1,2}|(?:[\da-f]{1,4}:){1,4}(?::[\da-f]{1,4}){1,3}|(?:[\da-f]{1,4}:){1,3}(?::[\da-f]{1,4}){1,4}|(?:[\da-f]{1,4}:){1,2}(?::[\da-f]{1,4}){1,5}|[\da-f]{1,4}:(?::[\da-f]{1,4}){1,6}|:(?:(?::[\da-f]{1,4}){1,7}|:)|fe80:(?::[\da-f]{0,4}){0,4}%[\da-z]+|::(?:f{4}(?::0{1,4})?:)?(?:(?:25[0-5]|(?:2[0-4]|1?\d)?\d)\.){3}(?:25[0-5]|(?:2[0-4]|1?\d)?\d)|(?:[\da-f]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1?\d)?\d)\.){3}(?:25[0-5]|(?:2[0-4]|1?\d)?\d))$/iu;
Here is the same regex if we used the regex
package or its Babel plugin to recreate it with a bit of composition:
regex('i')`
^( \g<ipv4> | \g<ipv6> )$
# Define subpatterns for use above
( (?<ipv4> \g<byte> ( \. \g<byte>){3} )
(?<ipv6>
(\g<hex4>:){7} \g<hex4>
| (\g<hex4>:){1,7} :
| (\g<hex4>:){1,6} :\g<hex4>
| (\g<hex4>:){1,5} (:\g<hex4>){1,2}
| (\g<hex4>:){1,4} (:\g<hex4>){1,3}
| (\g<hex4>:){1,3} (:\g<hex4>){1,4}
| (\g<hex4>:){1,2} (:\g<hex4>){1,5}
| \g<hex4>: (:\g<hex4>){1,6}
| : ( (:\g<hex4>){1,7} | : )
# With zone identifier
| fe80: (:[\da-f]{0,4}){0,4} % [\da-z]+
# Mixed addresses
| :: ( ffff ( :0{1,4} )? : )? \g<ipv4>
| (\g<hex4>:){1,4} : \g<ipv4>
)
(?<hex4> [\da-f]{1,4} )
(?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d )
){0}
`;
Okay, that's still quite complicated, but the structure is much clearer. This can be edited by mortals, and those edits can be reviewed by mortals.
And what do you know... by making this readable, I can now spot bugs. First, it fails to match some valid mixed IPv6/IPv4 addresses. For example, it doesn't match any of the following valid addresses:
0:0:0:0:0:0:127.32.67.15
0:0:0:0:0:FFFF:127.32.67.15
::1:2:3:4:5:127.32.67.15
It also thinks the following address is valid, which it is not (which it self-acknowledges by not matching it when the zone identifier isn't present):
fe80:::::%1
Good luck to anyone who wants to fix these bugs in the original version of the regex! Personally, I don't want to touch it. 😖 And I'm in the 99th percentile of developers comfortable with reading and editing complex regexes.
Also, since I can now read this regex, I can see that there are ways to simplify it. I'll avoid doing that here though since I'm already quite off topic.
(Edit: Collecting related existing issues/PRs here since I'd be happy to come back to fixing this regex in the future: #206, #369.)
Thanks for the detailed answers. I like the benefits and if you have the time, feel free to take care of the setup and create a PR. You can also create a draft PR to get early feedback on any questions you may have.
Would you be interested in becoming Valibot's Head of Regex? 😎 If you are interested, I would contact you to request a review whenever we add or change a regex.
Would you be interested in becoming Valibot's Head of Regex? 😎 If you are interested, I would contact you to request a review whenever we add or change a regex.
Thanks for the offer and vote of confidence. I would probably say yes if I was an active Valibot user, but since I'm not, I'm not currently able to make the commitment. Feel free to contact me though if you have more specific regex questions.
I like the benefits and if you have the time, feel free to take care of the setup and create a PR. You can also create a draft PR to get early feedback on any questions you may have.
Cool. 😎 If you're okay with leaving this open for a while, I'll try to come back to it and figure out a reasonable setup, probably after first putting better testing in place for the Babel plugin.
Yes, no rush! Thanks for your contributions! 🙏
Thoughts on adding babel-plugin-transform-regex to
devDependencies
?This would allow changing e.g. the unreadable/unmaintainable
IPV4_REGEX
:To the much nicer:
This would then get transpiled into a native regex (you can try it here), without the need for any runtime dependency, any added run-time cost for users, or any interpolation when constructing the regex (which could cause issues with tree-shaking).
This is just an example, and some of the other regexes in
library/src/regex.ts
could benefit more significantly.(This is a follow up from #666, where interpolation was removed from regex changes due to tree-shaking concerns.)