fabian-hiller / valibot

The modular and type safe schema library for validating structural data 🤖
https://valibot.dev
MIT License
6.32k stars 203 forks source link

Email action does not support punycode domains #882

Open hendrikheil opened 1 month ago

hendrikheil commented 1 month ago

The current email action uses a regex, which does not support punycode domains. Those domains are increasingly popular in countries like Germany, where letters like ä, ö and ü are commonly used, but need to be punycode encoded for them to work properly.

I did some research to find a well-known regex that supports these encodings properly. Luckily, the HTML Living Standard provides a JS compatible regex that passes punycode domains as well.

The regex can be found here: https://html.spec.whatwg.org/multipage/input.html#valid-e-mail-address

fabian-hiller commented 1 month ago

The email action intentionally only validates common email addresses. If you are interested in an action that covers the entire specification, please see issue #204. We can of course still discuss whether to add these characters to the currently used regex, but we are unlikely to replace the regex entirely.

hendrikheil commented 1 month ago

I went ahead and closed my PR - I had somehow missed the previous PRs, sorry! I think it would make sense to support punycode domains in the current email action though. I think adding that would be simple enough and keep the regex maintainable.

I checked how our API actually validates emails and besides DNS, they actually use a parser for RFC compliance. I'm not sure that this is in scope due to the amount of code it would have to add, even when it becomes a new rfcEmail action.

https://github.com/egulias/EmailValidator/blob/4.x/src/EmailParser.php#L13

I'd be happy to add punycode to the current regex, if you think thats reasonable?

fabian-hiller commented 1 month ago

Would you prefer to use a specEmail or rfcEmail action or would it be better to extend the current regex? I just want to be careful with the current regex not to extend it from time to time to the email RFC, as this action is intended to validate only common addresses and I am not sure if punycode domains should be part of that.

hendrikheil commented 1 month ago

I think for us extending the current regex would be good enough for the time being. I think as ascii domains become more scarce over time more and more domains will use punycode in regions where characters that require it are used. One of the products we build also uses a punycode domain. I think it's very frustrating for those users to not pass the validation with their domain.

I think it's great that the current regex is limited to be safe and easy to understand. But I think it's also natural for users to eventually want a more exhaustive validation. It feels like a natural upgrade path to start with the very safe and decently permissive email action and then eventually upgrade to a rfcEmail or specEmail action that is more exhaustive and maybe larger in bundle size.

For us both approaches would work. But I think it makes sense to support puncycode domain in the current regex and also add an rfcEmail action too.

fabian-hiller commented 4 weeks ago

I will do some research and think about it. Do you have statistics on how common punycode domains are? At the moment I think it would be best to stick with the current regex and add a rfcEmail action. But the bigger question is what is the right RFC regex? I have found several:

hendrikheil commented 4 weeks ago

I'll have to do some research in regards to statistics too. Though, I think if we add an action that validates emails according to RFC 5322 we might not want go to with a regex. We could implement a lexer that parses the string instead.

However with valibot having a focus on bundle size, it might not be what we want? In my personal opinion it would be a worthwhile trade-off, considering that most users will likely go with the default email action. A note in the documentation should make the trade-off clear enough, I think.

fabian-hiller commented 4 weeks ago

What is a lexer and why is it preferable to a regex for a rfcEmail action? In terms of bundle size, this shouldn't be a problem since each action is tree-shakable. If you don't use `rfcRegex', it will be automatically removed in the build step of your website or app.

We already documented very clearly via JSDoc and on our website that the email action intentionally only validates common email addresses. Would you still improve it in some way?

hendrikheil commented 4 weeks ago

A lexer would just a little custom parser to process the string - alternative to a regex. I know that through tree-shaking the code would be removed unless you use it. I was mostly concerned that the action itself could end up relatively big depending on what scope we decide for it. If that ends up being the case, I would mostly document the trade-offs between a larger but exhaustive rfcEmail action and a small but limited email action.

I've done some research, and from what I can tell, if we want to support email addresses that are fully compliant with RFC5322 that would mean a regex won't work anymore. Since that RFC allows comments inside addresses, which are potentially nested structures, a regex would be unable to fully support parsing such an email. I'm not an expert on the matter, but it seems to be because the structure makes it a context-free language, which you can't fully parse using a regular language.

Supporting all valid email addresses seems like an impossible task. I found this to be a useful read on my research https://www.regular-expressions.info/email.html. Even though it looks very thorough, it will not support emails such as test@localhost, which are valid and something I believe should be covered by this action.

I also looked at what yup and zod do. Zod in particular seems to have gone through a similar process as I did in my hunt for a good regex: https://github.com/colinhacks/zod/blob/1f4f0dacf313a2dba45563d78171e6f016096925/src/types.ts#L587-L602

Yup on the other hand uses the same regex you found and I had initially proposed: https://github.com/jquense/yup/blob/5a22c16dbba610050e85f123d389ddacaa92a0ad/src/string.ts#L19-L22

With that in mind, I think we should go with a commonly used regex that is ideally too permissive than too restrictive for this particular action. To me https://html.spec.whatwg.org/multipage/input.html#valid-e-mail-address looks like the best trade-off and a good candidate.

fabian-hiller commented 4 weeks ago

Thank you for your research! I will take a closer look at it in the next few weeks. For now, I think I fully agree with your conclusion and we should add a rfcEmail action using the HTML spec email regex. Would you be interested in creating a PR?

hendrikheil commented 4 weeks ago

I'd be happy to contribute the action! Do you think we should still call it rfcEmail now that we aren't complying with any RFC?

Maybe calling it webEmail is a better name for it? rfcEmail could still make sense if we plan on evolving the regex in the future

fabian-hiller commented 4 weeks ago

webEmail is a bit unclear to me. What about specEmail? If the regex is almost compatible with the RFC, rfcEmail is still ok with me.

hendrikheil commented 2 weeks ago

Took me a while to get some time to tackle this. I went with rfcEmail as it does still constitute a mostly compliant regex.

https://github.com/fabian-hiller/valibot/pull/912