bbottema / email-rfc2822-validator

The world's only Java-based rfc2822-compliant email address validator and parser

How to parse i18n characters #9

Closed kdabir closed 4 years ago

kdabir commented 6 years ago

How do I parse the following address, for example?

"ßoµ" <notifications@example.com>

The current version throws an exception while parsing the above email address.

And EmailAddressParser.getAddressParts returns null.

chconnor commented 6 years ago

Non-ascii is forbidden in legitimate email addresses, at least in 'classic' addresses. There are more recent extensions to SMTP that I don't know much about that allow non-ascii in email headers, but AFAIK the standard protocol is still to use RFC 2047 to encode non-ascii as ascii. You seem to have a decoded address, there. So one option is to make sure you aren't decoding the addresses from the raw header before giving it to the validator.
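To make the RFC 2047 point concrete, here is a minimal sketch of what an encoded personal name looks like on the wire. It only uses the JDK, handles a single UTF-8 "B" (Base64) encoded-word, and ignores the length limit on encoded-words; the class and method names are made up for illustration and are not part of this library.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Rfc2047Sketch {

    private static final String PREFIX = "=?UTF-8?B?";
    private static final String SUFFIX = "?=";

    // Encode text as one RFC 2047 "B" (Base64) encoded-word in UTF-8.
    // Assumption: the text is short; long names would need to be split into several words.
    public static String encodeWord(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return PREFIX + Base64.getEncoder().encodeToString(utf8) + SUFFIX;
    }

    // Decode a single UTF-8 "B" encoded-word; any other input is returned unchanged.
    public static String decodeWord(String word) {
        boolean looksEncoded = word.length() >= PREFIX.length() + SUFFIX.length()
                && word.regionMatches(true, 0, PREFIX, 0, PREFIX.length())
                && word.endsWith(SUFFIX);
        if (!looksEncoded) {
            return word;
        }
        String payload = word.substring(PREFIX.length(), word.length() - SUFFIX.length());
        return new String(Base64.getDecoder().decode(payload), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "\u00DFo\u00B5"; // the personal name from the question, written with escapes
        String encoded = encodeWord(name);
        // The encoded-word is pure ASCII, so it can stand in for the personal name in a header.
        System.out.println(encoded + " <notifications@example.com>");
        System.out.println(decodeWord(encoded).equals(name));
    }
}
```

If the upstream API can hand over the raw header in this encoded form, the address stays ASCII-only and the personal name can be decoded separately afterwards.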

But of course you are right: our class should be able to extract the address parts, even if the personal name is invalid per the RFCs.

I don't have time to work on this, personally, but maybe @bbottema can take a look at toughening the parser in these cases.

kdabir commented 6 years ago

@chconnor thanks for explaining. I saw similar behavior with an npm module in Node, so I was guessing that non-ASCII characters are not allowed by the RFC.

However, I am actually getting email addresses like this from an email api, and just wanted to extract the actual address (local + domain) and personal name from the entire address. Seems like no present Java/node library can perfectly do that :(

chconnor commented 6 years ago

Hopefully @bbottema has some time to check it out; shouldn't be hard to catch an appropriate exception and just not-fail when this happens. Or better, I suppose, to check for non-ascii preemptively and behave accordingly. Seems like an increasing number of mail servers are accepting and passing through UTF-8 type characters, so we should be able to handle it.
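A preemptive check along those lines could be as small as the sketch below. It relies only on the JDK's US-ASCII charset encoder; the class and method names are hypothetical, and what to do when the check fails (reject, strip, or route to a lenient path) is left to the caller.

```java
import java.nio.charset.StandardCharsets;

public class AsciiCheckSketch {

    // True when every character of the input can be encoded as 7-bit US-ASCII.
    public static boolean isPureAscii(String s) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String plain = "\"Bob\" <notifications@example.com>";
        String i18n = "\"\u00DFo\u00B5\" <notifications@example.com>"; // the address from the question
        System.out.println(isPureAscii(plain)); // only ASCII characters
        System.out.println(isPureAscii(i18n));  // contains U+00DF and U+00B5
    }
}
```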

bbottema commented 6 years ago

I would love to add extra support for this, but I recently became a father and have my hands full (literally!). Adding non-standardized support isn't exactly at the top of my list currently.

chconnor commented 6 years ago

Oh, sure, pull the father card! :-)

I just took a look at it and it's going to be too complicated (and probably not appropriate) for us to handle non-ascii in addresses. I'd suggest pre-processing your addresses before sending them to our class. A brutal but simple way is to just strip out non-ascii characters. If you know the email address is not null, you can just do:

EmailAddressParser.getAddressParts(emailAddress.replaceAll("[^\\x00-\\x7F]", ""), EmailAddressCriteria.RFC_COMPLIANT, false);

...but that may not be what you want to accomplish, since it may erase the personal name altogether. Actually extracting the Unicode characters would require a significant rewrite of this project, and I will guess that it isn't going to happen any time soon.

I don't know how you're getting these email addresses: it's possible that whoever is sending them to you is decoding them from properly-encoded RFC2047 strings, in which case you could ask them to stop doing that and send you the raw addresses from the email headers.
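If the goal is only to separate the personal name from the address before any validation, a lenient pre-processing step is one option. The sketch below is not RFC 2822 parsing and is not part of this library: it assumes the input has the simple shape of an optional name followed by an address in angle brackets, and the class, method, and pattern are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LenientAddressSplitSketch {

    // Optional display name (quoted or bare), then an address in angle brackets.
    // Assumption: no comments, groups, escaped quotes, or other RFC 2822 constructs.
    private static final Pattern NAME_ADDR =
            Pattern.compile("^\\s*(?:\"([^\"]*)\"|([^<\"]*?))\\s*<([^<>\\s]+)>\\s*$");

    // Returns {personalName, address}, or null when the input does not have that shape.
    public static String[] split(String input) {
        Matcher m = NAME_ADDR.matcher(input);
        if (!m.matches()) {
            return null;
        }
        String name = m.group(1) != null ? m.group(1) : m.group(2);
        return new String[] { name.trim(), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = split("\"\u00DFo\u00B5\" <notifications@example.com>");
        System.out.println(parts[0]); // personal name, non-ASCII characters preserved
        System.out.println(parts[1]); // address part only
    }
}
```

The extracted address part is plain ASCII in the example from this thread, so it can then be passed to the validator on its own while the personal name is kept as-is.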

bbottema commented 6 years ago

Using Normalizer

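import java.text.Normalizer;

// Decompose accented characters into base letter + combining marks (NFD)
string = Normalizer.normalize(string, Normalizer.Form.NFD);
// Option 1: drop everything that is not ASCII
string = string.replaceAll("[^\\p{ASCII}]", "");
// Option 2 (instead of option 1): drop only the combining marks, keeping other Unicode letters
string = string.replaceAll("\\p{M}", "");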

This removes diacritics, but keeps base letters
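A small self-contained sketch of the two variants, to show where they differ. The class and method names are hypothetical. One caveat worth knowing: NFD only splits characters that have a canonical decomposition, so "é" becomes "e" plus a combining accent, while "ß" (U+00DF) and "µ" (U+00B5) from the original question stay as they are.

```java
import java.text.Normalizer;

public class DiacriticsSketch {

    // Decompose (NFD), then drop combining marks only.
    public static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    // Decompose (NFD), then drop every remaining non-ASCII character.
    public static String asciiOnly(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        String accented = "Jos\u00E9";     // "José"
        String original = "\u00DFo\u00B5"; // the personal name from the question
        System.out.println(stripMarks(accented)); // accent removed, base letter kept
        System.out.println(asciiOnly(accented));  // same result for this input
        System.out.println(stripMarks(original)); // unchanged: no combining marks to remove
        System.out.println(asciiOnly(original));  // only the ASCII letter survives
    }
}
```

So for names like the one in this issue, the mark-stripping variant still leaves non-ASCII characters behind, and the ASCII-only variant keeps just the "o".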