RohanNagar / jmail

A modern and lightweight library for working with email addresses in Java
https://www.rohannagar.com/jmail
MIT License

Add ability to get the ASCII only version of an Email #149

Open RohanNagar opened 1 year ago

RohanNagar commented 1 year ago

Please describe the feature that you are requesting

Some email addresses have internationalized domain names. Mail servers are supposed to handle these by converting them to their ASCII equivalent, but some old mail servers may not be doing this. Therefore, it would be helpful if JMail could provide a way to get the ASCII equivalent of a parsed email address.

The goal is to create a method on the Email object that would return an ASCII-only version, like so:

Email parsed = JMail.validator().tryParse("test@faß.de").get();

String asciiOnly = parsed.toAscii();

Additional context

bbottema/simple-java-mail#463

https://gist.github.com/JamesBoon/feeb7428b3558d581c0459f7302bd9a5

Note that the java.net.IDN.toASCII() method uses an out-of-date standard, IDNA2003. We need to implement the latest standard, IDNA2008.
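To make the IDNA2003 concern concrete, here is a minimal sketch of what the JDK's built-in conversion does with the example domain from this issue. The class name Idna2003Demo and the helper jdkToAscii are hypothetical names used only for illustration; java.net.IDN.toASCII is the real JDK call. Non-ASCII characters are written as Unicode escapes so the snippet compiles regardless of source encoding.

```java
import java.net.IDN;

final class Idna2003Demo {
    // The domain "faß.de", with the sharp s written as a Unicode escape.
    static final String SHARP_S_DOMAIN = "fa\u00df.de";

    // Thin wrapper around the JDK's IDNA2003-based conversion.
    static String jdkToAscii(String domain) {
        return IDN.toASCII(domain);
    }

    public static void main(String[] args) {
        // IDNA2003 (Nameprep) maps the sharp s to "ss" before encoding,
        // so the label becomes plain ASCII and no Punycode is produced.
        System.out.println(jdkToAscii(SHARP_S_DOMAIN)); // prints fass.de

        // Characters that are not mapped away are Punycode-encoded as usual.
        System.out.println(jdkToAscii("b\u00fccher.de")); // prints xn--bcher-kva.de
    }
}
```

Under IDNA2008 the sharp s is kept as a distinct character, which is why this thread elsewhere gives xn--fa-hia.de as the ASCII form of faß.de. With the JDK method the two standards can therefore yield different domains for the same input.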

JamesBoon commented 1 year ago

That feature would be a great enhancement!

I am just not sure what would be the right approach. There is the great ICU4J library, but it will add 14MB of dependencies. And if you are on Android, the required com.ibm.icu.text.IDNA is already available as android.icu.text.IDNA.

Maybe it could be an optional (non-transitive) dependency? (And perhaps fall back to java.net.IDN if it is not present.)
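One way the optional-dependency idea could look is sketched below. This is not JMail code: the class DomainAsciiConverter and its methods are hypothetical, and reflection is just one assumed way of avoiding a compile-time dependency on ICU4J. The ICU4J members it looks up (IDNA.getUTS46Instance, IDNA.NONTRANSITIONAL_TO_ASCII, IDNA.nameToASCII, IDNA.Info.hasErrors) are part of ICU4J's public API. When ICU4J is not on the classpath, the sketch falls back to java.net.IDN, with its IDNA2003 behavior.

```java
import java.lang.reflect.Method;
import java.net.IDN;

final class DomainAsciiConverter {
    private static final String ICU_IDNA = "com.ibm.icu.text.IDNA";

    // True when ICU4J can be loaded at runtime.
    static boolean icuAvailable() {
        try {
            Class.forName(ICU_IDNA);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Prefer ICU4J's UTS #46 processing; otherwise use the JDK (IDNA2003).
    static String toAscii(String domain) {
        if (icuAvailable()) {
            try {
                return viaIcu(domain);
            } catch (ReflectiveOperationException e) {
                // ICU4J is present but unusable; fall through to the JDK.
            }
        }
        return IDN.toASCII(domain);
    }

    // Calls ICU4J reflectively so this class compiles without it.
    private static String viaIcu(String domain) throws ReflectiveOperationException {
        Class<?> idna = Class.forName(ICU_IDNA);
        Class<?> infoType = Class.forName(ICU_IDNA + "$Info");
        int options = idna.getField("NONTRANSITIONAL_TO_ASCII").getInt(null);
        Object uts46 = idna.getMethod("getUTS46Instance", int.class).invoke(null, options);
        Object info = infoType.getConstructor().newInstance();
        Method nameToAscii =
            idna.getMethod("nameToASCII", CharSequence.class, StringBuilder.class, infoType);
        StringBuilder out =
            (StringBuilder) nameToAscii.invoke(uts46, domain, new StringBuilder(), info);
        boolean hasErrors = (Boolean) infoType.getMethod("hasErrors").invoke(info);
        if (hasErrors) {
            throw new IllegalArgumentException("Not a valid domain name: " + domain);
        }
        return out.toString();
    }
}
```

A caller would use DomainAsciiConverter.toAscii(domain) and get the same result on both paths for most names. Inputs such as faß.de are the exception, because the two code paths follow different standards, so a real implementation would probably want to document or expose which path was taken.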

This issue at okhttp may be of interest: https://github.com/square/okhttp/issues/6910

RohanNagar commented 1 year ago

Thanks for the additional details! Adding the ICU4J library as an optional dependency might be a good first step.

Even better would be to implement the toAscii method ourselves. I was taking a look at the source code for ICU4J and it doesn't look too bad. There would even be some things we could remove (for example ICU4J checks for some invalid domains that have parts that start with hyphens, but JMail would have already checked for those).

arnt commented 1 year ago

FWIW, such a function would have both advantages and disadvantages.

Advantages: It would help with sending mail to some addresses, I don't know how many. I suspect few. In my experience (I work with this) most of the users use non-ASCII localparts. Turning faß@faß.de into faß@xn--fa-hia.de doesn't help with anything.

Disadvantages: Some servers, Microsoft Exchange is the most prominent but far from the only one, don't handle the ASCII form while searching. If you use Exchange and search for faß, messages containing test@xn--fa-hia aren't returned. This isn't an easy bug to fix due to interactions with PGP, S/MIME and perhaps DKIM, and Microsoft has said they won't even try to fix it. Exchange isn't the only one with this problem, so conversion to ASCII would be a bit of a footgun feature. Likely to lead users into trouble.

RohanNagar commented 1 year ago

@arnt thank you for chiming in with the additional information, this is very useful.

Regarding non-ASCII local-parts: would it then be more beneficial to allow for converting the entire address to ASCII, both the local-part and the domain?

Regarding mail servers not handling the ASCII form in searches - I can see how this might introduce some confusion. I think the intention of JMail is to make working with email addresses easier, so I'm kind of torn on this since I can see some situations where having the ASCII conversion would be useful and some where it might cause confusion. Maybe adding some of these details into a Javadoc would help with potential confusion.

arnt commented 1 year ago

Hi,

having an ASCII representation would certainly make some things simpler, but it doesn't exist.

Converting the localpart to ASCII isn't possible. Converting to ASCII poses some very big problems, and the main goal of the project is to cater to people who can read and write but don't know the Latin alphabet. When the problems turned out to be big, the people who were working on the project decided to drop that feature. This is why RFC 5504 was deprecated.

An example of the kind of problem: Some scripts are written left-to-right and others right-to-left. Email addresses are unambiguously readable as long as both localpart and domain are written in the same direction. But if you use left-to-right ASCII for the localpart and right-to-left for the domain, both parts should be displayed on the same side of the middle @ sign, creating an exciting variety of usability, security and readability problems.
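Given that only the domain has an ASCII form, a conversion helper could refuse addresses whose localpart is not already ASCII. The sketch below illustrates that design under stated assumptions: AsciiAddressSketch, isAscii and toAsciiAddress are hypothetical names, not JMail API, the address is split naively at the last @ sign instead of being parsed, and java.net.IDN (IDNA2003) stands in for whichever IDNA implementation would actually be chosen.

```java
import java.net.IDN;
import java.util.Optional;

final class AsciiAddressSketch {
    // True when every character is in the 7-bit ASCII range.
    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 128);
    }

    // Returns the address with an ASCII domain, or empty when no
    // ASCII form exists (non-ASCII localpart or unconvertible domain).
    static Optional<String> toAsciiAddress(String address) {
        int at = address.lastIndexOf('@');
        if (at <= 0 || at == address.length() - 1) {
            return Optional.empty(); // no localpart or no domain
        }
        String local = address.substring(0, at);
        String domain = address.substring(at + 1);
        if (!isAscii(local)) {
            return Optional.empty(); // the localpart cannot be converted
        }
        try {
            return Optional.of(local + "@" + IDN.toASCII(domain));
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // domain rejected by the IDNA conversion
        }
    }
}
```

Returning an empty Optional rather than a partially converted string keeps callers from treating an address like faß@xn--fa-hia.de as ASCII-safe when it is not.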