Open redsk opened 4 years ago
The problem here is that HostNormalizer uses java.net.IDN, and java.net.IDN.toASCII(u) fails with
java.lang.IllegalArgumentException: java.text.ParseException: An unassigned code point was found in the input
because it only supports IDNA2003. For emoji, IDNA2008 is needed, and the icu4j library [1] supports it:
import com.ibm.icu.text.IDNA
val uts46 = IDNA.getUTS46Instance(IDNA.DEFAULT)
val u = "i❤.ws/" // this is safe for work :)
val punycodedDomain = uts46.nameToASCII(u, new java.lang.StringBuilder(), new IDNA.Info()).toString
// punycodedDomain == "xn--i-7iq.ws/"
Would you be interested in a PR that uses icu4j?
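To make the failure mode described above easier to reproduce, here is a minimal sketch using only java.net.IDN. The object name IdnProbe and the helper safeToAscii are hypothetical and not part of the library under discussion. The emoji used (U+1F600) is an arbitrary stand-in for the ones in the issue, chosen because it was assigned after Unicode 3.2, the version IDNA2003 is based on.

```scala
import java.net.IDN
import scala.util.Try

object IdnProbe {
  // Hypothetical helper: wraps java.net.IDN.toASCII so that a rejected
  // host yields None instead of an IllegalArgumentException.
  def safeToAscii(host: String): Option[String] =
    Try(IDN.toASCII(host)).toOption

  def main(args: Array[String]): Unit = {
    val hosts = Seq(
      "example.com",      // plain ASCII, passes through unchanged
      "b\u00fccher.de",   // non-ASCII letter that IDNA2003 can handle
      "\uD83D\uDE00.ga"   // U+1F600, unassigned in Unicode 3.2
    )
    hosts.foreach { h =>
      safeToAscii(h) match {
        case Some(ascii) => println(s"$h -> $ascii")
        case None        => println(s"$h -> rejected by java.net.IDN")
      }
    }
  }
}
```

Returning an Option makes the rejection explicit to the caller, so a normalizer can decide what to do with such a host rather than ending up with a null one.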
@redsk can you please share your thoughts on the PR that fixes this issue?
Sure, I only noticed it after posting my last comment.
@redsk can you confirm if this is fixed in the 1.0.0-SNAPSHOT build?
@pgalbraith can you provide a link to that build? AFAIK, URLs with emojis should be in the tests now. That said, the problem is more complex, as detailed here. I believe the final solution is in #12
@pgalbraith I can confirm that URLs with emojis are normalized correctly, namely:
πππ.ga -> xn--ir8hb8a.ga
However, it still does not work for www.rößner.de, which is not normalized to the correct, non-transitional www.xn--rner-vna1l.de
but to the incorrect, transitional "http://www.xn--rner-vna1l.de/"
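The transitional versus non-transitional distinction can be explored with icu4j directly. In UTS #46, transitional processing maps ß to "ss", while non-transitional processing keeps it. The sketch below assumes icu4j is on the classpath; Uts46Modes and toAscii are hypothetical names, and what IDNA.DEFAULT does with ß may depend on the ICU version in use.

```scala
import com.ibm.icu.text.IDNA

object Uts46Modes {
  // Hypothetical helper: converts a host with the given UTS #46 options.
  // Returns Left with the reported errors if ICU flags any problem.
  def toAscii(host: String, options: Int): Either[String, String] = {
    val info = new IDNA.Info()
    val out = IDNA.getUTS46Instance(options)
      .nameToASCII(host, new java.lang.StringBuilder(), info)
    if (info.hasErrors) Left(info.getErrors.toString) else Right(out.toString)
  }

  def main(args: Array[String]): Unit = {
    val host = "www.r\u00f6\u00dfner.de" // www.rößner.de
    // Default options, as used in the snippet earlier in this thread.
    println(toAscii(host, IDNA.DEFAULT))
    // Explicitly request non-transitional processing for ToASCII.
    println(toAscii(host, IDNA.NONTRANSITIONAL_TO_ASCII))
  }
}
```

Comparing the two printed results for the same host shows which mode a given normalizer output corresponds to.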
Nowadays, some URLs use emojis (e.g. - careful, NSFW! - πππ.ga). While correctly detected by the library, the normalization nullifies them, resulting in
http://null/