Open redsk opened 4 years ago
In case you're wondering, java.net.IDN.toASCII(u) fails with
java.lang.IllegalArgumentException: java.text.ParseException: An unassigned code point was found in the input
as it only supports IDNA2003. For emoji, IDNA2008 is needed, and the icu4j library [1] supports it:
import com.ibm.icu.text.IDNA
val uts46 = IDNA.getUTS46Instance(IDNA.DEFAULT)
val u = "i❤.ws/"
val punycodedDomain = uts46.nameToASCII(u, new java.lang.StringBuilder(), new IDNA.Info()).toString
// punycodedDomain == "xn--i-7iq.ws/"
Apparently, nameToASCII works with domain names and paths but not with the protocol, which has to be stripped off before the conversion.
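For anyone hitting the same limitation, here is a minimal sketch of stripping the protocol before the conversion and putting it back afterwards. It uses only the Scala standard library. `SchemeStripping`, `splitScheme`, `convertAfterScheme` and the `toAscii` parameter are hypothetical names for illustration, not part of icu4j or Dispatch. The converter is passed in as a function so that any IDN implementation can be plugged in.

```scala
object SchemeStripping {
  // A scheme is a letter followed by letters, digits, '+', '.' or '-', then "://".
  private val SchemePrefix = "^([A-Za-z][A-Za-z0-9+.-]*://)(.*)$".r

  // Splits a URL into (scheme prefix including "://", remainder).
  // The prefix is empty when the input does not start with a scheme.
  def splitScheme(url: String): (String, String) = url match {
    case SchemePrefix(scheme, rest) => (scheme, rest)
    case _                          => ("", url)
  }

  // Applies `toAscii` (hypothetical parameter) only to the part after the
  // scheme, then re-attaches the scheme unchanged.
  def convertAfterScheme(url: String)(toAscii: String => String): String = {
    val (scheme, rest) = splitScheme(url)
    scheme + toAscii(rest)
  }
}
```

Plugging in the converter from the snippet above would look like `SchemeStripping.convertAfterScheme("https://i❤.ws/")(s => uts46.nameToASCII(s, new java.lang.StringBuilder(), new IDNA.Info()).toString)`.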
icu4j is using the ICU license. This seems to be a tweaked version of the X11 License which is compatible with our own LGPL.
Yeah as you wrote, they use the ICU license which is deemed compatible with GPL. So it should be ok, no?
Yeah, that's good. I was just making a note here as I worked through figuring out how I want to integrate it. No action needed.
Ok, this is a lot more complex than I originally thought looking at the bug report. TL;DR: there's no good way for us to support this type of encoding from the url helper because we lean heavily on the URL parsing provided by the JVM by default. That, in turn, is based on a regex that doesn't properly support this type of encoding.
The good news is that there's an existing API which will handle this correctly, host:
@ host("i❤.ws").url
res38: String = "http://xn--i-7iq.ws/"
This API works a bit differently because it doesn't accept a full URL, but that's the same thing that makes the IDN conversion work as expected. For example, to get https you would use:
@ host("i❤.ws").secure.url
res40: String = "https://xn--i-7iq.ws/"
In order to support this from the url API we'd need to re-implement breaking up URLs into their component parts for parsing. I'm willing to investigate that, but it's a much larger project because we've got to be sure we don't incidentally break something else.
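To give a rough idea of what "breaking up URLs into their component parts" could involve, here is a minimal sketch that separates scheme, host, optional port and the remainder, so that only the host would need to go through the IDN conversion. This is an illustration under simplifying assumptions, not how Dispatch does or would do it: `UrlSplit` and `UrlParts` are hypothetical names, and inputs with userinfo (`user:pass@host`) or IPv6 literals are simply rejected.

```scala
object UrlSplit {
  // Hypothetical holder for the pieces of a URL; not part of Dispatch.
  final case class UrlParts(scheme: String, host: String, port: Option[Int], rest: String)

  // scheme "://" host [":" port] [remainder starting with '/', '?' or '#']
  private val UrlPattern =
    """^([A-Za-z][A-Za-z0-9+.-]*)://([^/?#:]+)(?::(\d{1,5}))?([/?#].*)?$""".r

  // Returns None when the input does not fit the simplified shape above.
  def split(url: String): Option[UrlParts] = url match {
    case UrlPattern(scheme, host, port, rest) =>
      // Unmatched optional groups come back as null, hence the Option(...) wrappers.
      Some(UrlParts(scheme, host, Option(port).map(_.toInt), Option(rest).getOrElse("")))
    case _ => None
  }
}
```

With the parts separated, the host string is the only piece that would be handed to an IDN-aware API, while the scheme, port and remainder stay untouched.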
Does this unblock your use case for the time being?
Well, if I understand correctly, the host API would work only for hosts and not for full URLs, which of course would be rather inconvenient as the parsing would need to be done by the caller.
My use case is not blocked, as I use a different library to detect URLs in strings, and that library can do URL normalization, including using ICU (I did the PR), so I simply pass Dispatch the normalized URL.
I created this issue because I believe that other users of Dispatch might have the same problem and because I think this library should be able to handle URL normalization properly (not only emojis, also other characters, as illustrated here), which the current url API does not.
> Well, if I understand correctly, the host API would work only for hosts and not for full URLs, which of course would be rather inconvenient as the parsing would need to be done by the caller.
How inconvenient this is largely depends on the application.
> I created this issue because I believe that other users of Dispatch might have the same problem and because I think this library should be able to handle URL normalization properly (not only emojis, also other characters, as illustrated here), which the current url API does not.
Yep, I hear you. I'm not closing the issue, just pointing out that this isn't going to be a quick, drop-in fix like I originally thought. This will take some time to get right.
leads to
Its punycode [1] conversion [2] (https://xn--i-7iq.ws/) works as expected.
https://i❤.ws/ points to a domain registrar and is SFW.
"https://i❤.ws/" visualization when not between backticks.
Opening a new issue because I cannot reopen an issue closed by a collaborator.
[1] https://en.wikipedia.org/wiki/Punycode
[2] used converter: https://www.punycoder.com/