dispatch / reboot

Scala wrapper for the Java AsyncHttpClient.
https://dispatchhttp.org
GNU Lesser General Public License v3.0
426 stars 105 forks source link

Emoji character in domains not recognized correctly (punycode) #231

Open redsk opened 4 years ago

redsk commented 4 years ago
val u = "https://i❤.ws/"
url(u).url

leads to

java.net.URISyntaxException: Expected authority at index 7: http://
  java.net.URI$Parser.fail(URI.java:2848)
  java.net.URI$Parser.failExpecting(URI.java:2854)
  java.net.URI$Parser.parseHierarchical(URI.java:3102)
  java.net.URI$Parser.parse(URI.java:3053)
  java.net.URI.<init>(URI.java:588)
  dispatch.RawUri.toUri(uri.scala:23)
  dispatch.RawUri.toString(uri.scala:24)
  dispatch.url$.$anonfun$apply$3(requests.scala:102)
  dispatch.Req.requestBuilder$1(requests.scala:48)
  dispatch.Req.toRequestBuilder(requests.scala:53)
  dispatch.Req.toRequest(requests.scala:61)
  dispatch.UrlVerbs.url(requests.scala:125)
  dispatch.UrlVerbs.url$(requests.scala:125)
  dispatch.Req.url(requests.scala:17)

Its punycode [1] conversion [2] (https://xn--i-7iq.ws/) works as expected.

https://i❤.ws/ points to a domain registrar and is SFW.

"https://i❤.ws/" visualization when not between backticks.

Opening a new issue because I cannot reopen an issue closed by a collaborator.

[1] https://en.wikipedia.org/wiki/Punycode [2] used converter: https://www.punycoder.com/

redsk commented 4 years ago

In case you're wondering, java.net.IDN.toASCII(u) fails with

java.lang.IllegalArgumentException: java.text.ParseException: An unassigned code point was found in the input

as it only supports IDNA2003. For emoji IDNA2008 is needed and the icu4j library [1] supports it:

import com.ibm.icu.text.IDNA
val uts46 = IDNA.getUTS46Instance(IDNA.DEFAULT)

val u = "i❤.ws/"
val punycodedDomain = uts46.nameToASCII(u, new java.lang.StringBuilder(), new IDNA.Info()).toString
// punycodedDomain == "xn--i-7iq.ws/"

Apparently, nameToASCII works with domain names and paths but not with protocol, that has to be stripped off before the conversion.

[1] https://mvnrepository.com/artifact/com.ibm.icu/icu4j

redsk commented 4 years ago

I've given a more thorough explanation of the same issue here along with a PR. Maybe you'd like to take the same approach.

farmdawgnation commented 4 years ago

icu4j is using the ICU license. This seems to be a tweaked version of the X11 License which is compatible with our own LGPL.

redsk commented 4 years ago

Yeah as you wrote, they use the ICU license which is deemed compatible with GPL. So it should be ok, no?

farmdawgnation commented 4 years ago

Yeah, that's good. I was just making a note here as I worked through figuring out how I want to integrate it. No action needed.

farmdawgnation commented 4 years ago

Ok, this is a lot more complex than I originally thought looking at the bug report. TL;DR there's no good way for us to support this type of encoding from the url helper because we lean heavily on the URL parsing provided by the JVM by default. That, in turn, is based on a regex that doesn't properly support this type of encoding.

The good news is that there's an existing API which will handle this correctly: host:

@ host("i❤.ws").url
res38: String = "http://xn--i-7iq.ws/"

This API works a bit differently because it doesn't accept a full URL, but that's the same thing that makes the IDN conversion work as expected. For example, to get https you would use:

@ host("i❤.ws").secure.url
res40: String = "https://xn--i-7iq.ws/"

In order to support this from the url API we'd need to re-implement breaking up URLs into their component parts for parsing. I'm willing to investigate that, but it's a much larger project because we've got to be sure we don't incidentally break something else.

Does this unblock your use case for the time being?

redsk commented 4 years ago

Well, if I understand correctly, the host API would work only for hosts and not for full URLs which of course would be rather inconvenient as the parsing would be needed to be done by the caller.

My use case is not blocked as I use a different library to detect URLs in strings and the library can do URL normalization, including using ICU (I did the PR), so I simply pass dispatch the normalized URL.

I created this issue because I believe that other users of Dispatch might have the same problem and because I think this library should be able to handle URL normalization properly (not only emojis, also other characters, as illustrated here), which the current url API does not.

farmdawgnation commented 4 years ago

Well, if I understand correctly, the host API would work only for hosts and not for full URLs which of course would be rather inconvenient as the parsing would be needed to be done by the caller.

How inconvenient this is largely depends on the application.

I created this issue because I believe that other users of Dispatch might have the same problem and because I think this library should be able to handle URL normalization properly (not only emojis, also other characters, as illustrated here), which the current url API does not.

Yep, I hear you. I'm not closing the issue, just pointing out that this isn't going to be a quick, drop-in fix like I originally thought. This will take some time to get right.