Open akka-ci opened 8 years ago
Comment by johanandren Wednesday Feb 03, 2016 at 07:03 GMT
The docs of Uri.apply say:
"Parses a valid URI string into a normalized URI reference as defined by http://tools.ietf.org/html/rfc3986#section-4.1. Percent-encoded octets are decoded using the given charset (where specified by the RFC). If strict is false, accepts unencoded visible 7-bit ASCII characters in addition to the RFC"
This means you need to convert the Unicode hostname to ASCII before passing it to apply, or construct your Uri instance manually instead of parsing it, like this: Uri(scheme = "http", authority = Uri.Authority(Uri.NamedHost("президент.рф"))). NamedHost will perform the required Punycode encoding of the IDN (https://en.wikipedia.org/wiki/Internationalized_domain_name).
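For the first option (converting the hostname to ASCII yourself before parsing), a minimal sketch using only the JDK's java.net.IDN could look like the following. The object and method names (IdnHost, toAsciiHost, toUnicodeHost) are made up for illustration and are not part of akka-http.

```scala
import java.net.IDN

// Hypothetical helper, not part of akka-http: wraps the JDK's IDN conversion.
object IdnHost {
  // Converts an internationalized host name to its ASCII ("xn--...") form.
  def toAsciiHost(host: String): String = IDN.toASCII(host)

  // Converts an ASCII ("xn--...") host name back to its Unicode form.
  def toUnicodeHost(host: String): String = IDN.toUnicode(host)
}
```

The ASCII form returned by toAsciiHost can then be placed into the URI string before it is handed to Uri.apply.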
Comment by johanandren Wednesday Feb 03, 2016 at 08:03 GMT
We should discuss if we can improve this when @ktoso is back.
Comment by RomanIakovlev Wednesday Feb 03, 2016 at 11:14 GMT
Thanks, understood. While it works, it is a workaround of sorts: akka.http.scaladsl.model.Uri#apply(input: ParserInput): Uri automatically distinguishes between absolute and relative URIs, and I rely on that. I'm writing a web crawler, and it's nice to just pass the results of org.jsoup.nodes.Element#select("a[href^=\"/\"]") to the aforementioned Uri#apply method and let it figure out whether each one is absolute or relative.
Comment by rkuhn Wednesday Feb 03, 2016 at 11:30 GMT
RFC 3986 is very strict on what is allowed within a URI, so I would conclude that the attribute values extracted by JSoup need to be sanitized before they can be used in this context. Adding that code to Uri.apply() does not seem right to me.
Comment by johanandren Wednesday Feb 03, 2016 at 11:34 GMT
I agree. My thought was that there will be people who want to parse URIs that contain IDN host parts, and maybe we should/could provide a separate way to do that easily.
Comment by RomanIakovlev Wednesday Feb 03, 2016 at 11:36 GMT
@johanandren my thoughts exactly. And it's not only about the hosts, there are non-ASCII characters in other URI components, like paths, in the wild.
Comment by rkuhn Wednesday Feb 03, 2016 at 11:36 GMT
Yes, true. What makes me wonder (in general) is why this punycode thing was even invented, given percent encoding.
Comment by drewhk Wednesday Feb 03, 2016 at 11:59 GMT
Because it is for DNS, and % was a no-go for backwards compatibility (if I understand correctly).
The actual spec for URLs is https://url.spec.whatwg.org/. It allows unescaped UTF-8 Unicode code points, at least in fragments. In the wild there are completely unescaped URLs as well.
I also ran into the same issue.
As mentioned above, the URI RFC 3986 is pretty strict. The IRI RFC 3987, though, provides better internationalization support. Would it be possible to migrate the model to the latter standard?
I've made an attempt in the capturl library to create an IRI model, heavily inspired by the akka-http Uri and also using the parboiled2 parser.
If this is considered relevant, I can try to contribute it into the akka-http project.
@RustedBones, thanks for sharing. I wonder how you would use that model in the context of akka-http? The HTTP spec is also pretty strict about what URIs used in the protocol have to look like. How are IRIs used in the HTTP protocol?
On the HTTP layer, we don't have to use IRIs directly. The conversion can be done internally by akka-http like this:
http://президент.рф/пре: IRI -> http://xn--d1abbgf6aiiy.xn--p1ai/%D0%BF%D1%80%D0%B5: URI
At the moment this conversion must be done by users. For better usability, it would be nice if the akka-http-client accepted IRIs, which are more user-friendly.
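Until something like that exists, the user-side conversion can be sketched with the JDK alone. This is a rough illustration, not how akka-http or capturl do it: IriToUri, encodeNonAscii and toUri are hypothetical names, and the sketch assumes the authority is a bare host name (no userinfo, port or IPv6 literal) and that any ASCII characters in the rest of the IRI are already valid or already percent-encoded.

```scala
import java.net.IDN
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical helper, not part of akka-http.
object IriToUri {
  // Percent-encodes every non-ASCII code point as its UTF-8 octets.
  // ASCII characters (including existing %XX escapes) are left untouched.
  def encodeNonAscii(s: String): String = {
    val sb = new StringBuilder
    var i = 0
    while (i < s.length) {
      val cp = s.codePointAt(i)
      if (cp < 0x80) sb.append(cp.toChar)
      else
        new String(Character.toChars(cp)).getBytes(UTF_8).foreach { b =>
          sb.append('%').append("%02X".format(b & 0xff))
        }
      i += Character.charCount(cp)
    }
    sb.toString
  }

  // Assumption: the authority is just a host name.
  private val Absolute = """^([A-Za-z][A-Za-z0-9+.-]*)://([^/?#]*)(.*)$""".r

  // Absolute IRIs get a Punycode host and a percent-encoded remainder;
  // anything else is treated as a relative reference and only percent-encoded.
  def toUri(iri: String): String = iri match {
    case Absolute(scheme, host, rest) =>
      s"$scheme://${IDN.toASCII(host)}${encodeNonAscii(rest)}"
    case relative =>
      encodeNonAscii(relative)
  }
}
```

The resulting ASCII-only string should then be acceptable to a strict RFC 3986 parser, for both absolute and relative inputs.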
I see. Indeed that would be nice. Could one solution be to "just" offer a new constructor for Uri that can parse IRIs and then instantly convert them to URIs? An alternative could be that the Uri itself holds the content of the IRI (but what to do about the naming then?) and only converts to a URI when rendering (or when specifically asked to do that).
Coming here in 2020, I'm wondering if there is any way using pure Akka to sort this out now, or if we still need another library with IRI support?
Issue by RomanIakovlev Tuesday Feb 02, 2016 at 19:17 GMT Originally opened as https://github.com/akka/akka/issues/19677
Consider this:
"com.typesafe.akka" %% "akka-http-experimental" % "2.4.2-RC1"
Any insights on how to tackle this?