akka / akka-http

The Streaming-first HTTP server/module of Akka
https://doc.akka.io/docs/akka-http

Impossible to parse cyrillic Uri #86

Open akka-ci opened 8 years ago

akka-ci commented 8 years ago

Issue by RomanIakovlev Tuesday Feb 02, 2016 at 19:17 GMT Originally opened as https://github.com/akka/akka/issues/19677


Consider this:

"com.typesafe.akka" %% "akka-http-experimental" % "2.4.2-RC1"

import akka.http.scaladsl.model.Uri

scala> Uri("http://президент.рф/", Uri.ParsingMode.Relaxed)
akka.http.scaladsl.model.IllegalUriException: Illegal URI reference: Invalid input 'п', expected 'EOI', '#', '?', path-abempty or authority (line 1, column 8): http://президент.рф/
       ^
  at akka.http.scaladsl.model.IllegalUriException$.apply(ErrorInfo.scala:40)
  at akka.http.scaladsl.model.Uri$.fail(Uri.scala:741)
  at akka.http.impl.model.parser.UriParser.fail(UriParser.scala:62)
  at akka.http.impl.model.parser.UriParser.parseUriReference(UriParser.scala:33)
  at akka.http.scaladsl.model.Uri$.apply(Uri.scala:209)
  at akka.http.scaladsl.model.Uri$.apply(Uri.scala:199)
  ... 43 elided

Any insights on how to tackle this?

akka-ci commented 8 years ago

Comment by johanandren Wednesday Feb 03, 2016 at 07:03 GMT


The docs of Uri.apply say:

"Parses a valid URI string into a normalized URI reference as defined by http://tools.ietf.org/html/rfc3986#section-4.1. Percent-encoded octets are decoded using the given charset (where specified by the RFC). If strict is false, accepts unencoded visible 7-bit ASCII characters in addition to the RFC"

So this means you need to convert the Unicode hostname to ASCII before passing it to apply, or construct your Uri instance manually instead of parsing it, like this: Uri(scheme = "http", authority = Uri.Authority(Uri.NamedHost("президент.рф"))). NamedHost will perform the required Punycode encoding of the IDN (https://en.wikipedia.org/wiki/Internationalized_domain_name).
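A minimal sketch of the first option (converting the hostname to ASCII before parsing), assuming the JDK's java.net.IDN is acceptable for the conversion. The object and method names are hypothetical and not part of akka-http; only the host is converted here, so this does not cover non-ASCII characters elsewhere in the URI.

```scala
import java.net.IDN

object IdnHostSketch {
  // Hypothetical helper: converts a Unicode hostname to its ASCII (Punycode)
  // form using the JDK's IDN support, not akka-http itself.
  def toAsciiHost(host: String): String = IDN.toASCII(host)

  def main(args: Array[String]): Unit = {
    val asciiHost = toAsciiHost("президент.рф")
    // The resulting string is ASCII-only, which is the kind of input the
    // strict parser expects; it could then be handed to Uri(...).
    println(s"http://$asciiHost/")
  }
}
```

Hostnames that are already ASCII pass through unchanged, so the helper can be applied to every host before parsing.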

akka-ci commented 8 years ago

Comment by johanandren Wednesday Feb 03, 2016 at 08:03 GMT


We should discuss if we can improve this when @ktoso is back.

akka-ci commented 8 years ago

Comment by RomanIakovlev Wednesday Feb 03, 2016 at 11:14 GMT


Thanks, understood. While it works, it is a workaround of sorts: if I use akka.http.scaladsl.model.Uri#apply(input: ParserInput): Uri, it automatically distinguishes between absolute and relative URIs, functionality I rely on. I'm writing a web crawler, and it's nice to just pass the results of org.jsoup.nodes.Element#select("a[href^=\"/\"]") to the aforementioned Uri#apply method and let it figure out whether each one is absolute or relative.

akka-ci commented 8 years ago

Comment by rkuhn Wednesday Feb 03, 2016 at 11:30 GMT


RFC 3986 is very strict on what is allowed within a URI, so I would conclude that the attribute values extracted by JSoup need to be sanitized before they can be used in this context. Adding that code to Uri.apply() does not seem right to me.

akka-ci commented 8 years ago

Comment by johanandren Wednesday Feb 03, 2016 at 11:34 GMT


I agree. My thought was that there will be people who want to parse URIs that contain IDN host parts, and maybe we should/could provide a separate way to do that easily.

akka-ci commented 8 years ago

Comment by RomanIakovlev Wednesday Feb 03, 2016 at 11:36 GMT


@johanandren my thoughts exactly. And it's not only about hosts: in the wild there are non-ASCII characters in other URI components too, such as paths.

akka-ci commented 8 years ago

Comment by rkuhn Wednesday Feb 03, 2016 at 11:36 GMT


Yes, true. What makes me wonder (in general) is why this punycode thing was even invented, given percent encoding.

akka-ci commented 8 years ago

Comment by drewhk Wednesday Feb 03, 2016 at 11:59 GMT


Because it is for DNS, and % was a no-go for backwards compatibility (if I understand correctly).

eiennohito commented 8 years ago

The actual spec for URLs is https://url.spec.whatwg.org/. It allows unescaped UTF-8 Unicode code points, at least in fragments. In the wild there are completely unescaped URLs as well.

RustedBones commented 5 years ago

I also ran into the same issue. As mentioned above, the URI RFC 3986 is pretty strict. The IRI RFC 3987, though, provides better internationalization support. Would it be possible to migrate the model to the latter standard?

I've made an attempt in the capturl library to create an IRI model, heavily inspired by the akka-http Uri and also using the parboiled2 parser.

If this is considered relevant, I can try to contribute it into the akka-http project.

jrudolph commented 5 years ago

@RustedBones, thanks for sharing. I wonder how you would use that model in the context of akka-http? The HTTP spec is also pretty strict about what URIs used in the protocol have to look like. How are IRIs used in the HTTP protocol?

RustedBones commented 5 years ago

On the HTTP layer, we don't have to use IRIs directly. The conversion can be done internally by akka-http like this:

IRI: http://президент.рф/пре -> URI: http://xn--d1abbgf6aiiy.xn--p1ai/%D0%BF%D1%80%D0%B5

At the moment this conversion must be done by users. For better usability, it would be nice if the akka-http client accepted IRIs, which are more user-friendly.
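A rough sketch of such a conversion, using only the JDK rather than akka-http or capturl. It assumes a simple IRI of the form scheme://host[/path] with no userinfo, port, query or fragment; the object and method names are hypothetical. The host is converted with java.net.IDN and the path is percent-encoded byte by byte as UTF-8.

```scala
import java.net.IDN
import java.nio.charset.StandardCharsets

object IriToUriSketch {
  // Characters copied through unchanged in the path: RFC 3986 unreserved and
  // sub-delims, plus ':', '@', '/' and '%' (so existing escapes are not re-encoded).
  private val pathSafe: Set[Char] =
    (('a' to 'z') ++ ('A' to 'Z') ++ ('0' to '9')).toSet ++ "-._~!$&'()*+,;=:@/%".toSet

  // Assumed input shape: scheme://host[/path], no userinfo, port, query or fragment.
  private val SimpleIri =
    """([A-Za-z][A-Za-z0-9+.-]*)://([^/?#:@]+)((?:/[^?#]*)?)""".r

  // Percent-encodes each UTF-8 byte that is not in the path-safe set.
  def encodePath(path: String): String =
    path.getBytes(StandardCharsets.UTF_8).map { b =>
      val c = (b & 0xff).toChar
      if (pathSafe(c)) c.toString else "%%%02X".format(b & 0xff)
    }.mkString

  // Converts the host to Punycode and the path to percent-encoded ASCII.
  def iriToUri(iri: String): String = iri match {
    case SimpleIri(scheme, host, path) =>
      s"$scheme://${IDN.toASCII(host)}${encodePath(path)}"
    case _ =>
      throw new IllegalArgumentException(s"Unsupported IRI shape: $iri")
  }

  def main(args: Array[String]): Unit =
    println(iriToUri("http://президент.рф/пре"))
}
```

Inputs outside the assumed shape are rejected rather than guessed at, since queries, fragments and ports each need their own handling under RFC 3987.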

jrudolph commented 5 years ago

I see. Indeed that would be nice. Could one solution be to "just" offer a new constructor for Uri that parses IRIs and immediately converts them to URIs? An alternative could be that the Uri itself holds the content of the IRI (but what to do about the naming then?) and only converts to a URI when rendering (or when specifically asked to do so).

gaeljw commented 3 years ago

Coming here in 2020, I'm wondering if there is any way using pure Akka to sort this out now, or if we still need another library with IRI support?