NET-A-PORTER / scala-uri

Simple scala library for building and parsing URIs

A URISyntaxException is thrown parsing the following url: https://krownlab.com/products/hardware-systems/baldur/#baldur-top-mount#1 #114

Closed johann-abraham closed 7 years ago

johann-abraham commented 8 years ago
scala> import com.netaporter.uri.Uri
import com.netaporter.uri.Uri

scala> Uri.parse("https://example.com/products/hardware-systems/blah/#blah-top-mount#1")
java.net.URISyntaxException: Invalid URI could not be parsed. Vector(RuleTrace(List(NonTerminal(Named(_uri),-66), NonTerminal(RuleCall,-66), NonTerminal(Sequence,-66), NonTerminal(FirstOf,-66), NonTerminal(Named(_abs_uri),-66), NonTerminal(RuleCall,-66), NonTerminal(Sequence,-66), NonTerminal(Optional,-15), NonTerminal(Named(_fragment),-15), NonTerminal(RuleCall,-15), NonTerminal(Sequence,-15), NonTerminal(Capture,-14), NonTerminal(ZeroOrMore,-14), NonTerminal(Sequence,0)),NotPredicate(Terminal(AnyOf(#)),1)), RuleTrace(List(NonTerminal(Named(_uri),-66), NonTerminal(RuleCall,-66), NonTerminal(Sequence,-66)),CharMatch(￿))) at index 66: https://example.com/products/hardware-systems/blah/#blah-top-mount#1
  at com.netaporter.uri.parsing.UriParser$.parse(UriParser.scala:67)
  at com.netaporter.uri.Uri$.parse(Uri.scala:303)
  ... 43 elided

evanbennett commented 8 years ago

According to RFC 3986, Section 3 (Syntax Components), a URI may contain at most one '#', which marks the start of the fragment. The fragment itself is not permitted to contain a '#'.
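To see the same restriction outside scala-uri, here is a minimal sketch using only the JDK's java.net.URI. That class follows the older RFC 2396 rather than RFC 3986, but it likewise rejects a '#' inside the fragment. The object and method names (StrictFragmentCheck, parsesStrictly) are hypothetical and exist only for this illustration.

```scala
import java.net.{URI, URISyntaxException}
import scala.util.{Failure, Success, Try}

// Hypothetical helper for illustration; not part of scala-uri.
object StrictFragmentCheck {
  // Returns true when the JDK's strict parser accepts the input,
  // false when it rejects it with a URISyntaxException.
  def parsesStrictly(raw: String): Boolean =
    Try(new URI(raw)) match {
      case Success(_)                     => true
      case Failure(_: URISyntaxException) => false
      case Failure(other)                 => throw other // unexpected: rethrow
    }
}
```

With a single '#' the URL from the report should be accepted, while the variant with a second '#' should be rejected.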

christoph-buente commented 7 years ago

Hi @evanbennett,

Thanks for clarifying the background. However, we also use the Net-a-porter library to parse URLs that come from access logs, and we run into the same error because those URLs really do contain a second fragment separator. Most browsers seem to cope with this perfectly well. Is there a way to configure a less restrictive mode that parses such URLs nonetheless?

christoph-buente commented 7 years ago

Thx @theon, that was lightning fast.

theon commented 7 years ago

@christoph-buente, np!

I have made the parsing of fragments more permissive, so it should now successfully parse these URLs. The change is published in version 0.4.16 of scala-uri. Give it a try and let me know if it works as expected. (It may take a couple of hours to reach Maven Central, but it is already available at https://oss.sonatype.org/content/repositories/releases/com/netaporter/scala-uri_2.11/0.4.16/)

The second # is treated as part of the fragment, so it will be URL-encoded to %23 when you call .toString on the URL. For example:

https://krownlab.com/products/hardware-systems/baldur/#baldur-top-mount#1

will become the valid URL:

https://krownlab.com/products/hardware-systems/baldur/#baldur-top-mount%231
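The transformation described above can be approximated without the library. The following is only a rough sketch of the idea, not scala-uri's actual implementation: it assumes the first '#' starts the fragment and percent-encodes any further '#' inside it. FragmentSketch and encodeExtraHashes are hypothetical names.

```scala
// Hypothetical helper for illustration; scala-uri's real parser and
// encoder handle far more than this single case.
object FragmentSketch {
  def encodeExtraHashes(raw: String): String = {
    val idx = raw.indexOf('#')
    if (idx < 0) raw // no fragment at all: nothing to do
    else {
      val beforeFragment = raw.substring(0, idx)
      val fragment       = raw.substring(idx + 1)
      // Any further '#' belongs to the fragment and is percent-encoded.
      beforeFragment + "#" + fragment.replace("#", "%23")
    }
  }
}
```

Applied to the URL from this issue, the helper keeps the first '#' as the fragment separator and rewrites the second one as %23.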

christoph-buente commented 7 years ago

Thx a million, @theon.