NET-A-PORTER / scala-uri

Simple scala library for building and parsing URIs
Other
261 stars 33 forks source link

Parse Errors on invalid URI - A bit to strict? #65

Closed mbseid closed 9 years ago

mbseid commented 10 years ago

When parsing an URI like such: http://localhost:9000/?foo=test&&bar=test. Scala URI throws a parse error. While the URI is technically invalid, I think its a bit strict and should be able to normalize these kind of errors. Lots of websites make that stupid mistake, so as a parser we should be able to handle it.

I'm going to look into the source and see if there is a way I can fix it, but I'm not sure approach you would like me to take.

Let me know what you think

theon commented 10 years ago

Thanks for raising this - it seems reasonable to support those URLs. This should be available in 0.4.3-SNAPSHOT for you to test.

Current behaviour is to consider it as an query parameter with an empty name and empty value. This means the &&are maintained if you parse and then call toString:

scala> val uri = Uri.parse("http://localhost:9000/?foo=test&&bar=test")
uri: com.netaporter.uri.Uri = http://localhost:9000/?foo=test&&bar=test

If you want to remove it you can use uri.removeParams("")

scala> val uri2 = uri.removeParams("")
uri2: com.netaporter.uri.Uri = http://localhost:9000/?foo=test&bar=test

Let me know if that works for you

mbseid commented 10 years ago

Wow, you are truly amazing. Thanks so much for providing such a quick fix for this. I'm going to deploy your new update and let you know how it goes. That behavior is exactly what I what expect and it will work perfectly for me.

Thanks again!

mbseid commented 10 years ago

In production, I see another parse error that may want to get fixed. Any query param with spaces also doesn't get parsed. Here is the bad URL:

localhost:9000/mlb/2014/06/15/david-wrights-slump-continues-why-new-york-mets-franchise-third-baseman-must-be-gone-before-seasons-end/?utm_source=RantSports&utm_medium=HUBRecirculation&utm_term=MLBNew York MetsGrid

I'm going to take a look and see if I can fix this on my own. If you have any tips or comments, let me know.

Thanks,

Mike

theon commented 10 years ago

I think this is another case of something that is technically not a valid URL (spaces should be encoded to %20 or +), but it something we should still support.

I think This URL would work in 0.4.2, but in the latest SNAPSHOT I've made the parser stricter as a result of the changes in #51.

I'm actually thinking I should revert the changes in #51. We should be able to do this, as the root cause of #51 has been fixed in parboiled2:

https://github.com/sirthias/parboiled2/issues/78

sauravshah commented 10 years ago

Adding one more example of strict parsing.

localhost:9000/t?x=y%26

Will this be fixed in 0.4.2?

ov7a commented 10 years ago

Three more cases: http://localhost/offers.xml?&id=10748337&np=1 http://localhost/offers.xml?id=10748337&np=1& http://localhost/offers.xml?id=10748337&np=1&#anchor

JustAHappyKid commented 10 years ago

Any update on this? Will 0.4.3 be released anytime soon? Currently being troubled by one of those URLs with the dangling &...

Thanks. :-)

btd commented 10 years ago

Got the same problem. @theon could you clarify if 0.4.3 will be published? thanks

theon commented 10 years ago

Hi, sorry for the delay.

I have reverted the changes from #51 that made the parser a lot stricter and published 0.4.3. Could you give it a try.

@ov7a @sauravshah @mbseid I added tests for your examples in 47e6353. Some of the example URIs didn't start with http://. These will now parse with 0.4.3 but without the protocol, URIs like localhost/xyz will be parsed as relative URIs with localhost as part of the path. What do you think about this? I am reluctant to change this as that is how they would be treated by a browser if put in a href, for example.

JustAHappyKid commented 10 years ago

@theon: Definitely makes sense to me to behave as browsers do, especially if it's also in line with the standards.

JustAHappyKid commented 10 years ago

And the dangling ampersand use-case seems to be working okay now. E.g., Uri.parse("https://hi.com/some-path?this=that&a=b&") does not lead to nasty exception. Thanks!

ov7a commented 9 years ago

@theon, thanks! I think browser-like behaviour in situations like this is the best one can do.

I brought another example. Originally, this was a link to facebook photo.

https://example.com/path/ms.c.eE~;SoMeThingLongLikeBase64~;MoReOfthis-;a~-~-.id.a.3450750937450/8375385797349587/?type=1&param=1

Invalid URI could not be parsed. 3 rules mismatched at error location:
  _uri / | / _abs_uri / optional / _authority / optional / _userInfo / capture / oneOrMore / ANY
  _uri / | / _abs_uri / optional / _authority / optional / _userInfo / optional / ":" / ':'
  _uri / | / _abs_uri / optional / _authority / optional / _userInfo / "@" / '@'
 at index 128: https://example.com/path/ms.c.eE~;SoMeThingLongLikeBase64~;MoReOfthis-;a~-~-.id.a.3450750937450/8375385797349587/?type=1&param=1
theon commented 9 years ago

@ov7a Thanks for the report. This was cause by the matrix parameter support getting confused. I don't think many people use matrix params, so regret making it compulsory.

I've just published a new version 0.4.4 where matrix param support is now optional and disabled by default (README link above also updated). So if you upgrade to 0.4.4 that URL should now work without you having to make any changes.

theon commented 9 years ago

Forgot to put the issue number in the commit. Commit is c9f632dad4ffa195f5a97b53593fc99e7ba39bab

ov7a commented 9 years ago

@theon, wow! That was super fast! Thank yo very much! It works now.