dart-lang / sdk

The Dart SDK, including the VM, dart2js, core libraries, and more.
https://dart.dev
BSD 3-Clause "New" or "Revised" License
9.94k stars 1.53k forks source link

uri.parse doesn't seem to follow url.spec.whatwg.org #29420

Open eseidelGoogle opened 7 years ago

eseidelGoogle commented 7 years ago

This was brought to my attention due to: https://github.com/flutter/flutter/issues/9529

For example: Uri.parse('https://www.google.com/#q=[') throws.

https://url.spec.whatwg.org/#path-state Otherwise, run these steps: If c is not a URL code point and not U+0025 (%), validation error. If c is U+0025 (%) and remaining does not start with two ASCII hex digits, validation error. UTF-8 percent encode c using the path percent-encode set, and append the result to buffer.

which my understanding is validation error doesn't actually cause the parse to fail: "A validation error does not mean that the parser terminates. Termination of a parser is always stated explicitly, e.g., through a return statement."

It looks like if we wanted to conform, there are tests: https://github.com/w3c/web-platform-tests/tree/master/url

We would just need to write a dart-based harness for running them.

rmacnak-google commented 7 years ago

@lrhn

lrhn commented 7 years ago

We are trying to follow RFC 3986, while being more lenient than required (escaping some invalid characters, but not all) and performing some of the possible normalizations. We are currently not trying to match the whatwg version. In this case, [ is a gen-delim token, which is not valid in the query part of an URI. Because it's a gen-delim token, the parser does not allow it.

We should consider whether the whatwg parsing algorithm is worth adapting to. It's likely to be what some users will expect (as this bug testifies to), but we won't match all parts of it (for example, we don't have blobs).

eseidelGoogle commented 7 years ago

I would expect user expectations have changed a bit in the 12 years since RFC 3986. :) That said, this isn't a particularly high priority bug. But next time we do go looking at the url library the whatwg had a ton of tests for making it easy to conform to its url spec.

welldone217 commented 7 years ago

It could be fixed ?

eseidelGoogle commented 7 years ago

For no good reason (other than that I used to do this kind of thing as my day job) I wrote a test harness to see how well Dart's Uri.parse passes WHATWG's suite: https://github.com/eseidelGoogle/dart_url_tests

The major difference is that Uri.parse is way more strict (throws exceptions) where WHATWG's seems to try real hard to always give you an answer even for "invalid" urls.

WHATWG also expects something like the Uri class to expose more of the component details, which as far as I can tell Dart's Uri class doesn't currently.

Regardless, this remains a low priority bug IMO, I just did this for fun tonight. :)

eseidelGoogle commented 7 years ago

Here are the results for the curious: https://gist.github.com/eseidelGoogle/99681c9b362596f2b45427d16896864e

Caveats: There are also likely bugs in my harness. Certainly I'm not checking everything the tests expect me to due to limited component accessors on Uri. Also, I've not compared Chrome or Firefox or other Url implementations to know if anyone actually conforms to that suite.

lrhn commented 6 years ago

Our current plan for Dart 2 is to change the Uri parser to match the WhatWG URL parser. This is a breaking change if anyone depends on the current behavior (including the automatic normalization), but it has two advantages. First, it allows more URIs to be parsed (as this bug documents), and it it allows Dart compiled to a browser to leverage the built-in functionality, which should improve performance and code-size). The WhatWG URL parser only recognizes the authority section of "special" URI schemes (http, etc), so we'll inherit that restriction. Hopefully we can work around the remaining restrictions.

We do expose one property less than the WhatWG URL (we don't separate username and password, exposing both as just userInfo), but we should match the rest.

natebosch commented 5 years ago

Is this still planned? Will it need to wait for Dart 3?

illia-romanenko commented 5 years ago

Apparently due to that issue it's hard to implement Google's encoded polyline https://developers.google.com/maps/documentation/utilities/polylinealgorithm as it might produce "[" etc...

lrhn commented 5 years ago

You'll probably have to wait for Dart 3, or us adding an alternative to Uri in some way. Breaking changes are hard.

As for [, the Uri parser is now more permissive and allows [, ] and # in more places. Still, characters have meaning, so you should escape random text when putting it into a URI, not just assume that it will work. If you use someUri.resolveUri(Uri(query: polyline)), it will allow any character in the polyline argument, and escape what needs to be escaped. Never build URIs by string concatenation unless you undestand and satsify the URI grammar. Example:

main() {
  var uri = Uri.parse("http://example.org/whatnot");
  var v = uri.resolveUri(new Uri(query: "\x00\xff[]=&#/\\:?"));
  print(v);
  print(Uri.decodeQueryComponent(v.query));
}
lrhn commented 1 year ago

More interest has been shown on making the Dart Uri class work more like the WhatWG URL standard, because sometimes web users get tripped up by the differences.

We should consider whether it's viable to change the Uri behavior. It'll likely be a breaking change in some regards, but also one that might reduce the deployment size of web code, if it can use the browser code instead of shipping our own version. The biggest risk is likely that the WhatWG URL does not canonicalize the same way the Dart Uri does (it does some things for HTTP/HTTPS URIs, but not, e.g., escape case normaliation). The Dart tools rely on comparing URIs for equality to decide whether two URIs refer to the same library.

If we want to change the Dart Uri class to match the WhatWG URL behavior, we should definitely not implement it ourselves in Dart. Even just manually keeping up with updates on the "public suffix list" is a fool's errand.

Getting the functionality from Cronet would be optimal where it's available.