PuerkitoBio / purell

tiny Go library to normalize URLs
BSD 3-Clause "New" or "Revised" License
468 stars 59 forks source link

Return to using stdlib's URL escaping #14

Open ghost opened 9 years ago

ghost commented 9 years ago

Issue 5684 is now closed, so it might be time to revert PR 9.

mna commented 9 years ago

Yeah, it is now fixed in Go1.5 but it's unlikely people will all upgrade that fast, so I'd be tempted to leave it in for a while, at least until Go1.6, maybe even another release after that. I think a conditional build tag could be used, I may take a look at that when I have a moment, so I'll leave that open.

Thanks for the heads up!

beeker1121 commented 7 years ago

I tried making this change locally (returning u.String() in the NormalizeURL() method and using url.QueryEscape in the sortQuery() method), however quite a bit of the tests fail after doing so.

Should this be expected, and the tests just need to be fixed? I'm just unsure which package (net/url or urlesc) is handling everything correctly, as per the RFC stuff.

Here's the output of a portion of the tests:

purell_test.go:715: running LowerScheme...
purell_test.go:715: running LowerScheme2...
purell_test.go:715: running LowerHost...
purell_test.go:715: running UpperEscapes...
purell_test.go:731: UpperEscapes - FAIL expected 'http://www.whatever.com/Some%AA%20Special%8Ecases/', got 'http://www.whatever.com/Some%aa%20Special%8Ecases/'
purell_test.go:715: running UnnecessaryEscapes...
purell_test.go:731: UnnecessaryEscapes - FAIL expected 'http://www.toto.com/AB.D/23R-/_~', got 'http://www.toto.com/%41%42%2E%44/%32%33%52%2D/%5f%7E'
purell_test.go:715: running RemoveDefaultPort...
purell_test.go:715: running RemoveDefaultPort2...
purell_test.go:715: running RemoveDefaultPort3...
purell_test.go:715: running Safe...
purell_test.go:731: Safe - FAIL expected 'http://www.src.ca/to%1Ato%8B%EE/OKnowABC~', got 'http://www.src.ca/to%1ato%8b%ee/OKnow%41%42%43%7e'
purell_test.go:715: running BothLower...
purell_test.go:731: BothLower - FAIL expected 'http://www.src.ca:80/to%1Ato%8B%EE/OKnowABC~', got 'http://www.src.ca:80/to%1ato%8b%ee/OKnow%41%42%43%7e'
purell_test.go:715: running RemoveTrailingSlash...
purell_test.go:715: running RemoveTrailingSlash2...
purell_test.go:715: running RemoveTrailingSlash3...
purell_test.go:715: running AddTrailingSlash...
purell_test.go:731: AddTrailingSlash - FAIL expected 'http://www.SRC.ca:80/', got 'http://www.SRC.ca:80%2F'

Edit: After researching more into it, it appears that this commit https://github.com/golang/go/commit/874a605af0764a8f340c3de65406963f514e21bc changed how the String() method decides how to encode the Path field. A new RawPath field was introduced in this commit, and this field is used to compare against the Path field. If RawPath is a valid encoding of Path, it simply uses it instead of Path.

One fix that resolves most of the test errors is just setting u.RawPath = u.Path before calling u.String() (if taking urlesc package out of the picture). However, while this solves all of the main tests, the TestDecodeUnnecessaryEscapesAll and TestEncodeNecessaryEscapesAll tests fail.

While the standard String() method gets the majority of the escaping correct after doing the path fix, it still does not handle the !, ', (, ), [, ], and * characters as we would probably want. I believe this is due to Go trying to follow the newest RFC specs. RFC 2396 labels these characters (marks) as unreserved (except [ and ] which most browsers accept, so we include), while RFC 3986 does not allow them anymore.

So, what do you think would be best? AFAIK, most browsers and other clients will still allow the unreserved characters labeled in RFC 2396, which the current urlesc package handles escaping correctly. Maybe it's best to keep using that package and forget about the standard lib for now?

mna commented 7 years ago

Yeah, I did give it a shot a while ago and ran into those issues but I didn't dig deeper at that moment, sorry I should've documented it a bit here. Thanks for looking into it, I think that regardless of what browsers support, the change should try to maintain behaviour with what purell does now.

If it's down to only those few characters, maybe it could be done in purell itself, modifying the *url.URL struct directly, so we could get rid of the external dep.

AlekSi commented 7 years ago

Would like to see the latest comment implemented.

mna commented 7 years ago

Me too! I'd be happy to accept a PR that implements this and makes all tests pass.