Open tfmorris opened 1 year ago
More generally, the formulation of "percent encoding normalization" as unescape (ie un percent encode) repeatedly and then percent encode one time is problematic. https://github.com/commoncrawl/ia-web-commons/blob/b2be0a5bf8df4f904b9be3fff14f82a7de30de7b/src/main/java/org/archive/url/BasicURLCanonicalizer.java#L43 https://github.com/commoncrawl/ia-web-commons/blob/b2be0a5bf8df4f904b9be3fff14f82a7de30de7b/src/main/java/org/archive/url/BasicURLCanonicalizer.java#L202-L204
Section 6.2.2.2 of RFC 3986 recommends undoing the encoding of any unnecessarily encoded unreserved character, but not all characters. The unreserved characters are
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
The spec says
For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers.
This is a much lighter weight normalization than is currently being done.
This bug (https://github.com/internetarchive/surt/issues/20) reported against the Python
surt
module is also present in the Java implementation.This URL
http://example.com/script?type=a+b+%26+c&grape=wine
should generate this SURTcom,example)/script?grape=wine&type=a+b+%26+c
but instead producescom,example)/script?+c&grape=wine&type=a+b+