commoncrawl / ia-web-commons

Web archiving utility library
Apache License 2.0
9 stars 6 forks source link

Percent-encoded ampersands (&) in URL query string canonicalized incorrectly #31

Open tfmorris opened 1 year ago

tfmorris commented 1 year ago

This bug (https://github.com/internetarchive/surt/issues/20) reported against the Python surt module is also present in the Java implementation.

This URL http://example.com/script?type=a+b+%26+c&grape=wine should generate this SURT com,example)/script?grape=wine&type=a+b+%26+c but instead produces com,example)/script?+c&grape=wine&type=a+b+

tfmorris commented 1 year ago

More generally, the formulation of "percent encoding normalization" as unescape (ie un percent encode) repeatedly and then percent encode one time is problematic. https://github.com/commoncrawl/ia-web-commons/blob/b2be0a5bf8df4f904b9be3fff14f82a7de30de7b/src/main/java/org/archive/url/BasicURLCanonicalizer.java#L43 https://github.com/commoncrawl/ia-web-commons/blob/b2be0a5bf8df4f904b9be3fff14f82a7de30de7b/src/main/java/org/archive/url/BasicURLCanonicalizer.java#L202-L204

Section 6.2.2.2 of RFC 3986 recommends undoing the encoding of any unnecessarily encoded unreserved character, but not all characters. The unreserved characters are

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

The spec says

For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers.

This is a much lighter weight normalization than is currently being done.