Percent-encoded ampersands (&) in URL query string canonicalized incorrectly

More generally, the formulation of "percent encoding normalization" as unescape (ie un percent encode) repeatedly and then percent encode one time is problematic. https://github.com/commoncrawl/ia-web-commons/blob/b2be0a5bf8df4f904b9be3fff14f82a7de30de7b/src/main/java/org/archive/url/BasicURLCanonicalizer.java#L43 https://github.com/commoncrawl/ia-web-commons/blob/b2be0a5bf8df4f904b9be3fff14f82a7de30de7b/src/main/java/org/archive/url/BasicURLCanonicalizer.java#L202-L204

Section 6.2.2.2 of RFC 3986 recommends undoing the encoding of any unnecessarily encoded unreserved character, but not all characters. The unreserved characters are

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

The spec says

For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers.

This is a much lighter weight normalization than is currently being done.

commoncrawl / ia-web-commons

Percent-encoded ampersands (&) in URL query string canonicalized incorrectly #31