commoncrawl / ia-web-commons

Web archiving utility library
Apache License 2.0

Fix URL canonicalization to handle non-UTF-8 encoded characters. Fixes #6 #28

Open tfmorris opened 1 year ago

tfmorris commented 1 year ago

Fixes #6

Fixes an issue where percent signs (%) were double-escaped in hex-encoded characters that use an encoding other than UTF-8.
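To illustrate the failure mode, here is a minimal sketch of an escaping step that leaves an existing `%XX` triplet untouched instead of rewriting its `%` as `%25`. This is not the code in this patch: the class `PercentEscapeSketch` and the method `escapeOnce` are hypothetical names, and the set of characters treated as unsafe (control characters, space, `#`, and anything non-ASCII) is an assumption chosen for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the ia-web-commons implementation.
class PercentEscapeSketch {

    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9')
            || (c >= 'a' && c <= 'f')
            || (c >= 'A' && c <= 'F');
    }

    /**
     * Escapes unsafe characters, but copies an existing %XX triplet through
     * verbatim, whatever byte encoding it represents.
     */
    static String escapeOnce(String in) {
        StringBuilder out = new StringBuilder(in.length());
        int i = 0;
        while (i < in.length()) {
            int cp = in.codePointAt(i);
            if (cp == '%') {
                boolean isTriplet = i + 2 < in.length()
                    && isHex(in.charAt(i + 1))
                    && isHex(in.charAt(i + 2));
                if (isTriplet) {
                    // Already escaped (e.g. a Latin-1 byte such as %e9): keep as is.
                    out.append(in, i, i + 3);
                    i += 3;
                    continue;
                }
                // A bare percent sign is the only case that becomes %25.
                out.append("%25");
            } else if (cp > 0x20 && cp < 0x7f && cp != '#') {
                // Printable ASCII (assumed safe for this sketch).
                out.append((char) cp);
            } else {
                // Everything else is escaped as its UTF-8 bytes.
                byte[] bytes = new String(Character.toChars(cp))
                    .getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    out.append(String.format("%%%02X", b & 0xff));
                }
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With this sketch, `escapeOnce("http://example.com/caf%e9")` returns the input unchanged, whereas an escaper that treats every `%` as unsafe would produce `caf%25e9`.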

A separate issue remains: the hex digits in escapes are emitted in lower case rather than the upper case recommended by both the Google canonicalization guidelines (V2) and RFC 3986. This patch does NOT address it.
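For reference, the case normalization that is out of scope here could look roughly like the sketch below. It is only an illustration of the idea, not part of this patch; `HexCaseSketch` and `upperCaseEscapes` are hypothetical names.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of this patch.
class HexCaseSketch {

    // Matches one percent-encoded byte, e.g. %e9 or %C3.
    private static final Pattern ESCAPE = Pattern.compile("%[0-9a-fA-F]{2}");

    /** Upper-cases the hex digits of every %XX triplet; other text is untouched. */
    static String upperCaseEscapes(String url) {
        Matcher m = ESCAPE.matcher(url);
        StringBuffer sb = new StringBuffer(url.length());
        while (m.find()) {
            String upper = m.group().toUpperCase(Locale.ROOT);
            m.appendReplacement(sb, Matcher.quoteReplacement(upper));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

Only the matched triplets are changed, so letters elsewhere in the path or query keep their original case.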

sebastian-nagel commented 1 year ago

Thanks, @tfmorris! We'll have a look.