commoncrawl / ia-web-commons

Web archiving utility library
Apache License 2.0

WaybackURLKeyMaker to keep non-utf8 percent encodings #6

Open sebastian-nagel opened 7 years ago

sebastian-nagel commented 7 years ago

WaybackURLKeyMaker.makeKey(url) replaces percent signs with %25 in percent-encoded URLs whose bytes do not represent valid UTF-8 encoded characters (URLs predating RFC 3986):

http://www.aluroba.com/tags/%C3%CE%CA%C7%D1%E5%C7.htm -> com,aluroba)/tags/%25c3%25ce%25ca%25c7%25d1%25e5%25c7.htm
https://1kr.ua/newslist.html?tag=%E4%EE%F8%EA%EE%EB%FC%ED%EE%E5 -> ua,1kr)/newslist.html?tag=%25e4%25ee%25f8%25ea%25ee%25eb%25fc%25ed%25ee%25e5

Python's surt module behaves differently, which breaks look-up in CDX files for such URLs.
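To make the difference concrete, here is a minimal Python 3 sketch of the two treatments applied to the path of the first example. The helper names (`is_valid_utf8_escaping`, `keep_escapes`, `double_escape`) are hypothetical; neither function is the actual code of WaybackURLKeyMaker or surt. `double_escape` only mimics the output reported above, and `keep_escapes` is an assumption about what "keeping" the encodings could look like.

```python
import re
from urllib.parse import unquote_to_bytes

# Matches a single percent escape such as %C3 or %e5.
PERCENT_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def is_valid_utf8_escaping(s: str) -> bool:
    """Return True if the percent-decoded bytes of s form valid UTF-8."""
    try:
        unquote_to_bytes(s).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def keep_escapes(s: str) -> str:
    """Hypothetical: lowercase each escape but leave its '%' untouched."""
    return PERCENT_ESCAPE.sub(lambda m: m.group(0).lower(), s)


def double_escape(s: str) -> str:
    """Mimic the reported behaviour: the '%' of each escape becomes '%25'."""
    return PERCENT_ESCAPE.sub(lambda m: "%25" + m.group(0)[1:].lower(), s)


path = "/tags/%C3%CE%CA%C7%D1%E5%C7.htm"

print(is_valid_utf8_escaping(path))  # False: 0xC3 is not followed by a continuation byte
print(keep_escapes(path))            # /tags/%c3%ce%ca%c7%d1%e5%c7.htm
print(double_escape(path))           # /tags/%25c3%25ce%25ca%25c7%25d1%25e5%25c7.htm
```

The two keys differ, so a CDX index written with one form cannot be searched with the other.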

sebastian-nagel commented 7 years ago

Difficult to solve: Python (2.7) and Java have different string types, based on bytes and on Unicode characters, respectively. The "surt" module used with Python 3 causes a similar problem (internetarchive/surt#19).
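The bytes-versus-Unicode difference can be illustrated with Python 3's standard `urllib.parse`, using the query value from the second example. This is only a sketch of the underlying effect, not of what surt or ia-web-commons do internally. The Latin-1 round trip at the end is one possible way to carry arbitrary bytes through a Unicode string type and is shown as an assumption, not as a proposed fix.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

escaped = "%E4%EE%F8%EA%EE%EB%FC%ED%EE%E5"

# Byte-based handling: decoding to bytes and re-encoding is lossless.
raw = unquote_to_bytes(escaped)
assert quote(raw) == escaped

# Unicode-based handling: the bytes are not valid UTF-8, so decoding with
# the default errors="replace" substitutes U+FFFD and the original bytes
# cannot be recovered.
text = unquote(escaped)
assert "\ufffd" in text
assert quote(text) != escaped

# Latin-1 maps every byte to exactly one code point, so the round trip
# through a Unicode string preserves the original escapes.
latin1_text = unquote(escaped, encoding="latin-1")
assert quote(latin1_text, encoding="latin-1") == escaped
```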