commoncrawl / ia-web-commons

Web archiving utility library
Apache License 2.0

Removal of leading WWWnnn. in URL canonicalization is too aggressive #29

Open tfmorris opened 11 months ago

tfmorris commented 11 months ago

This bug (https://github.com/internetarchive/surt/issues/28) reported against the Python SURT module applies to the URL canonicalization here as well.

The following URLs are all incorrectly canonicalized to the same SURT, "com)/", because the entire first host label matches the leading WWWnnn. pattern and is stripped:

1. https://www1355544.com/
2. https://www3288.com/
3. https://www504778.com/
4. https://www556798.com/
5. https://www57912.com/
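
The collapse can be reproduced with a minimal sketch. It assumes the canonicalizer strips a leading host prefix with a case-insensitive pattern along the lines of `^www\d*\.`; that pattern, the class name `WwwStripDemo`, and the method `stripNaive` are illustrative assumptions, not code copied from ia-web-commons.

```java
import java.util.regex.Pattern;

final class WwwStripDemo {
    // Assumed pattern: "www", optional digits, then a dot, anchored at the start of the host.
    private static final Pattern WWW_PREFIX =
            Pattern.compile("^www\\d*\\.", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: strips one leading prefix without checking what remains of the host.
    static String stripNaive(String host) {
        return WWW_PREFIX.matcher(host).replaceFirst("");
    }

    public static void main(String[] args) {
        String[] hosts = {"www1355544.com", "www3288.com", "www.example.com"};
        for (String host : hosts) {
            System.out.println(host + " -> " + stripNaive(host));
        }
    }
}
```

Under that assumption the first two hosts are reduced to the bare TLD `com`, which is what produces the SURT "com)/", while `www.example.com` becomes `example.com` as intended.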

The two packages also handle these prefixes differently: the Java package removes ALL leading matching prefixes, while the Python package removes only the first one. I think the less aggressive approach of the Python package might be preferable.
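
To illustrate the difference between the two behaviors, here is a rough sketch that contrasts "remove all" with "remove first". It does not mirror either package's actual code: the pattern `^www\d*\.`, the class `PrefixStrippingComparison`, and the methods `stripAll` and `stripFirst` are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PrefixStrippingComparison {
    // Assumed pattern: "www", optional digits, then a dot, anchored at the start of the host.
    private static final Pattern WWW_PREFIX =
            Pattern.compile("^www\\d*\\.", Pattern.CASE_INSENSITIVE);

    // "Remove all" behavior: keep stripping while the remaining host still starts with a prefix.
    static String stripAll(String host) {
        String current = host;
        Matcher m = WWW_PREFIX.matcher(current);
        while (m.lookingAt()) {
            current = current.substring(m.end());
            m = WWW_PREFIX.matcher(current);
        }
        return current;
    }

    // "Remove first" behavior: strip at most one leading prefix.
    static String stripFirst(String host) {
        return WWW_PREFIX.matcher(host).replaceFirst("");
    }

    public static void main(String[] args) {
        String host = "www1.www2.example.com";
        System.out.println("all:   " + stripAll(host));   // all:   example.com
        System.out.println("first: " + stripFirst(host)); // first: www2.example.com
    }
}
```

Note that in this sketch neither variant helps with a host such as `www3288.com`: both still reduce it to `com`, so the choice between the two behaviors is separate from the main problem reported above.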