commoncrawl / ia-web-commons

Web archiving utility library
Apache License 2.0

Improve www\d*. prefix handling #30

Open tfmorris opened 11 months ago

tfmorris commented 11 months ago

Fixes #29

This makes two changes to the WWWnnn prefix handling:

It also includes a trivial fix for a duplicate test case that I noticed in my travels.

wumpus commented 10 months ago

The main intent for this misfeature is to tolerate crappy old web implementations like www1.hp.com and www1.ibm.com. We (and IA) have a lot of crap like that in our old crawls. Yes, this is a problem now with garbage www1234.com domains, but let's preserve the toleration of these old URL schemes that are actually in our data.
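
To make the two cases in this comment concrete, here is a minimal Java sketch of one possible heuristic. It is not the code in this PR or in ia-web-commons: the class name `WwwPrefixSketch`, the method `stripWwwPrefix`, and the "keep the prefix if nothing but a single label would remain" rule are all assumptions made for illustration.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ia-web-commons.
final class WwwPrefixSketch {
    // "www", then zero or more digits, then a literal dot, anchored at the start of the host.
    private static final Pattern WWW_PREFIX = Pattern.compile("^www\\d*\\.");

    static String stripWwwPrefix(String host) {
        // Host names are case-insensitive, so normalize to lower case first.
        String lower = host.toLowerCase(Locale.ROOT);
        String stripped = WWW_PREFIX.matcher(lower).replaceFirst("");
        // Assumed rule: if the remainder has no dot (e.g. "www1234.com" -> "com"),
        // the "www1234" label is the registered name itself, so leave the host alone.
        if (stripped.indexOf('.') < 0) {
            return lower;
        }
        return stripped;
    }
}
```

Under these assumptions, `stripWwwPrefix("www1.hp.com")` yields `hp.com`, while `stripWwwPrefix("www1234.com")` is returned unchanged. The dot check is only a rough stand-in: a host such as `www1.co.uk` would still be reduced to `co.uk`, so a real implementation would likely need a public-suffix lookup and validation against actual crawl data.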

tfmorris commented 10 months ago

@wumpus thanks for the review, but I'm not sure what, if any, modifications you'd like me to make to this PR.

wumpus commented 10 months ago

I want a survey of actual old crawls to make sure the new scheme does what's needed (www1.hp.com -> hp.com) and doesn't do anything bad.

wumpus commented 10 months ago

(And I'm not saying you have to do that yourself; I'm saying that it needs to be considered. Bringing up these issues is very valuable, but fully solving them will likely involve some analysis by others.)