Open tfmorris opened 11 months ago
The main intent for this misfeature is to tolerate crappy old web implementations like www1.hp.com and www1.ibm.com. We (and IA) have a lot of crap like that in our old crawls. Yes, this is a problem now with garbage www1234.com domains, but let's preserve the toleration of these old URL schemes that are actually in our data.
@wumpus thanks for the review, but I'm not sure what, if any, modifications you'd like me to make to this PR.
I want a survey of actual old crawls to make sure the new scheme does what's needed (www1.hp.com -> hp.com) and doesn't do anything bad.
(And I'm not saying you have to do that yourself; I'm saying that it needs to be considered. Bringing up these issues is very valuable, but fully solving them will likely involve some analysis by others.)
Fixes #29
This makes two changes to the WWWnnn prefix handling:
It also includes a trivial fix for a duplicate test case that I noticed in my travels.
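
For anyone wanting to run the kind of survey requested in the review, here is a minimal sketch in Python. It is not the code in this PR: the regex, the "a dotted name must remain" guard, and the names `strip_www_prefix` and `survey` are all assumptions made for illustration. The guard is only a crude stand-in for a real registrable-domain check (it would still turn `www1.co.uk` into `co.uk`), so treat the output as a starting point for discussion.

```python
import re
from collections import Counter

# Hypothetical rule: a leading "www" label with optional digits,
# e.g. "www.", "www1.", "www12." (case-insensitive).
_WWW_PREFIX = re.compile(r"^www\d*\.", re.IGNORECASE)


def strip_www_prefix(host: str) -> str:
    """Strip a leading WWWnnn label, but only if a dotted hostname remains."""
    stripped = _WWW_PREFIX.sub("", host, count=1)
    # Assumed guard: leave "www1234.com" alone, since stripping would
    # reduce it to the bare suffix "com".
    if "." not in stripped:
        return host
    return stripped


def survey(hosts):
    """Count how often each (original, stripped) pair occurs in a host list."""
    changes = Counter()
    for host in hosts:
        new_host = strip_www_prefix(host)
        if new_host != host:
            changes[(host, new_host)] += 1
    return changes


if __name__ == "__main__":
    sample = ["www1.hp.com", "www1.ibm.com", "www1234.com", "hp.com"]
    for (old, new), count in survey(sample).items():
        print(f"{old} -> {new} ({count})")
```

Feeding this a host list extracted from an old crawl would show which hostnames the rule rewrites, which is the information needed to judge whether www1.hp.com -> hp.com still works and nothing unexpected gets collapsed.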