ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Dupespotter has false positives #43

Open ivan opened 9 years ago

ivan commented 9 years ago

When archiving http://mirrors.pdp-11.ru/, dupespotter incorrectly reports a duplicate and fails to extract links:

DUPE http://mirrors.pdp-11.ru/ftp.cis.upenn.edu/
  OF http://mirrors.pdp-11.ru/ftp.mayn.de/

http://mirrors.pdp-11.ru/ftp.cis.upenn.edu/

http://mirrors.pdp-11.ru/ftp.mayn.de/
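
For illustration only, here is a minimal sketch of how a false positive like this could arise. It is not dupespotter's actual code: `normalize_body`, `fingerprint`, and `listing` are hypothetical helpers, the HTML is invented rather than fetched from the mirror, and the assumption is that the detector strips URL-derived strings from the body before hashing it.

```python
import hashlib
from urllib.parse import urlsplit


def normalize_body(body: str, url: str) -> str:
    """Hypothetical normalization: strip the URL's path and its segments from the body."""
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    # Remove longer tokens first so the full path goes before its individual segments.
    for token in sorted([path, *segments], key=len, reverse=True):
        body = body.replace(token, "")
    return body


def fingerprint(body: str, url: str) -> str:
    """Hash the normalized body; equal hashes are treated as duplicates."""
    return hashlib.sha256(normalize_body(body, url).encode("utf-8")).hexdigest()


def listing(path: str) -> str:
    """Invented directory-listing HTML whose links all start with the page's own path."""
    return f'<h1>Index of {path}</h1><a href="{path}pub/">pub/</a>'


url_a = "http://mirrors.pdp-11.ru/ftp.cis.upenn.edu/"
url_b = "http://mirrors.pdp-11.ru/ftp.mayn.de/"
body_a = listing("/ftp.cis.upenn.edu/")
body_b = listing("/ftp.mayn.de/")

fp_a = fingerprint(body_a, url_a)
fp_b = fingerprint(body_b, url_b)

# The bodies differ and link to different directories, yet the fingerprints match.
print(body_a != body_b, fp_a == fp_b)
```

Under these assumptions, the only differences between the two pages are the strings that normalization removes, so the second page is flagged as a duplicate even though its links lead somewhere else.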

ivan commented 9 years ago

I have added a --no-dupespotter option in 7a63a3dcd113e11de218fe6bb0c3ad03153a6954, and I might enable it by default in the future.