ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Dupespotter has false positives #43

Open ivan opened 8 years ago

ivan commented 8 years ago

When archiving http://mirrors.pdp-11.ru/, dupespotter incorrectly reports a duplicate and fails to extract links:

DUPE http://mirrors.pdp-11.ru/ftp.cis.upenn.edu/
  OF http://mirrors.pdp-11.ru/ftp.mayn.de/

http://mirrors.pdp-11.ru/ftp.cis.upenn.edu/

http://mirrors.pdp-11.ru/ftp.mayn.de/
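To illustrate how this kind of false positive can arise, here is a minimal sketch. It assumes a duplicate detector that strips the page's own URL path from the body before hashing; that rule, the `normalized_digest` and `listing` helpers, and the page contents are all made up for illustration and are not dupespotter's actual code or the real content of these two pages.

```python
import hashlib
import re
from urllib.parse import urljoin, urlsplit


def normalized_digest(url: str, body: str) -> str:
    """Hash a body after removing echoes of the page's own URL path.

    Hypothetical normalization rule, NOT dupespotter's real implementation.
    """
    path = urlsplit(url).path
    if path != "/":
        # Drop the path wherever the page echoes it back (title, heading).
        body = body.replace(path, "")
    # Collapse whitespace so formatting differences do not matter.
    body = re.sub(r"\s+", " ", body).strip()
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def listing(path: str) -> str:
    """Build a made-up autoindex page whose only unique text is its own path."""
    return (
        f"<html><head><title>Index of {path}</title></head><body>"
        f"<h1>Index of {path}</h1>"
        '<a href="../">../</a> <a href="pub/">pub/</a>'
        "</body></html>"
    )


url_a = "http://mirrors.pdp-11.ru/ftp.cis.upenn.edu/"
url_b = "http://mirrors.pdp-11.ru/ftp.mayn.de/"
page_a = listing("/ftp.cis.upenn.edu/")
page_b = listing("/ftp.mayn.de/")

# The raw pages differ, yet they collide once the path is stripped.
same_digest = normalized_digest(url_a, page_a) == normalized_digest(url_b, page_b)

# The same relative link resolves to a different URL on each page.
child_a = urljoin(url_a, "pub/")
child_b = urljoin(url_b, "pub/")

print(page_a != page_b, same_digest, child_a != child_b)
```

Under these assumptions the two listings hash identically even though their relative links point at different URLs, so skipping link extraction on the page flagged as a duplicate would drop everything beneath it from the crawl.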

ivan commented 8 years ago

I have added a --no-dupespotter option in 7a63a3dcd113e11de218fe6bb0c3ad03153a6954, and I might enable it by default in the future.