ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Add archive.org to global ignore set #118

Closed brandongalbraith closed 6 years ago

ivan commented 6 years ago

I've thought about doing this, but it doesn't always make sense, because someone might be trying to capture Wayback before some domain gets robots.txt'ed.

I would recommend using the new --import-ignores and aliasing grab-site to include the argument.
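A minimal sketch of that setup, assuming a bash-like shell. The file path and the pattern are illustrative choices, not something grab-site ships with, and the sketch assumes the ignores file holds one regular expression per line, matched against URLs:

```shell
# Hypothetical location for personal ignore patterns; any readable path works.
# One regex per line. This one is an assumption: it is meant to cover
# archive.org and its subdomains such as web.archive.org.
cat > "$HOME/.grab-site-ignores" <<'EOF'
^https?://([^/]+\.)?archive\.org/
EOF

# Make every grab-site invocation import the personal ignores.
# Put this line in ~/.bashrc (or the equivalent) to make it permanent.
alias grab-site='grab-site --import-ignores "$HOME/.grab-site-ignores"'
```

Running the real binary directly (for example `command grab-site ...`) skips the alias for a crawl that should capture Wayback pages.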

brandongalbraith commented 6 years ago

Got it! Thanks @ivan! While I wasn't monitoring my crawl, I ran into a site that happened to reference fairly large items (5GB+) on archive.org. I'll update my personal import ignores locally.

ivan commented 6 years ago

(I guess I was talking about web.archive.org, but it sort of applies to archive.org items as well - they sometimes get darked.)

brandongalbraith commented 6 years ago

My crawls are for Wayback Machine ingest 😄 Thanks for taking the time to reply!

jodizzle commented 6 years ago

Would it be appropriate to have a separate igset in the ignore_sets folder for links to various archival sites? It seems like it would be generally useful.