Closed by brandongalbraith 6 years ago
Got it! Thanks @ivan! While I wasn't monitoring my crawl, I ran into a site that happened to reference fairly large items (5GB+) on archive.org. I'll update my personal import ignores locally.
(I guess I was talking about web.archive.org, but it sort of applies to archive.org items as well - they sometimes get darked.)
My crawls are for Wayback Machine ingest 😄 Thanks for taking the time to reply!
Would it be appropriate to have some other igset in the ignore_sets folder for various archival site links? It seems like it would be generally useful.
I've thought about doing this, but it doesn't always make sense, because someone might be trying to capture Wayback before some domain gets robots.txt'ed.
I would recommend using the new --import-ignores option and aliasing grab-site to include the argument.
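For anyone wanting to try this, here is a rough sketch of what the alias could look like. The file location and the two patterns are my own assumptions for illustration, not a vetted set, so adjust them to whatever you actually want skipped.

```shell
# Hypothetical location for a personal ignores file; any readable path works.
IGNORES_FILE="${IGNORES_FILE:-$HOME/.grab-site-ignores}"

# One regular expression per line; URLs matching a pattern are skipped.
# Both patterns below are illustrative assumptions.
cat > "$IGNORES_FILE" <<'EOF'
^https?://web\.archive\.org/
^https?://archive\.org/download/
EOF

# Make every grab-site invocation import that file before the crawl starts.
alias grab-site="grab-site --import-ignores '$IGNORES_FILE'"
```

Putting the alias line in something like ~/.bashrc keeps it across sessions. When you do want to capture Wayback content for a particular crawl, running `command grab-site ...` bypasses the alias.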