ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Can't grab Wikimedia thumbnails, even when global is removed from igset file #223

Closed BrinBellway closed 2 years ago

BrinBellway commented 2 years ago

My files are meant to be self-contained rather than ingested into the Internet Archive, so I do want to include Wikimedia thumbnails (and, though less importantly, Gravatar and Tumblr avatars).

I tried to work around the existence of these lines in the global igset by starting a crawl and then immediately removing the contents of the crawl's igset file, but there are still no Wikimedia thumbnails in the resulting WARC, nor do they appear in the log.

Is there a more effective way to turn the global igset off? (I've already copied the parts of the global igset I want to keep into my customary --import-ignores file.)


For replication purposes, the site I am currently trying to crawl is A Collection of Unmitigated Pedantry, and https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Roman_Empire_with_dioceses_in_400_AD.png/1280px-Roman_Empire_with_dioceses_in_400_AD.png (embedded in https://acoup.blog/2022/01/28/collections-rome-decline-and-fall-part-ii-institutions/ ) is one example of the many illustrations that fail to appear in the WARC. I am using grab-site v2.2.2, the latest version available through nixpkgs.
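To narrow down which ignore pattern is responsible for a missing URL, a small script along these lines may help. It is only a sketch, and it rests on assumptions: that each ignore is a Python regular expression on its own line, that it is tested against the URL with `re.search`, and that blank lines and lines starting with `#` can be skipped. The two entries in `EXAMPLE_PATTERNS` are hypothetical illustrations, not copied from grab-site's global igset, and `matching_ignores` is a made-up helper name.

```python
import re

# Hypothetical ignore patterns for illustration only; they are NOT
# copied from grab-site's global igset.
EXAMPLE_PATTERNS = [
    r"^https?://upload\.wikimedia\.org/wikipedia/[^/]+/thumb/",
    r"^https?://[^/]+\.gravatar\.com/avatar/",
]

# The thumbnail URL quoted in this report.
THUMBNAIL_URL = (
    "https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/"
    "Roman_Empire_with_dioceses_in_400_AD.png/"
    "1280px-Roman_Empire_with_dioceses_in_400_AD.png"
)


def matching_ignores(patterns, url):
    """Return the patterns (in order) that match url via re.search.

    Assumption: one regex per entry; blank entries and entries
    starting with '#' are skipped. Invalid regexes are skipped too.
    """
    hits = []
    for raw in patterns:
        pattern = raw.strip()
        if not pattern or pattern.startswith("#"):
            continue
        try:
            if re.search(pattern, url):
                hits.append(pattern)
        except re.error:
            # Not a valid Python regex; ignore it in this sketch.
            continue
    return hits


if __name__ == "__main__":
    for pattern in matching_ignores(EXAMPLE_PATTERNS, THUMBNAIL_URL):
        print("ignored by:", pattern)
```

To check a real ignores file, the list could be replaced with the lines read from that file, for example `open(path).read().splitlines()`, where `path` points at whatever file you pass to --import-ignores.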

ivan commented 2 years ago

Thanks for the report. I think it was wrong for that ignore to be there, so I've removed it in 09b26c88fd8d548ec9a20aaec40fcc5129ae6fa1.

I'll try to make a PR updating grab-site in nixpkgs soon.

If you need an immediate fix for nixpkgs, you can clone https://github.com/NixOS/nixpkgs, edit grab-site/default.nix to point to a newer version, and then run nix-env -f /path/to/nixpkgs -iA grab-site

ivan commented 2 years ago

I've submitted a PR to update grab-site in nixpkgs at https://github.com/NixOS/nixpkgs/pull/185522. I've also updated the nix-env based install steps at https://github.com/ArchiveTeam/grab-site#install-on-another-distribution-lacking-python-37x-or-38x so that grab-site 2.2.7 can be installed before grab-site is updated in a NixOS release. Let me know if you find any issues.