ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Missing linked files. #125

Closed ZizzyDizzyMC closed 6 years ago

ZizzyDizzyMC commented 6 years ago

I've done some testing, and grab-site apparently just misses files on this site for some reason: grab-site https://www.mlptf2mods.com/mods/all-class_hats/the_adorable_accolade --delay=1000-3000 --concurrency=2 --no-offsite-links --igon --1

It completely and silently misses the download link. Not sure where to go from here except file an issue and see if someone else can figure it out.

ivan commented 6 years ago

If you use --1, grab-site will grab just that URL and its page requisites. <a href="">s to other URLs aren't treated as page requisites. I see it grabbing the .zip file on that page after I remove --1. Does that solve the problem, or are you looking for a crawl narrower than what a recursive crawl does on that URL?

ZizzyDizzyMC commented 6 years ago

It misses the zip on my install; not sure what I can do. I ran gs-dump-urls DIR/wpull.db done | grep zip and got nothing.

Edit: forgot to mention that I removed the --1 as you suggested. My most recent test command was grab-site https://www.mlptf2mods.com/mods/all-class_hats/the_adorable_accolade --delay=1000-3000 --concurrency=2 --no-offsite-links --igon

ivan commented 6 years ago

Ah, sorry, I missed that --no-offsite-links, which also prevents grabbing that .zip file because it's on a different domain (a subdomain). It should work with just:

grab-site https://www.mlptf2mods.com/mods/all-class_hats/the_adorable_accolade --delay=1000-3000 --concurrency=2 --igon

ZizzyDizzyMC commented 6 years ago

That's a little strange, considering that with --no-offsite-links it still grabs a .jpg file from subdomain.domain.com. If you've already done a test run using my command with --no-offsite-links, check for that .jpg with gs-dump-urls DIR/wpull.db done | grep dash

Is this accidental, unwanted behavior?

Also: is there a way to limit grab-site to *.domain.com? (Maybe I should open a new issue / feature request if not.)

ivan commented 6 years ago

Right: if something is a stylesheet, an <img src="...">, or anything else wpull treats as a page requisite, it is grabbed from other domains even with --no-offsite-links. But <a href="...">s are not.
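To make the distinction concrete, here is a simplified illustration (not wpull's actual code) of the two buckets a crawler sorts URLs into: page requisites like images and stylesheets, which get fetched even cross-domain, versus ordinary <a href> links, which --no-offsite-links skips when they point offsite. The example URLs are made up.

```python
# Simplified sketch, NOT wpull's real logic: page requisites (img/script src,
# stylesheet link href) are collected separately from ordinary <a href> links.
from html.parser import HTMLParser

class LinkClassifier(HTMLParser):
    """Split URLs found in HTML into page requisites vs. ordinary links."""
    def __init__(self):
        super().__init__()
        self.requisites = []   # fetched even when cross-domain
        self.links = []        # skipped by --no-offsite-links when offsite
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.requisites.append(attrs["src"])
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.requisites.append(attrs["href"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

page = '''
<link rel="stylesheet" href="https://cdn.example.com/style.css">
<img src="https://img.example.com/screenshot.jpg">
<a href="https://dl.example.com/mod.zip">Download</a>
'''
c = LinkClassifier()
c.feed(page)
print(c.requisites)  # the .css and .jpg: grabbed despite being offsite
print(c.links)       # the .zip: an <a href>, skipped with --no-offsite-links
```

This is why the .jpg from subdomain.domain.com showed up in your crawl while the .zip did not.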

I don't think there's an easy way to limit it to a wildcarded subdomain. Feel free to file an issue, but I probably won't have time to work on it unless I get a fairly clean PR.

This might be possible to implement with a new option and a modified def accept_url in wpull_hooks.py; hopefully a forked wpull 1.2.3 is not necessary.
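As a rough sketch of what the host check inside such a modified accept_url might look like: the helper name and the wildcard-pattern convention below are hypothetical, not an existing grab-site option.

```python
# Hypothetical helper (made-up name, not part of grab-site/wpull) that a
# modified accept_url could call to keep a crawl inside *.domain.com.
from urllib.parse import urlparse

def host_in_wildcard_scope(url, pattern):
    """Return True if url's hostname matches a pattern like '*.domain.com'.
    The bare apex domain ('domain.com') is also accepted."""
    host = urlparse(url).hostname or ""
    if pattern.startswith("*."):
        apex = pattern[2:]
        return host == apex or host.endswith("." + apex)
    return host == pattern

# Example: keep any subdomain of mlptf2mods.com, reject everything else.
print(host_in_wildcard_scope("https://www.mlptf2mods.com/mods/x", "*.mlptf2mods.com"))  # True
print(host_in_wildcard_scope("https://dl.mlptf2mods.com/a.zip", "*.mlptf2mods.com"))    # True
print(host_in_wildcard_scope("https://example.com/", "*.mlptf2mods.com"))               # False
```

A real implementation would still need to let page requisites through regardless of host, to match the existing --no-offsite-links behavior.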

Closing because the behavior you observed is expected (despite being a little confusing). Let me know if it doesn't match what I described.