Closed ZizzyDizzyMC closed 6 years ago
If you use --1, grab-site will grab just that URL and its page requisites. <a href=""> links to other URLs aren't treated as page requisites. I see it grabbing the .zip file on that page after I remove --1. Does that solve the problem, or are you looking for a crawl narrower than what a recursive crawl does on that URL?
It misses the zip on my install, and I'm not sure what I can do. I ran gs-dump-urls DIR/wpull.db done | grep zip and got nothing.
Edit: I forgot to mention that I removed --1 as you suggested. My most recent test command was:
grab-site https://www.mlptf2mods.com/mods/all-class_hats/the_adorable_accolade --delay=1000-3000 --concurrency=2 --no-offsite-links --igon
Ah, sorry, I missed that --no-offsite-links, which also prevents the grabbing of that .zip file because it's on a different domain (subdomain). It should work with just:
grab-site https://www.mlptf2mods.com/mods/all-class_hats/the_adorable_accolade --delay=1000-3000 --concurrency=2 --igon
That's a little strange, considering that using --no-offsite-links still allows it to grab a .jpg file from subdomain.domain.com. If you have already done a test run using my command with --no-offsite-links, you can check for that .jpg with gs-dump-urls DIR/wpull.db done | grep dash. Is this accidental, unwanted behavior?
Also: is there a way to limit grab-site to *.domain.com? (Maybe I should open a new issue / request if not.)
Right, if something is a stylesheet, an <img src="...">, or another thing wpull treats as a page requisite, it is grabbed from other domains even with --no-offsite-links. But <a href="..."> links are not.
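To make the distinction above concrete, here is a minimal Python sketch. It is not wpull's or grab-site's actual code: the LinkClassifier class, the would_fetch function, the set of tags treated as requisites, and the example.com hostnames are all hypothetical, and it assumes the offsite check is an exact hostname comparison.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkClassifier(HTMLParser):
    """Split the URLs on a page into page requisites and ordinary links.

    Illustrative only: the real list of things wpull treats as page
    requisites is longer than the three tags handled here.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.requisites = []  # stylesheets, images, scripts
        self.links = []       # <a href="..."> targets

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        src = attrs.get("src")
        rel = (attrs.get("rel") or "").lower().split()
        if tag == "a" and href:
            self.links.append(urljoin(self.base_url, href))
        elif tag == "link" and href and "stylesheet" in rel:
            self.requisites.append(urljoin(self.base_url, href))
        elif tag in ("img", "script") and src:
            self.requisites.append(urljoin(self.base_url, src))


def would_fetch(url, start_url, is_requisite, no_offsite_links):
    """Assumed rule: requisites are always fetched; with the
    no-offsite-links behavior on, links must stay on the start host."""
    if is_requisite or not no_offsite_links:
        return True
    return urlsplit(url).hostname == urlsplit(start_url).hostname
```

Under these assumptions, an image on another subdomain is still fetched because it is a requisite, while a download link on another subdomain is skipped as soon as the offsite restriction is on.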
I don't think there's an easy way to limit it to a wildcarded subdomain. Feel free to file an issue, but I probably won't have time to work on it unless I get a fairly clean PR.
This might be possible to implement with a new option and a modified def accept_url in wpull_hooks.py; hopefully a forked wpull 1.2.3 is not necessary.
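As a rough illustration of what the wildcard check itself could look like, here is a standalone Python helper. The name host_matches_domain and its parameters are hypothetical, and the sketch does not show how it would be wired into accept_url, since that function's arguments are not described in this thread. It also assumes that *.domain.com should match the bare domain.com as well.

```python
from urllib.parse import urlsplit


def host_matches_domain(url, base_domain):
    """Return True if the URL's host is base_domain or any subdomain of it.

    Hypothetical helper, not part of grab-site or wpull.
    """
    # hostname is None for URLs without a network location (e.g. mailto:)
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    base = base_domain.lower().strip(".")
    # Require a dot boundary so "notexample.com" does not match "example.com"
    return host == base or host.endswith("." + base)
```

Matching on a dot boundary rather than a plain suffix keeps lookalike hosts from slipping through.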
Closing because the behavior you observed is expected (despite being a little confusing). Let me know if it doesn't match what I described.
I've done some testing and apparently grab-site just completely misses files for some reason on this site:
grab-site https://www.mlptf2mods.com/mods/all-class_hats/the_adorable_accolade --delay=1000-3000 --concurrency=2 --no-offsite-links --igon --1
It misses the download link completely and silently; it never catches it. I'm not sure where to go from here except to file an issue and see if someone else can figure it out.