ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Is it possible to crawl only a domain and its subdomains? #233

Closed: ghost closed this issue 5 months ago

ghost commented 7 months ago

Hello, I'm trying to archive a XenForo forum.

Let's say the site is forum.com, and the attachments I want to download are hosted on the subdomain i.forum.com.

Is there any way to only crawl from the domains forum.com and i.forum.com?

This yields: grab-site: error: argument -H/--span-hosts: not allowed with argument --span-hosts-allow

grab-site seems to use --span-hosts-allow by default, with no way to disable it. If it could be disabled, "--span-hosts --domains forum.com,i.forum.com" might work.

Does anyone have a solution?

ivan commented 7 months ago

grab-site supports multiple start URLs. Does it work properly if you do grab-site --no-offsite-links https://forum.com/ https://i.forum.com/?

ghost commented 7 months ago

Thank you, it seems to work. I tested it with a single forum thread. It still grabs some offsite links, but far fewer than without --no-offsite-links, and it does include the attachments on the subdomain.

ivan commented 5 months ago

The remaining "offsite links" are probably just page requisites?
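
For readers who want to see the scoping idea from this thread in concrete form, here is a minimal Python sketch. It is not grab-site's code and does not reflect how grab-site implements --no-offsite-links; the names ALLOWED_HOSTS, host_of and should_fetch are hypothetical, and the rule "page requisites are fetched from any host" is an assumption based on the last comment above.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list, mirroring the two start URLs from the thread.
ALLOWED_HOSTS = {"forum.com", "i.forum.com"}


def host_of(url: str) -> str:
    """Return the lowercased hostname of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def should_fetch(url: str, is_page_requisite: bool = False) -> bool:
    """Illustrative scoping rule (an assumption, not grab-site's logic).

    Ordinary links are followed only when their host is on the allow-list.
    Page requisites (images, CSS, scripts) are fetched wherever they live,
    which would explain a few remaining offsite requests.
    """
    if is_page_requisite:
        return True
    return host_of(url) in ALLOWED_HOSTS


if __name__ == "__main__":
    for candidate in (
        "https://forum.com/threads/1/",
        "https://i.forum.com/attachments/a.png",
        "https://example.org/some-page",
    ):
        print(candidate, should_fetch(candidate))
```

Exact host matching is used here because the question asks for two specific hosts; matching every subdomain of forum.com would need a suffix check instead.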