ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Add option to automatically crawl up to any potential directory listing #88

Open dkl3 opened 8 years ago

dkl3 commented 8 years ago

Hi, when I run grab-site, it doesn't seem to check whether the directories above the crawled URLs are unprotected. Checking them is crucial to creating a complete site archive, since it can turn up unlinked content (like _vti_cnf directories). Not all sites have "index of" directories, though.

In the past I've had to check manually, via Google/Bing, for a site's unprotected directories.

Adding this as either a Wpull or a grab-site argument would mean a lot.

ivan commented 8 years ago

wpull probably needs a new hook/API for this. Ideally accept_url could just generate parent URLs and feed them into wpull to be queued.

(Actually, maybe I can navigate to the wpull object I need with some kind of wpull_hook.factory.get(...) call? I haven't explored this in detail.)

dkl3 commented 8 years ago

Is there an actual function for "accept_url"? I don't think we have a way to use this yet.

dkl3 commented 8 years ago

Will you add the new hook for checking "index of" directories? I'd love that.

ivan commented 8 years ago

@chfoo is it safe to call wpull_hook.factory.get('URLTable').add_many(...) to feed in extra URLs to crawl?

ivan commented 8 years ago

Seems to be working, in any case

ivan commented 8 years ago

Implementing this in grab-site means duplicating some wpull logic (e.g. knowing not to go up above any of the start URLs; parsing and getting the parent URL; making up inline=0, referrer=url, ... values for add_many), so it might actually be better to implement in wpull instead.

Unfinished code is in https://github.com/ludios/grab-site/commits/find-parent-indexes
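
For illustration only, the parent-URL part of that logic (parsing a URL, walking up its directories, and stopping before going above any start URL) might look roughly like the sketch below. It uses only the Python standard library. `parent_directory_urls` and its parameters are hypothetical names, not wpull or grab-site APIs, and the sketch does not cover queueing the results through add_many or choosing the inline/referrer values.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_directory_urls(url, start_urls):
    """Return ancestor directory URLs of `url`, nearest first.

    Hypothetical helper: stops as soon as a candidate directory is no
    longer at or below the directory of some start URL on the same host.
    """
    parts = urlsplit(url)

    # Directory prefixes of the start URLs that share scheme and host.
    roots = []
    for start in start_urls:
        s = urlsplit(start)
        if (s.scheme, s.netloc) == (parts.scheme, parts.netloc):
            # Everything up to and including the last "/" of the path.
            roots.append(s.path.rsplit("/", 1)[0] + "/")

    # Strip the trailing slash so a directory URL is not its own parent.
    trimmed = (parts.path or "/").rstrip("/")
    results = []
    while trimmed:
        # Drop the last path segment to move one level up.
        trimmed = trimmed.rsplit("/", 1)[0]
        candidate = trimmed + "/"
        if not any(candidate.startswith(root) for root in roots):
            break  # would go above every start URL
        results.append(
            urlunsplit((parts.scheme, parts.netloc, candidate, "", ""))
        )
    return results


if __name__ == "__main__":
    print(parent_directory_urls(
        "http://example.com/a/b/c.html",
        ["http://example.com/a/"],
    ))
    # ['http://example.com/a/b/', 'http://example.com/a/']
```

Query strings and fragments are dropped on purpose here, since a directory listing would normally be requested without them; a real implementation would also need to skip URLs that are already queued.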

dkl3 commented 8 years ago

Has there been any progress with this lately?

ivan commented 8 years ago

No immediate plans to do this in grab-site, partly for the reasons mentioned above. Maybe someone can try to do this in wpull, or if that doesn't work out, finish the unfinished code above.