Open dkl3 opened 8 years ago
wpull probably needs a new hook/API for this. Ideally accept_url
could just generate parent URLs and feed them into wpull to be queued.
(Actually, maybe I can navigate to the wpull object I need with some kind of wpull_hook.factory.get(...)
call? I haven't explored this in detail.)
Is there an actual function for "accept_url"? I don't think we have a way to use this yet.
Will you add the new hook for checking "index of" directories? I'd love that.
@chfoo is it safe to call wpull_hook.factory.get('URLTable').add_many(...)
to feed in extra URLs to crawl?
Seems to be working, in any case
Implementing this in grab-site means duplicating some wpull logic (e.g. knowing not to go up above any of the start URLs; parsing and getting the parent URL; making up inline=0, referrer=url, ... values for add_many), so it might actually be better to implement in wpull instead.
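For illustration, here is a rough sketch of the "parse the parent URL, but don't go above any start URL" part using only the Python standard library. The names `directory_of`, `step_up`, and `parent_directories` are made up for this example and are not part of wpull or grab-site; it also doesn't touch the URL table at all, it only computes candidate URLs.

```python
from urllib.parse import urlsplit, urlunsplit


def directory_of(url):
    """Return the directory URL containing `url` (query and fragment dropped)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Keep everything up to and including the last slash.
    dir_path = path[: path.rfind("/") + 1]
    return urlunsplit((parts.scheme, parts.netloc, dir_path, "", ""))


def step_up(dir_url):
    """Return the parent of a directory URL, or None if already at the root."""
    parts = urlsplit(dir_url)
    trimmed = parts.path.rstrip("/")
    if not trimmed:
        return None
    parent_path = trimmed[: trimmed.rfind("/") + 1]
    return urlunsplit((parts.scheme, parts.netloc, parent_path, "", ""))


def parent_directories(url, start_urls):
    """List ancestor directories of `url` that stay at or below a start URL's directory."""
    roots = [directory_of(s) for s in start_urls]
    found = []
    current = directory_of(url)
    if current == url:
        # `url` is already a directory, so begin with its parent.
        current = step_up(current)
    while current is not None and any(current.startswith(r) for r in roots):
        found.append(current)
        current = step_up(current)
    return found


if __name__ == "__main__":
    for parent in parent_directories(
        "http://example.com/a/b/c/file.txt", ["http://example.com/a/"]
    ):
        print(parent)
```

Each returned URL would still need to be queued with whatever values wpull expects, which is the part that duplicates wpull's own logic.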
Unfinished code is in https://github.com/ludios/grab-site/commits/find-parent-indexes
Has there been any progress with this lately?
No immediate plans to do this in grab-site, partly for the reasons mentioned above. Maybe someone can try to do this in wpull, or if that doesn't work out, finish the unfinished code above.
Hi, when I run grab-site, I get the feeling that it doesn't check directories to see whether they're unprotected. Having it do this is crucial to creating a complete site archive (for example, capturing unlinked directories like _vti_cnf). Not all sites have "index of" directories, though.
In the past, I've had to check manually for a site's unprotected directories using Google/Bing.
Adding this as either a Wpull or a grab-site argument would mean a lot.