ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Enhancement idea: URL prioritization #81

Open ethus3h opened 8 years ago

ethus3h commented 8 years ago

I suggest having an option to mark some URLs as low priority by regex, in a manner similar to ignores; such URLs would be downloaded after everything else.

Also an option to automatically mark all off-site URLs as low-priority, and an option to automatically mark all off-domain page requisites as low-priority?
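To make the request concrete, here is a minimal sketch of how such a classification could look, written in plain Python with only the standard library. It is not grab-site or wpull code: `url_priority`, `low_patterns`, and `offsite_low` are hypothetical names chosen for illustration, and "off-site" is assumed to mean "different hostname than the start URL". Page requisites are not handled here.

```python
import re
from urllib.parse import urlsplit

# Hypothetical priority levels: lower values would be fetched first.
NORMAL = 0
LOW = 1


def url_priority(url, start_url, low_patterns=(), offsite_low=False):
    """Classify a URL as NORMAL or LOW priority.

    low_patterns: regexes matched anywhere in the URL, like ignore patterns.
    offsite_low:  if True, URLs on a different hostname than start_url
                  are treated as low priority (an assumed definition of
                  "off-site").
    """
    if any(re.search(pattern, url) for pattern in low_patterns):
        return LOW
    if offsite_low and urlsplit(url).hostname != urlsplit(start_url).hostname:
        return LOW
    return NORMAL
```

For example, `url_priority("https://other.org/page", "https://example.com/", offsite_low=True)` classifies the URL as `LOW`, while the same call without `offsite_low=True` leaves it at `NORMAL`.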

ethus3h commented 8 years ago

One use case for this would be if one knows a site is going to go down, and so it's important to get the site's links first, but the offsite links would also be valuable.

ivan commented 8 years ago

That would be useful but I have no idea how to do this. It may require a new wpull hook.

ethus3h commented 8 years ago

I really should learn how to program on grab-site and wpull so I can try to help make these things happen :3

12As commented 8 years ago

I think this might be better addressed upstream, at least in the off-site links case. From a cursory look at the engine code (at least the part where it gets the URLs), wpull's modified BFS appears to be implemented with a priority queue. If so, an additional if-then clause could be added where the priority is computed, adding some arbitrarily large number to the priority of off-site links so that they are fetched after everything else.

Regardless, it may be useful to get some input from @chfoo on this.
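A rough sketch of the idea, assuming a frontier where a lower priority number is popped first and the base priority is simply the link depth. This is a standalone illustration built on `heapq`, not wpull's actual engine code; `Frontier` and `OFFSITE_PENALTY` are hypothetical names, and "off-site" is again assumed to mean a different hostname than the start URL.

```python
import heapq
import itertools
from urllib.parse import urlsplit

# Arbitrarily large number added to the priority of off-site links.
OFFSITE_PENALTY = 1_000_000


class Frontier:
    """Toy BFS frontier backed by a min-heap (lowest priority value first)."""

    def __init__(self, start_url):
        self._host = urlsplit(start_url).hostname
        self._heap = []
        # Tie-breaker so URLs with equal priority keep insertion order.
        self._counter = itertools.count()
        self.push(start_url, depth=0)

    def push(self, url, depth):
        priority = depth
        # The extra if-then clause: penalize links that leave the start host.
        if urlsplit(url).hostname != self._host:
            priority += OFFSITE_PENALTY
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def pop(self):
        _priority, _order, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)
```

With this ordering, an off-site link discovered at depth 1 is still popped after an on-site link discovered at depth 2, which matches the "get the site's own links first" use case described earlier in the thread.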

Sanqui commented 8 years ago

I believe this is related to https://github.com/chfoo/wpull/issues/297.

ivan commented 7 years ago

https://gist.github.com/JustAnotherArchivist/b82f7848e3c14eaf7717b9bd3ff8321a