ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Add a default get_urls hook to get :orig quality images on Twitter #107

Closed: ivan closed this issue 6 years ago

ivan commented 6 years ago

Are there any other obvious URLs to additionally queue when we see certain URLs on various websites? Suggestions welcome.

brandongalbraith commented 6 years ago

Flickr, and Quora (add ?share=1 to the end of the URL to avoid having to log in to retrieve content).
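
As a rough illustration of the Quora suggestion above, the rewrite could look something like the sketch below. The helper name `with_share_param` and the example URL are hypothetical; nothing here is taken from grab-site itself, and it only shows the URL manipulation, not how the result would be queued.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_share_param(url):
    """Return url with share=1 in its query string, keeping other parameters.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    # Only add the parameter once, so the function is safe to call repeatedly.
    if ("share", "1") not in params:
        params.append(("share", "1"))
    return urlunsplit(parts._replace(query=urlencode(params)))


if __name__ == "__main__":
    # Made-up question URL, used only to show the transformation.
    print(with_share_param("https://www.quora.com/What-is-a-WARC-file"))
```

Going through `urlsplit` rather than appending the string directly means a URL that already carries a query string gets `&share=1` instead of a second `?`.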

ivan commented 6 years ago

Tumblr images too

ethus3h commented 6 years ago

This seems similar to https://github.com/ludios/grab-site/issues/88 to me...

Similarly, pages with query strings could also be queued with the query string removed (e.g. /index.php?foo=bar becoming /index.php).
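
The query-string idea above could be sketched roughly as follows. The function name `without_query` is made up for this example, and returning `None` when there is nothing to strip is just one possible convention for "no extra URL to queue".

```python
from urllib.parse import urlsplit, urlunsplit


def without_query(url):
    """Return url with its query string (and fragment) removed.

    Returns None when the URL has no query string, meaning there is
    no additional URL worth queueing. Hypothetical helper.
    """
    parts = urlsplit(url)
    if not parts.query:
        return None
    return urlunsplit(parts._replace(query="", fragment=""))


if __name__ == "__main__":
    print(without_query("/index.php?foo=bar"))
    print(without_query("/index.php"))
```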

ivan commented 6 years ago

I tried implementing this in branch get-urls-hook, but when wpull gets to the :orig URL in the queue, it just puts it into the skipped state, for reasons unknown to me.

ivan commented 6 years ago

URLs were getting skipped because of grab-site's --no-parent combined with the lack of inline=True in the hook.

Implemented for Twitter and Quora in 0ea3d4093860ac526ea5e2d8c591ea31df3ccd44.

I didn't want to deal with Flickr (it looks like it uses a different secret for the original image anyway), but I would take a PR for it (to get the largest non-original image?).
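
To illustrate the Twitter case and the inline=True fix described in this comment, here is a minimal sketch. It is not the code from the commit referenced above. The function name `extra_urls_for` is hypothetical, and the returned dict shape (`url` and `inline` keys) is an assumption about what a get_urls-style hook hands back to wpull, so check it against the wpull version in use.

```python
import re

# Twitter serves tweet images from this host and path.
TWITTER_MEDIA_PREFIX = "https://pbs.twimg.com/media/"


def extra_urls_for(url):
    """Return extra queue entries for url, or an empty list.

    Hypothetical helper; the entry format is an assumption.
    """
    if not url.startswith(TWITTER_MEDIA_PREFIX):
        return []
    # Drop any existing size suffix such as ":large" before adding ":orig".
    base = re.sub(r":[a-z]+$", "", url)
    orig_url = base + ":orig"
    if orig_url == url:
        # Already the :orig variant, so there is nothing new to queue.
        return []
    # inline=True marks the URL as a page requisite; per the comment above,
    # leaving it out caused the URL to be skipped under --no-parent.
    return [{"url": orig_url, "inline": True}]


if __name__ == "__main__":
    # Made-up media ID, for illustration only.
    print(extra_urls_for("https://pbs.twimg.com/media/ABC123.jpg:large"))
```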

brandongalbraith commented 6 years ago

Thank you @ivan! 👏👏👏