ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Tumblr redirect #94

Closed RomeSilvanus closed 7 years ago

RomeSilvanus commented 7 years ago

Not sure if this is the right place, but I was trying to archive a Tumblr blog and ran into the issue that grab-site doesn't follow this specific type of link:

http://t.umblr.com/redirect?z=http%3A%2F%2Fi.imgur.com

In the case of this Tumblr it happens when an image post links to an external image hosted on Imgur.

Is there any way to make grab-site follow this link and archive the Imgur page/image? I was unable to find any way to do it. Putting http://t.umblr.com as a URL argument doesn't seem to help.

ivan commented 7 years ago

If you're using the default behavior of grabbing offsite links (i.e. without --no-offsite-links), grab-site/wpull should be able to follow t.umblr redirects. I tested this by copying the "clickhole.com parody article" link from http://unwrapping.tumblr.com/post/128241551157/redirect-url-link-posts and running grab-site on it.

If the grab-site dashboard doesn't look like it's following the redirect, it's probably because the redirected-to page was already grabbed, or because it's queued and will be grabbed later. t.umblr doesn't serve a true HTTP redirect but rather a page like this:

<noscript>
<meta name="referrer" content="origin">
<meta http-equiv="refresh" content="0;URL=http://www.clickhole.com/article/anthropologists-are-verge-figuring-out-how-youre-s-1299">
</noscript>
<meta name="referrer" content="origin">
<title>http://www.clickhole.com/article/anthropologists-are-verge-figuring-out-how-youre-s-1299</title>
<script>window.opener = null; location.replace("http:\/\/www.clickhole.com\/article\/anthropologists-are-verge-figuring-out-how-youre-s-1299")</script>

so it goes through the normal link extraction and fetching instead of immediately following the redirect.
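To illustrate what "normal link extraction" means for a page like the one above, here is a minimal Python sketch that pulls the target out of the meta refresh tag. This is not wpull's actual code; the class and function names are made up for illustration, and it only handles the `N;URL=...` form shown in the sample page.

```python
from html.parser import HTMLParser
from typing import List, Optional


class MetaRefreshParser(HTMLParser):
    """Collects the URL from every <meta http-equiv="refresh"> tag.

    Hypothetical helper for illustration; not part of grab-site or wpull.
    """

    def __init__(self) -> None:
        super().__init__()
        self.targets: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        # content looks like "0;URL=http://example.com/"
        _, _, rest = attr.get("content", "").partition(";")
        rest = rest.strip()
        if rest.lower().startswith("url="):
            self.targets.append(rest[4:].strip())


def extract_meta_refresh_target(html: str) -> Optional[str]:
    """Return the first meta refresh target in html, or None if there is none."""
    parser = MetaRefreshParser()
    parser.feed(html)
    return parser.targets[0] if parser.targets else None
```

A crawler that works this way treats the extracted URL as just another link to queue, which is why the target may show up in the dashboard much later than the t.umblr page itself.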

On large, long-lived crawls, if you see 400 Bad Request on the t.umblr page and the redirected-to page is never grabbed, it might be because the &t= signature that Tumblr adds expires before grab-site fetches the t.umblr page. I have no idea how long those signatures last, but they do appear to be enforced.
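If you want to check what a given t.umblr link points at, or whether it carries a signature at all, something like the following sketch should work. It assumes only what appears in this thread: the target is in the `z` query parameter and the signature in `t`. The function name is made up, and I don't know how Tumblr validates the signature.

```python
from urllib.parse import parse_qs, urlsplit


def split_tumblr_redirect(url):
    """Return (target, signature) from a t.umblr.com/redirect URL.

    Either value is None when its parameter is missing.
    Hypothetical helper; not part of grab-site.
    """
    query = parse_qs(urlsplit(url).query)
    target = query.get("z", [None])[0]     # percent-decoded by parse_qs
    signature = query.get("t", [None])[0]  # opaque value, returned as-is
    return target, signature


print(split_tumblr_redirect("http://t.umblr.com/redirect?z=http%3A%2F%2Fi.imgur.com"))
# prints ('http://i.imgur.com', None)
```

If the original target is still reachable, passing it to grab-site directly as a URL argument would avoid depending on the signature at all.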

ivan commented 7 years ago

If the redirected-to page is really never grabbed, please reopen with repro steps.