ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

tumblr archives may not play back properly #126

Open RomeSilvanus opened 6 years ago

RomeSilvanus commented 6 years ago

I hope it's okay that I'm reopening this here and putting three issues into one report!

This follows on from my old issue, #94.

1.) t.umblr.com redirect

I tried this once again and the t.umblr.com redirect still doesn't work.

This is the command I'm using for this test (note: I am currently using the Docker build from slang800/grab-site, though I don't think that should make any difference):

    URL=http://woonastuck.tumblr.com/post/32979225783/devise-the-most-impossible-but-it-just-might
    DEST=imgur_test

    docker exec \
        grab-site-server \
        grab-site \
        --dir=/data/Pony/"Tumblr archive"/"$DEST"/tmp \
        --finished-warc-dir=/data/Pony/"Tumblr archive"/"$DEST"/warc \
        --ua "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0 but not really nor Googlebot/2.1" \
        --igsets=singletumblr \
        "$URL"

The t.umblr.com link redirects to Imgur. According to your other post, grab-site should follow the link when --no-offsite-links isn't used, but that still isn't happening.

This is how the grab-site grab looks: screenshot 2018-08-03 22 13 43

When trying to open the link in OpenWayback and WebarchivePlayer I get a 404: screenshot 2018-08-03 22 14 00

Regarding what you said about a timeout caused by &t=, I don't think it matters, since the original URL and the one grab-site grabbed are identical: screenshot 2018-08-03 22 15 12

What I want it to do is follow these redirects and also grab the page that they redirect to. Many Tumblrs I try to archive use these redirects and that leaves me with a lot of broken posts.
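For reference, the destination of these redirects appears to be carried in the redirect URL itself, in the `z` query parameter (that is how it looks in the URLs from this blog; it is an observation, not documented Tumblr behavior). A minimal Python sketch that pulls the target out of such a URL, for example to build a list of pages the crawl should also contain; the function name is made up for illustration and the example URL is shortened to the `z` and `m` parameters:

```python
from urllib.parse import urlsplit, parse_qs


def redirect_target(url):
    """Return the decoded `z` target of a t.umblr.com redirect URL, or None."""
    parts = urlsplit(url)
    if parts.hostname != "t.umblr.com":
        return None
    # parse_qs percent-decodes the values, so the target comes back as a plain URL
    targets = parse_qs(parts.query).get("z")
    return targets[0] if targets else None


# Shortened form of the redirect link on the test post (t= and b= omitted)
example = "https://t.umblr.com/redirect?z=http%3A%2F%2Fi.imgur.com%2FwjnoZ.gif&m=1"
print(redirect_target(example))
```

This only extracts the target; it does not make grab-site fetch it.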

2.) Images on different subdomains

I also found some other Tumblr content that grab-site doesn't grab correctly.

This is how it should look: screenshot 2018-08-03 22 38 43

However, Tumblr redirects to a different subdomain: screenshot 2018-08-03 22 30 43

Trying to open these gives me a 302: screenshot 2018-08-03 22 30 53

The 302 then leads to the right image, but the image still doesn't show up on the page.

3.) Audio files

(The URL with the audio: http://ask-firefox.tumblr.com/post/106080600065/im-not-big-on-mistletoe-but-if-you-wanna-rave)

Some links to audio files look as follows: screenshot 2018-08-03 22 34 30

They redirect to this, but grab-site doesn't follow them and just leaves a 404 in the .warc instead. screenshot 2018-08-03 22 35 05

I tried adding a.tumblr.com to the singletumblr ignoreset, but this didn't help.

==============================================

Maybe there is a regex fix for all of these issues that I can put in the singletumblr ignoreset, but I don't really know my way around regex at all.

ivan commented 6 years ago

Thanks for the report. I have confirmed the first issue so far. You are right that the tumblr igset is unhelpfully ignoring non-tumblr domains, including the t.umblr.com redirector. I'll see if I can fix the igset. As a workaround, leaving out the singletumblr igset might help crawl offsite stuff, though it might crawl too much of tumblr.
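To illustrate the kind of over-matching involved, here is a rough Python sketch. The pattern below is hypothetical and is NOT the actual singletumblr igset; it assumes ignore patterns behave like Python regular expressions searched against the full URL. A pattern that ignores every host other than the blog also ignores the redirector, and exempting `t.umblr.com` in the lookahead lets it through:

```python
import re

BLOG = "woonastuck.tumblr.com"

# Hypothetical pattern (not the real igset): ignore any URL whose host
# is not the blog itself.
too_broad = re.compile(r"^https?://(?!" + re.escape(BLOG) + r"/)[^/]+/")

# Same idea, but the t.umblr.com redirector is exempted as well.
with_redirector = re.compile(
    r"^https?://(?!(?:" + re.escape(BLOG) + r"|t\.umblr\.com)/)[^/]+/"
)


def is_ignored(pattern, url):
    """True if the ignore pattern matches the URL, i.e. the URL would be skipped."""
    return pattern.search(url) is not None


redirect = "https://t.umblr.com/redirect?z=http%3A%2F%2Fi.imgur.com%2FwjnoZ.gif&m=1"
print(is_ignored(too_broad, redirect))        # redirector is skipped
print(is_ignored(with_redirector, redirect))  # redirector is allowed
```

Note that in this sketch the Imgur URL itself would still be ignored by both patterns; only the redirector hop is let through.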

(the commits linked by github here don't fix the problem, ignore them)

RomeSilvanus commented 6 years ago

Sadly, not using the singletumblr ignoreset isn't an option, since I try to crawl a few hundred to a thousand Tumblr blogs periodically. And I certainly don't want to download all of Tumblr with every single crawl.

But I look forward to a possible solution to these problems!

(Maybe there could be an option to specify URLs that always get included in a crawl, regardless of the ignoreset used? Like a whitelist or a file containing the URLs.)

Edit:

Also, trying without the singletumblr ignoreset just makes the .warc redirect me to the Imgur website instead of to a page grabbed by grab-site.

ivan commented 6 years ago

I confirmed that problems 1.) and 3.) are fixed in ca8fd22c02885e8e3dfce20b609daaf1dae68e48 (or, for 1, at least t.umblr.com is no longer ignored). Can you please check if 2.) is fixed, or give me some way to reproduce that problem?

The imgur problem is going to require a separate investigation, so I opened #127 for it.

RomeSilvanus commented 6 years ago

What URLs did you test this with? I installed a brand new copy of https://github.com/ludios/grab-site/commit/ca8fd22c02885e8e3dfce20b609daaf1dae68e48, but with the URLs I use it's still the same as before.

1.) t.umblr.com redirect:

Same URL. Redirect is still not in the archive and gives a 404. screenshot 2018-08-07 21 57 38

2.) Navigation buttons

Confirmed working

3.) Audio redirect:

Tried the same URL again. The audio files still do not get grabbed. 404.

The URL: https://www.tumblr.com/audio_file/ask-firefox/106080600065/tumblr_nh2ihqEeN51replby?plead=please-dont-download-this-or-our-lawyers-wont-let-us-host-audio

is a redirect that actually leads to: https://a.tumblr.com/tumblr_nh2ihqEeN51replbyo1.mp3 (the displayed URL doesn't actually change to it, but the embedded HTML5 player uses it)

I can only guess that grab-site doesn't follow it properly, since it's not even in the log.

ivan commented 6 years ago

I tried with the URLs you gave for 1 and 3, then used gs-dump-urls on wpull.db to check whether wpull grabbed them. Can you try with --igon to confirm that something isn't ignoring t.umblr.com?

ivan commented 6 years ago

# grab-site --version
1.7.0
# grab-site http://woonastuck.tumblr.com/post/32979225783/devise-the-most-impossible-but-it-just-might --igsets=singletumblr --ua "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0 but not really nor Googlebot/2.1"
[...]
# grep t.umblr.com wpull.log
2018-08-07 20:38:02,352 - wpull.processor.web - INFO - Fetching ‘https://t.umblr.com/redirect?z=http%3A%2F%2Fi.imgur.com%2FwjnoZ.gif&t=ZTY0ODM4NWU0NjM4NWQ5NWM0N2Q0NzQ2OWU1YTA0MDA4ZmM2OWYxOCxybTlkYWZZNg%3D%3D&b=t%3Aq4se2ivApp6dqsTf9TlZTw&p=http%3A%2F%2Fwoonastuck.tumblr.com%2Fpost%2F32979225783%2Fdevise-the-most-impossible-but-it-just-might&m=1’.
2018-08-07 20:38:03,698 - wpull.processor.web - INFO - Fetched ‘https://t.umblr.com/redirect?z=http%3A%2F%2Fi.imgur.com%2FwjnoZ.gif&t=ZTY0ODM4NWU0NjM4NWQ5NWM0N2Q0NzQ2OWU1YTA0MDA4ZmM2OWYxOCxybTlkYWZZNg%3D%3D&b=t%3Aq4se2ivApp6dqsTf9TlZTw&p=http%3A%2F%2Fwoonastuck.tumblr.com%2Fpost%2F32979225783%2Fdevise-the-most-impossible-but-it-just-might&m=1’: 200 OK. Length: unspecified [text/html; charset=utf-8].
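
The same check can be done with a few lines of Python instead of grep, which also pulls out the status code. This is only a sketch based on the "Fetched" line format shown in the log excerpt above (including its curly quotes); the function name is made up, and the sample line is shortened to the `z` and `m` parameters:

```python
import re

# Matches lines like: ... INFO - Fetched ‘<url>’: 200 OK. Length: ...
# \u2018 and \u2019 are the curly quotes wpull puts around the URL.
FETCHED = re.compile(r"Fetched \u2018(?P<url>.+?)\u2019: (?P<status>\d{3})\b")


def fetched_status(log_lines, host):
    """Yield (status, url) for each 'Fetched' line whose URL mentions `host`."""
    for line in log_lines:
        match = FETCHED.search(line)
        if match and host in match.group("url"):
            yield int(match.group("status")), match.group("url")


sample = (
    "2018-08-07 20:38:03,698 - wpull.processor.web - INFO - Fetched "
    "\u2018https://t.umblr.com/redirect?z=http%3A%2F%2Fi.imgur.com%2FwjnoZ.gif&m=1\u2019"
    ": 200 OK. Length: unspecified [text/html; charset=utf-8]."
)
for status, url in fetched_status([sample], "t.umblr.com"):
    print(status, url)
```

To run it against a real crawl, pass the lines of wpull.log (opened with encoding="utf-8") instead of the sample list.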

RomeSilvanus commented 6 years ago

I see the problem. I use it with --no-offsite-links, since without it grab-site downloads way too many other Tumblrs and websites. I was under the impression that it would still grab the redirect even with this flag, since it is kind of an embed in the start URL. It is in the log though.

It does work when not using --no-offsite-links, but that's not really a good solution since, as I said, it tends to download way too much unrelated data. It turns my ~850MB .warc into a 5GB+ .warc.

But even then, the redirect does not work and it's not in the archive. Nor does it show up in either application I tried (Webrecorder Player, OpenWayback).

screenshot 2018-08-07 22 58 12 screenshot 2018-08-07 22 58 17

I know you said it needs more work, but I assumed that it would at least display the embedded image on the Tumblr page.

Using grab-site with:

    ~/gs-venv/bin/grab-site \
        --igon \
        --dir=/home/user/test/woonastuck/tmp \
        --finished-warc-dir=/home/user/test/woonastuck/warc \
        --ua "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0 but not really nor Googlebot/2.1" \
        --igsets=singletumblr \
        "http://woonastuck.tumblr.com/post/32979225783/devise-the-most-impossible-but-it-just-might"


For the audio:

At least it does fetch the audio file now. But viewing it in either application still gives me a 404. The HTML5 audio player on the page still doesn't find the file.

So either grab-site doesn't rewrite something correctly, or there's a general problem with these applications.

beret commented 5 years ago

It looks like the changed singletumblr igset might be preventing crawls that start at the root of a tumblelog, e.g. https://staff.tumblr.com, when the URL lacks a trailing slash.

Is this expected behavior?
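
In the meantime, one possible workaround is to add the trailing slash before handing the URL to grab-site. That this sidesteps the problem is an assumption based on the behavior described above, not something verified against the igset. A small Python sketch, with a made-up helper name:

```python
from urllib.parse import urlsplit, urlunsplit


def with_root_slash(url):
    """Return `url` with "/" as its path if the path is empty; otherwise unchanged."""
    parts = urlsplit(url)
    if parts.path == "":
        parts = parts._replace(path="/")
    return urlunsplit(parts)


print(with_root_slash("https://staff.tumblr.com"))
```

URLs that already have a path, such as individual post URLs, pass through untouched.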

ivan commented 5 years ago

No, that's unexpected and undesired; I'll file a bug for it. Thanks for the report.