Open RomeSilvanus opened 6 years ago
Thanks for the report. I have confirmed the first issue so far. You are right that the tumblr igset is unhelpfully ignoring non-tumblr domains, including the t.umblr.com redirector. I'll see if I can fix the igset. As a workaround (though it might crawl too much of Tumblr), not using the singletumblr igset might help crawl offsite stuff.
(the commits linked by github here don't fix the problem, ignore them)
Sadly, not using the singletumblr ignoreset isn't an option, since I try to crawl a few 100-1000 Tumblr blogs periodically. And I certainly don't want to download all of Tumblr with every single crawl.
But I look forward to a possible solution to these problems!
(Maybe there could be an option to specify URLs that always get included in a crawl, regardless of the ignoreset used? Like a whitelist or a file containing the URLs.)
Also, trying without the singletumblr ignoreset just makes the .warc redirect me to the Imgur website instead of to a page grabbed by grab-site.
I confirmed that problems 1.) and 3.) are fixed in ca8fd22c02885e8e3dfce20b609daaf1dae68e48 (or, for 1, at least t.umblr.com is no longer ignored). Can you please check if 2.) is fixed, or give me some way to reproduce that problem?
The imgur problem is going to require a separate investigation, so I opened #127 for it.
What URLs did you test this with? I installed a brand new copy of https://github.com/ludios/grab-site/commit/ca8fd22c02885e8e3dfce20b609daaf1dae68e48, but with the URLs I use it's still the same as before.
Same URL. Redirect is still not in the archive and gives a 404.
Confirmed working
Tried the same URL again. The audio files still do not get grabbed. 404.
The URL:
https://www.tumblr.com/audio_file/ask-firefox/106080600065/tumblr_nh2ihqEeN51replby?plead=please-dont-download-this-or-our-lawyers-wont-let-us-host-audio
is a redirect that actually leads to:
https://a.tumblr.com/tumblr_nh2ihqEeN51replbyo1.mp3
(the display URL doesn't actually change to it, but the embed HTML5 player uses it)
I can only guess that grab-site doesn't follow it properly, since it's not even in the log.
I tried with the URLs you gave for 1 and 3, then used gs-dump-urls on wpull.db to check whether wpull grabbed them. Can you try with --igon to confirm that something isn't ignoring t.umblr.com?
# grab-site --version
1.7.0
# grab-site http://woonastuck.tumblr.com/post/32979225783/devise-the-most-impossible-but-it-just-might --igsets=singletumblr --ua "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0 but not really nor Googlebot/2.1"
[...]
# grep t.umblr.com wpull.log
2018-08-07 20:38:02,352 - wpull.processor.web - INFO - Fetching ‘https://t.umblr.com/redirect?z=http%3A%2F%2Fi.imgur.com%2FwjnoZ.gif&t=ZTY0ODM4NWU0NjM4NWQ5NWM0N2Q0NzQ2OWU1YTA0MDA4ZmM2OWYxOCxybTlkYWZZNg%3D%3D&b=t%3Aq4se2ivApp6dqsTf9TlZTw&p=http%3A%2F%2Fwoonastuck.tumblr.com%2Fpost%2F32979225783%2Fdevise-the-most-impossible-but-it-just-might&m=1’.
2018-08-07 20:38:03,698 - wpull.processor.web - INFO - Fetched ‘https://t.umblr.com/redirect?z=http%3A%2F%2Fi.imgur.com%2FwjnoZ.gif&t=ZTY0ODM4NWU0NjM4NWQ5NWM0N2Q0NzQ2OWU1YTA0MDA4ZmM2OWYxOCxybTlkYWZZNg%3D%3D&b=t%3Aq4se2ivApp6dqsTf9TlZTw&p=http%3A%2F%2Fwoonastuck.tumblr.com%2Fpost%2F32979225783%2Fdevise-the-most-impossible-but-it-just-might&m=1’: 200 OK. Length: unspecified [text/html; charset=utf-8].
I see the problem. I use it with --no-offsite-links since without it grab-site downloads way too many other Tumblrs and websites.
I was under the impression that it will still grab the redirect even when using this flag since it is kinda an embed in the start URL. It is in the log though.
It does work when not using --no-offsite-links, but that's not really a good solution since, as I said, it tends to download way too much unrelated data. It turns my ~850MB .warc into a 5GB+ .warc.
But even then, the redirect does not work, and it's not in the archive. Nor does it show up in either of the applications I tried (Webrecorder Player, OpenWayback).
I know you said it needs more work, but I assumed that it would at least display the embed image on the Tumblr page.
Using grab-site with:
~/gs-venv/bin/grab-site \
--igon \
--dir=/home/user/test/woonastuck/tmp \
--finished-warc-dir=/home/user/test/woonastuck/warc \
--ua "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0 but not really nor Googlebot/2.1" \
--igsets=singletumblr \
"http://woonastuck.tumblr.com/post/32979225783/devise-the-most-impossible-but-it-just-might"
For the audio:
At least it does fetch the audio file now. But viewing it in either application still gives me a 404. The HTML5 audio player on the page still doesn't find the file.
So either grab-site doesn't rewrite something correctly, or there's a general problem with these applications.
It looks like the changed singletumblr igset might be preventing crawls starting at the root of a tumblelog, e.g. https://staff.tumblr.com, when they lack a trailing slash.
Is this expected behavior?
No, that's unexpected and undesired, I'll file a bug for it. Thanks for the report.
I hope it is okay if I reopen this here, and to put three issues at once in it!
Coming from my old issue #94
1.) t.umblr.com redirect
I tried this once again and the t.umblr.com redirect still doesn't work.
This is the command I'm using for this test (note, I am currently using the Docker build by slang800/grab-site, I don't think it should make any difference though):
URL=http://woonastuck.tumblr.com/post/32979225783/devise-the-most-impossible-but-it-just-might
DEST=imgur_test
docker exec \
grab-site-server \
grab-site \
--dir=/data/Pony/"Tumblr archive"/"$DEST"/tmp \
--finished-warc-dir=/data/Pony/"Tumblr archive"/"$DEST"/warc \
--ua "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0 but not really nor Googlebot/2.1" \
--igsets=singletumblr \
"$URL"
The t.umblr.com links to Imgur. According to your other post it should follow the link when --no-offsite-links isn't used, however this still isn't the case.
This is how the grab-site grab looks:
When trying to open the link in OpenWayback and WebarchivePlayer I get a 404:
Coming back to what you said about the timeout by &t=, I don't think it matters since both the original URL and the one grab-site grabbed are the same:
What I want it to do is follow these redirects and also grab the page that they redirect to. Many Tumblrs I try to archive use these redirects and that leaves me with a lot of broken posts.
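To make the wanted behaviour concrete, here is a minimal Python sketch that pulls the destination out of a t.umblr.com redirect URL. It assumes the `z` query parameter carries the percent-encoded target, which is what the logged URL in this thread suggests but is not confirmed anywhere official. The helper name `redirect_target` is made up for illustration and is not part of grab-site or wpull.

```python
from urllib.parse import urlsplit, parse_qs


def redirect_target(url):
    """Return the decoded `z` target of a t.umblr.com redirect URL, or None.

    Assumption: the redirector keeps its destination in the `z` parameter.
    """
    parts = urlsplit(url)
    if parts.hostname != "t.umblr.com":
        return None
    targets = parse_qs(parts.query).get("z")
    return targets[0] if targets else None


# The redirect URL exactly as it appears in the wpull.log excerpt in this thread.
logged = (
    "https://t.umblr.com/redirect?z=http%3A%2F%2Fi.imgur.com%2FwjnoZ.gif"
    "&t=ZTY0ODM4NWU0NjM4NWQ5NWM0N2Q0NzQ2OWU1YTA0MDA4ZmM2OWYxOCxybTlkYWZZNg%3D%3D"
    "&b=t%3Aq4se2ivApp6dqsTf9TlZTw"
    "&p=http%3A%2F%2Fwoonastuck.tumblr.com%2Fpost%2F32979225783"
    "%2Fdevise-the-most-impossible-but-it-just-might"
    "&m=1"
)

print(redirect_target(logged))  # http://i.imgur.com/wjnoZ.gif
```

Targets decoded this way could be collected into a list of extra URLs to crawl, so the redirect destinations end up in the archive alongside the post.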
2.) Images on different subdomains
I also found some other problems with Tumblr that grab-site doesn't grab right.
This is how it should look:
However, Tumblr does a redirect to a different subdomain:
Trying to open these gives me a 302, which then opens the right image; however, the image still doesn't show on the page.
3.) Audio files
( the URL with the audio: http://ask-firefox.tumblr.com/post/106080600065/im-not-big-on-mistletoe-but-if-you-wanna-rave )
Some links to audio files look as follows:
They redirect to this, but grab-site doesn't follow them and just leaves a 404 in the .warc instead.
I tried adding a.tumblr.com to the singletumblr ignoreset, but this didn't help.
Maybe there is a Regex fix for all of these issues I can put in the singletumblr ignoreset, but I don't really know my way around Regex at all.
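As a rough idea of what such a regex could look like, here is a Python sketch of an ignore pattern with a negative lookahead that exempts a few hosts. This is not the real singletumblr igset: the pattern, the `ALLOWED_HOSTS` list and the `is_ignored` helper are all hypothetical, and it assumes ignore patterns are Python-style regular expressions matched against the full URL.

```python
import re

# Hypothetical allow-list: the blog being crawled, the redirector,
# and the audio host mentioned in this thread.
ALLOWED_HOSTS = ["ask-firefox.tumblr.com", "t.umblr.com", "a.tumblr.com"]

# Ignore every http(s) URL whose host is NOT one of the allowed hosts.
# The lookahead requires the host to end at ':', '/', '?', '#' or end of
# string, so "a.tumblr.com.example.org" is not mistaken for "a.tumblr.com".
IGNORE = re.compile(
    r"^https?://(?!(?:"
    + "|".join(re.escape(host) for host in ALLOWED_HOSTS)
    + r")(?:[:/?#]|$))"
)


def is_ignored(url):
    """Return True if the hypothetical ignore pattern matches the URL."""
    return IGNORE.search(url) is not None


for url in [
    "https://a.tumblr.com/tumblr_nh2ihqEeN51replbyo1.mp3",
    "https://t.umblr.com/redirect?z=http%3A%2F%2Fi.imgur.com%2FwjnoZ.gif",
    "https://staff.tumblr.com/",
]:
    print(url, "ignored" if is_ignored(url) else "kept")
```

An ignore pattern only decides what gets skipped, so an exemption like this would not by itself make a crawl follow a redirect that another option is blocking.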