bbolli / tumblr-utils

Utilities for dealing with Tumblr blogs, Tumblr backup
GNU General Public License v3.0

Retries on failed downloaded media #201

Open indrakaw opened 5 years ago

indrakaw commented 5 years ago

Once or twice, tumblr_backup.py failed to download media (images and videos; I haven't seen any errors on audio so far). The error isn't 404, but 500, 104, etc. The media itself exists; I checked it in a web browser.

I had to download the files manually one by one, move them to the media/ folder, then rename them to match the related post page and archive. I had to grep and edit the HTML manually. It's a lot of work.

Suggestion: Would it be nice if there were a retry option for media? E.g., on error, retry the download 3 or 5 times.

That's it. Sorry my English isn't good.

Edit: Fixed typos and grammar.

cebtenzzre commented 5 years ago

Can you verify with curl or wget that simply retrying these downloads actually helps? It may be that you need cookies to access certain resources. Also, what specific error codes are you seeing that cannot be reproduced in a browser? Improved error 429 handling is probably doable, if that's applicable here. If your internet connection is flaky, that's a case that might need special handling and a new flag.

API requests are in fact retried up to 10 times, but download_media only makes 1 attempt.
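As a rough illustration of what retrying a media download could look like, here is a minimal sketch using only the Python 3 standard library. It is not the code in tumblr_backup.py: `fetch_with_retries`, its parameters, and the set of status codes treated as permanent are all assumptions made for this example.

```python
import time
import urllib.error
import urllib.request

# Assumption for this sketch: these HTTP statuses will not succeed on retry.
PERMANENT_HTTP_CODES = {400, 401, 403, 404, 410}


def fetch_with_retries(url, attempts=3, backoff=1.0,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Return the body of url, retrying failures that look transient.

    opener and sleep are injectable so the logic can be tested offline.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with opener(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            # HTTPError must be caught before OSError, which it subclasses.
            if e.code in PERMANENT_HTTP_CODES:
                raise
            last_error = e
        except OSError as e:
            # Covers URLError, timeouts, and [Errno 104] connection resets.
            last_error = e
        if attempt < attempts:
            # Exponential backoff: backoff, 2*backoff, 4*backoff, ...
            sleep(backoff * 2 ** (attempt - 1))
    raise last_error
```

With this shape, a 404 fails immediately, while a connection reset or a 500 is tried again up to `attempts` times before the last error is raised.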

indrakaw commented 5 years ago

HTTP Error 404: Not Found downloading https://66.media.tumblr.com/726a1584c574e008acf6052ec103cb46/tumblr_mxloql6eYo1qbtj59o1_1280.jpg
HTTP Error 404: Not Found downloading https://66.media.tumblr.com/16e1444d7c212e83c540e54c1cdafa5b/tumblr_mx4uqiOkpA1s112kao1_640.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/tumblr_l47lxvUFfQ1qaorkyo1_1280.jpg
HTTP Error 404: Not Found downloading https://66.media.tumblr.com/tumblr_m5niaqmGXR1qixdtpo1_1280.jpg
HTTP Error 404: Not Found downloading https://66.media.tumblr.com/tumblr_lwtrueYw9M1qfkqo0o1_540.png
HTTP Error 404: Not Found downloading https://66.media.tumblr.com/25903a4f884077219036aa4ce4b01a4d/tumblr_mspk0f6t4h1shslwyo1_640.jpg
<urlopen error timed out> downloading http://www.assoc-amazon.jp/e/ir?t=kozo0b-22&amp;l=as2&amp;o=9&amp;a=4087806405
<urlopen error [Errno -3] Temporary failure in name resolution> downloading https://twiant.com/img/banners/pr/120x60.gif
HTTP Error 404: Not Found downloading https://66.media.tumblr.com/d98c39e8809030f38a31060626300b15/tumblr_miqzkqvs2r1qjqxmoo1_1280.jpg
HTTP Error 404: Not Found downloading https://66.media.tumblr.com/tumblr_m2puq5oWCS1qzqc75o1_1280.jpg
HTTP Error 404: Not Found downloading https://66.media.tumblr.com/tumblr_m168qeV1TC1qzcd3bo1_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/66cf5caf00377609845a4f5e3b65547f/tumblr_mo18vbbhV61rrns0oo1_1280.jpg
abcbabcba: Getting posts 72500 to 72549 of 129433

Aside from the 404s, the media is also not downloaded when Errno 104 occurs. This leaves the posts and archive files pointing at the remote media.

Here is the curl output from the same machine, on the same connection:

debian@cloudshell:~$ curl -Iv https://66.media.tumblr.com/66cf5caf00377609845a4f5e3b65547f/tumblr_mo18vbbhV61rrns0oo1_1280.jpg
*   Trying 152.199.38.136...
* TCP_NODELAY set
* Connected to 66.media.tumblr.com (152.199.38.136) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=California; L=Sunnyvale; O=Tumblr Inc; OU=Information Technology; CN=*.media.tumblr.com
*  start date: Dec  7 00:00:00 2018 GMT
*  expire date: Jun  5 12:00:00 2019 GMT
*  subjectAltName: host "66.media.tumblr.com" matched cert's "*.media.tumblr.com"
*  issuer: C=US; O=DigiCert Inc; CN=DigiCert SHA2 Secure Server CA
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x560ab0b09db0)
> HEAD /66cf5caf00377609845a4f5e3b65547f/tumblr_mo18vbbhV61rrns0oo1_1280.jpg HTTP/1.1
> Host: 66.media.tumblr.com
> User-Agent: curl/7.52.1
> Accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
< HTTP/2 200
HTTP/2 200
< accept-ranges: bytes
accept-ranges: bytes
< access-control-allow-methods: GET
access-control-allow-methods: GET
< access-control-allow-origin: *
access-control-allow-origin: *
< access-control-max-age: 600
access-control-max-age: 600
< age: 148526
age: 148526
< alt-svc: quic=":443"; ma=2592000; v="43,41,39,35"
alt-svc: quic=":443"; ma=2592000; v="43,41,39,35"
< cache-control: max-age=1209600
cache-control: max-age=1209600
< content-type: image/jpeg
content-type: image/jpeg
< date: Sat, 05 Jan 2019 12:53:23 GMT
date: Sat, 05 Jan 2019 12:53:23 GMT
< etag: "877265ee065d4a8a0095f8cb2393c262-1498089600-663f79f"
etag: "877265ee065d4a8a0095f8cb2393c262-1498089600-663f79f"
< last-modified: Thu, 22 Jun 2017 00:00:00 GMT
last-modified: Thu, 22 Jun 2017 00:00:00 GMT
< server: ECAcc (sgc/C942)
server: ECAcc (sgc/C942)
< timing-allow-origin: *
timing-allow-origin: *
< x-cache: HIT
x-cache: HIT
< x-frames: 1
x-frames: 1
< content-length: 151813
content-length: 151813

<
* Curl_http_done: called premature == 0
* Connection #0 to host 66.media.tumblr.com left intact

Seems fine. I checked the file directly in a web browser and it's normal.

cebtenzzre commented 5 years ago

I have a feeling this has to do with headers (including user agent). It seems unlikely that Tumblr's servers would be resetting the connection because of e.g. too many requests. Does this happen for the same media every time?
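One way to check the header theory would be to send the same request with different User-Agent values and compare the outcomes. The following is only a sketch under assumptions: `build_request` and `probe` are hypothetical helper names, and nothing here is taken from tumblr_backup.py.

```python
import urllib.request


def build_request(url, user_agent=None):
    """Build a HEAD request, optionally overriding the User-Agent header."""
    headers = {}
    if user_agent:
        headers["User-Agent"] = user_agent
    # Without an explicit User-Agent, urllib adds its own default when
    # the request is opened.
    return urllib.request.Request(url, headers=headers, method="HEAD")


def probe(url, user_agent=None, timeout=10):
    """Return a short description of what happened for this request."""
    req = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return "HTTP %d" % resp.status
    except OSError as e:
        # HTTPError, URLError, timeouts, and connection resets all land here.
        return "error: %s" % e
```

Running `probe(url)` and `probe(url, "curl/7.52.1")` several times against one of the failing URLs would show whether the resets follow the user agent or happen regardless.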

indrakaw commented 5 years ago

Try throttling your network while downloading this:

tumblr_backup.py -j -I i --save-video-tumblr --save-audio honkino

indrakaw commented 5 years ago

Additional log:

$ tumblr_backup.py -j -I i --save-video-tumblr  --save-audio 46pic                                               
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/7dce22db115e0787b6689994035c08c5/tumblr_ok1sxbBAy81vc1y9yo5_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/226e43c69d5e8724a57e0cb9d623be8f/tumblr_ok1s8xKKCo1vc1y9yo1_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/6988581d9d3ce8bcbcf2e5b2e22a1faa/tumblr_ok1sj9ngxH1vc1y9yo1_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/b0e868ced6e81b8e12c93972cb05b5b1/tumblr_ok1rc2jTep1vc1y9yo1_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/0d5d68d88306d2c294b2dd697844a76e/tumblr_okszt1tWvi1vc1y9yo4_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/47aece6113340865fa908d52a72b25a4/tumblr_okaqe7BDBE1vc1y9yo2_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/68f8bf29f5fa40071e0e143a31f16af6/tumblr_ok1rc2jTep1vc1y9yo2_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/6c9b7bd71e661f5937f6e7039e96f2c5/tumblr_ofiq2eSkb21vc1y9yo1_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/cde4fbcc4134d7842f6b63eb81fc9d3a/tumblr_oeukntViNr1vc1y9yo1_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/18b8ab8fdf48e692e67f0abb60e005d3/tumblr_oeuku9QxPH1vc1y9yo3_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/a35cf4da1d6a548e3b1b1165bb60259d/tumblr_ofd2df7gDz1vc1y9yo4_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/f85ce41c65cc94f647200c12461c4054/tumblr_oetk8jlhbr1vc1y9yo1_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/4755cdda540afd434105c27d6915be57/tumblr_of7chebZKu1vc1y9yo6_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/2085f32c27995c8f3e88e3b4b46e6179/tumblr_oec0hj1yOW1vc1y9yo7_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/efdb48e65cde8f9eebf5b7126e441367/tumblr_oec086t8Xv1vc1y9yo3_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/85f3069de266a9dd9dc71065eda20c68/tumblr_oec0bxfPtd1vc1y9yo1_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/4493cb0fcfc65bfdccbeef9206cc29cd/tumblr_oedkloWaME1vc1y9yo10_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/94af374ad50c81eee25311861a1b69ae/tumblr_oebzvxoPpb1vc1y9yo7_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/ff62fd034c95e229732971975e069c55/tumblr_oebzs3BDdb1vc1y9yo6_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/a9c48ca2ec6c46dc9e85265c59cb3041/tumblr_oedlcuKRMu1vc1y9yo9_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/2c252695f22b4845d4b6bdb9d4b26144/tumblr_ocghx2QBBJ1vc1y9yo7_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/e387f5737515382167a96c61c866ab1c/tumblr_ocg7mu3Cj11vc1y9yo3_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/72d0162be74a7e7f64cdbd3cf4d68706/tumblr_ocg7orwkb81vc1y9yo2_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/2bb24095dedc1a0fa495f2455521aa4a/tumblr_ocg7shrIOK1vc1y9yo1_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/190ab90843f7a1b45202c9d83e62d86f/tumblr_ocggduFa8W1vc1y9yo3_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/36c78e0ce1fed6f5beed9d831867ffc0/tumblr_ocgdbwOt6h1vc1y9yo2_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/35fecb86e6d93c4c77633dec1959d48f/tumblr_ocg7ig7gOB1vc1y9yo1_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/bf019845b5f68a9f32dc812825132984/tumblr_ocfo51ZeyA1vc1y9yo1_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/0afa265aceb1f31eababb1c091ae7050/tumblr_ocg83dlaWI1vc1y9yo5_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/b090f50cd8701def1bd0ac929b3929e1/tumblr_ocgdrn83Ft1vc1y9yo7_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/e2681d72100cb0b53a99e2f378870d91/tumblr_ocg4o2vJU61vc1y9yo1_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/9c14612fc4452e61553e1106d74a36aa/tumblr_ocfhy8gtXd1vc1y9yo1_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/b7809a79b573fd4a63239b3506639b02/tumblr_ocfhmrty2n1vc1y9yo1_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/8568403168636e0167c48b5db84581a4/tumblr_ocfobmffQg1vc1y9yo6_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/c8f2d95939d2a731f11ebbb205653735/tumblr_ocg83dlaWI1vc1y9yo1_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/d0df083226580f3660f2155b646f64d1/tumblr_ocfgu2HC1Q1vc1y9yo3_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/aba13c3b6416dc231202d420ec589650/tumblr_ocfobmffQg1vc1y9yo4_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/1961e272020f3c06513b9cccef33d3e6/tumblr_ocfgjcLxo71vc1y9yo3_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/b30eeed072ce2dc093726b245b13dc89/tumblr_ocfo51ZeyA1vc1y9yo10_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/94749c0e0c5baadd94d025845b937d5c/tumblr_ocfgu2HC1Q1vc1y9yo6_r1_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/2589ecb84ea8fa16b3e3aadc5f37ff70/tumblr_ocfgzdDECG1vc1y9yo6_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/107739a66640a8b753b189819293c9c0/tumblr_ocdiezy20M1vc1y9yo2_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/3599c07ca2d3e359f6b8b6195f7b501c/tumblr_ocfca58ftv1vc1y9yo2_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/c5fd2ecafea1535ef9f660a1717ec374/tumblr_ocdi3d2YLF1vc1y9yo1_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/fe2d14b907f5a5c01cae54a7b7c3a43b/tumblr_ocdhryoliL1vc1y9yo1_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/78c2d9286533fc09fb7b1abf7307b0a1/tumblr_ocdhz5z7XV1vc1y9yo4_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/5414a1091de08a1c84c5934b69569fd6/tumblr_ocdec3dFWa1vc1y9yo4_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/5fe831c162f8900e0c41122ba8bdeed8/tumblr_ocdf53KEs41vc1y9yo2_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/9d84810a8998858d181f789e7992ce0c/tumblr_ocdgc6pRx31vc1y9yo1_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/2c0b4e860b94bdebf403a6ca71aa749a/tumblr_ocdfvdD0Kb1vc1y9yo1_1280.jpg
[Errno 104] Connection reset by peer downloading https://66.media.tumblr.com/81ea34a25dc318d794008c00a841bb05/tumblr_ocdetxCTdj1vc1y9yo3_1280.jpg
46pic: 2742 posts backed up       

That's a safe-for-work Tumblr blog. Most of the images shown in the errors exist (they are not 404), but couldn't be downloaded.

Missing images can be downloaded manually. After downloading one, I have to grep for the URL to find its post and archive pages. Once they are found, I rename the image to the post ID, then edit the content of the pages. Having 12 or fewer missing images is fine, but 20+ missing images is frustrating.

Please fix this problem.
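Until retries exist, the manual grep-and-edit step described above could be scripted. This is a minimal sketch, assuming the backup is a directory tree of .html pages and that the missing file has already been downloaded by hand and placed in the backup; `relink_media` and its arguments are hypothetical names, not part of tumblr_backup.py.

```python
import os
from pathlib import Path


def relink_media(backup_dir, remote_url, media_file):
    """Point every page that references remote_url at the local media_file.

    Returns the list of pages that were rewritten.
    """
    changed = []
    for page in sorted(Path(backup_dir).rglob("*.html")):
        text = page.read_text(encoding="utf-8")
        if remote_url not in text:
            continue
        # Compute the link relative to the page, so pages in different
        # subdirectories each get a path that resolves correctly.
        rel = os.path.relpath(media_file, start=page.parent)
        rel = rel.replace(os.sep, "/")
        page.write_text(text.replace(remote_url, rel), encoding="utf-8")
        changed.append(page)
    return changed
```

The match is a plain substring search, so a URL that appears HTML-escaped in the page (for example with `&amp;`) would need to be passed in its escaped form.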