bbolli / tumblr-utils

Utilities for dealing with Tumblr blogs, Tumblr backup
GNU General Public License v3.0

youtube_dl error with embeds, ydl.prepare_filename vs. sanitize_filename #78

Closed Hrxn closed 5 years ago

Hrxn commented 8 years ago

Okay... Decided to look into this again, with my own test case..

Let me clean up the old stuff here first..

Hrxn commented 7 years ago

Here is a small Tumblr blog that I used for testing: http://broken-embeds-test.tumblr.com/ It contains 4 videos, embedded from Instagram, and 4 pictures (posted in Tumblr as links; posting them as images doesn't work straightforwardly, apparently).

Here's the comparison between the commits with the changes to the function. https://github.com/bbolli/tumblr-utils/compare/2c92d2f816ab34ab595d6a2c3defb5bd4525d3b9...1d3b15fec0609f1258d305fff7de95a9e441cc67

The result for 1d3b15fec0609f1258d305fff7de95a9e441cc67

D:\Etc\TUMBLR\1d3b15f>D:\Inst\Python\python.exe D:\Src\tumblr-utils-1d3b15fec0609f1258d305fff7de95a9e441cc67\tumblr_backup.py --save-video broken-embeds-test.tumblr.com
WARNING: Falling back on generic information extractor.r.com: 0 remaining posts to save
WARNING: Falling back on generic information extractor.
WARNING: Falling back on generic information extractor.
WARNING: Falling back on generic information extractor.
Unable to download video in post #151625452707
Unable to download video in post #151625405267
Unable to download video in post #151625431177
Unable to download video in post #151625377647
broken-embeds-test.tumblr.com: 8 posts backed up

D:\Etc\TUMBLR\1d3b15f>

The result for 2c92d2f816ab34ab595d6a2c3defb5bd4525d3b9

D:\Etc\TUMBLR\2c92d2f>D:\Inst\Python\python.exe D:\Src\tumblr-utils-2c92d2f816ab34ab595d6a2c3defb5bd4525d3b9\tumblr_backup.py --save-video broken-embeds-test.tumblr.com
WARNING: Falling back on generic information extractor.r.com: broken-embeds-test.tumblr.com: 0 remaining posts to save
WARNING: Falling back on generic information extractor.
WARNING: Falling back on generic information extractor.
WARNING: Falling back on generic information extractor.
WARNING: Falling back on generic information extractor.
WARNING: Falling back on generic information extractor.
WARNING: Falling back on generic information extractor.
WARNING: Falling back on generic information extractor.
broken-embeds-test.tumblr.com: 8 posts backed up

D:\Etc\TUMBLR\2c92d2f>

(I also have the console output with 'quiet': 'False' in the YoutubeDL properties if necessary, but it only says that it's failing, not why...)

Comparing the result:

D:\Etc\TUMBLR>dir 1d3b15f\broken-embeds-test.tumblr.com\media
 Volume in drive D is Home
 Volume Serial Number is 8851-7591

 Directory of D:\Etc\TUMBLR\1d3b15f\broken-embeds-test.tumblr.com

File Not Found

D:\Etc\TUMBLR>dir 2c92d2f\broken-embeds-test.tumblr.com\media
 Volume in drive D is Home
 Volume Serial Number is 8851-7591

 Directory of D:\Etc\TUMBLR\2c92d2f\broken-embeds-test.tumblr.com\media

10.10.2016  23:27    <DIR>          .
10.10.2016  23:27    <DIR>          ..
19.09.2016  09:38         4.733.902 BKh4Z19g6TB_waldbaumalex_Video_by_waldbaumalex.mp4
19.09.2016  10:17         2.068.892 BKh89aYAEqB_waldbaumalex_Video_by_waldbaumalex.mp4
19.09.2016  10:14         4.438.481 BKh8l8dAr67_waldbaumalex_Video_by_waldbaumalex.mp4
19.09.2016  10:12         1.909.457 BKh8TtDgZmG_waldbaumalex_Video_by_waldbaumalex.mp4
               4 File(s)     13.150.732 bytes
               2 Dir(s)  18.384.117.760 bytes free

D:\Etc\TUMBLR>

As you can see, 1d3b15f doesn't have the media subdir, while 2c92d2f has a media subdir with 4 .mp4 files inside.

@bbolli would be interesting to know if this issue only happens on Windows..

Anyway, the culprit seems to be the line that builds the file name. The old variant, media_filename = ydl.prepare_filename(result), can download the files, whereas the new variant, media_filename = sanitize_filename(filetmpl % result['entries'][0], restricted=True), doesn't work anymore.
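One possible explanation, not confirmed anywhere in this thread: youtube-dl puts an 'entries' list only on playlist-style results, while a single video comes back as a flat dict, so result['entries'][0] would raise KeyError for it. The sketch below uses plain dicts as stand-ins for the info dict. FILETMPL and the helper names are hypothetical; the template was picked only because it matches the file names in the directory listing above. The sanitize_filename step is left out entirely.

```python
# Hypothetical output template; an assumption chosen to match the
# file names seen in the 2c92d2f media directory.
FILETMPL = '%(id)s_%(uploader_id)s_%(title)s.%(ext)s'


def first_entry(result):
    """Return the first playlist entry, or the result itself if it is a single video."""
    entries = result.get('entries')
    return entries[0] if entries else result


def media_name(result, tmpl=FILETMPL):
    """Fill the template from whichever dict actually carries the video fields."""
    return tmpl % first_entry(result)


# Stand-ins for what extract_info() might return (assumed shapes).
single = {'id': 'BKh4Z19g6TB', 'uploader_id': 'waldbaumalex',
          'title': 'Video by waldbaumalex', 'ext': 'mp4'}
playlist = {'entries': [single]}

try:
    FILETMPL % single['entries'][0]   # the newer variant's lookup
    naive_ok = True
except KeyError:
    naive_ok = False                  # flat result: no 'entries' key

print(naive_ok)
print(media_name(single) == media_name(playlist))
```

If this guess is right, falling back to the result itself when there is no 'entries' key would let both shapes produce the same name.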

indrakaw commented 7 years ago

I'm using Ubuntu 14 LTS.

This error only happens when --save-video is enabled, because neither youtube-dl nor tumblr_backup supports Instagram yet. Please improve this.

Hrxn commented 7 years ago

This error only happens when --save-video is enabled.

Yes. That's exactly the problem, usage of youtube-dl within tumblr_backup, obviously.

And youtube-dl does support Instagram: https://github.com/rg3/youtube-dl/blob/master/youtube_dl/extractor/instagram.py

Can you provide some example Instagram links to reproduce the problem? We'll see if my theory is right..

indrakaw commented 7 years ago

Can you provide some example Instagram links to reproduce the problem?

Try this:

tumblr_backup.py --save-video --save-audio -k --image-names -N 0 ablogthathasaninstagrampostonit

Be sure youtube-dl is installed via pip.

Hrxn commented 7 years ago

Result of the Tumblr API for

ablogthathasaninstagrampostonit

{
"meta": {
"status": 404,
"msg": "Not Found"
},
"response": []
}
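A minimal sketch of checking such a reply programmatically, assuming only the JSON shape shown above (a "meta" object with "status" and "msg"). The function name api_status is hypothetical, and the reply string is just the response above on one line.

```python
import json


def api_status(raw):
    """Return (status, msg) from the 'meta' object of a Tumblr-style API reply."""
    meta = json.loads(raw).get('meta', {})
    return meta.get('status'), meta.get('msg')


reply = '{"meta": {"status": 404, "msg": "Not Found"}, "response": []}'
status, msg = api_status(reply)
if status != 200:
    print('blog lookup failed: %s %s' % (status, msg))
```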

You sure that's the right one?

indrakaw commented 7 years ago

@Hrxn

You sure that's the right one?

No. To be honest, it's honkawa (NSFW blog).

Hrxn commented 7 years ago

Ah, okay.

This blog is working fine, but where exactly are the Instagram videos that return an error?

indrakaw commented 7 years ago

@Hrxn sorry for the late response.

youtube-dl is installed via pip install youtube_dl, so an outdated version isn't the issue. The error log is here: https://asciinema.org/a/advempwlelwwpj7w8eqhbubwh

That terminal session was recorded remotely via CodeEnvy, because the download takes hours from my location.

Hrxn commented 7 years ago

Okay.. I think I see the issue here..

For example, here are some posts that return ERROR: Unable to extract video url; please report [...], taken from your log:

https://honkawa.tumblr.com/post/127313935430
https://honkawa.tumblr.com/post/127308001840
https://honkawa.tumblr.com/post/127301058340
https://honkawa.tumblr.com/post/127231379175
https://honkawa.tumblr.com/post/127282439995
https://honkawa.tumblr.com/post/127245103290
https://honkawa.tumblr.com/post/127207461220
https://honkawa.tumblr.com/post/127214144045
https://honkawa.tumblr.com/post/127154233910
https://honkawa.tumblr.com/post/127154217275

These are all pictures hosted on Instagram, not videos. (Not even really NSFW, considering Instagram's policies.)

youtube-dl doesn't work here; it returns the error mentioned in your log because it only accepts videos at the moment.

Every Tumblr post belongs to a certain type (the Tumblr Dashboard now shows Text, Photo, Quote, Link, Chat, Audio, Video, for example).

In the past, it was possible to select Photo, then 'add photo from URL', and use the link to an Instagram post. The result was a photo post whose picture linked to that Instagram post, but the photo was also stored on Tumblr, so it could be backed up by tumblr-utils etc. I assume this changed only fairly recently, perhaps when Instagram added multi-page/multi-photo posts.

This doesn't seem to work any longer. You now have to use the Link type. Or, ironically, and that is also what your example blog (honkawa) is doing, use the post type Video > Add video from web. And youtube-dl doesn't work here, as mentioned.

Example here: https://embedded-demos.tumblr.com/ vs https://embedded-demos.tumblr.com/archive

@bbolli Can you reproduce? Any ideas here, how to work around this kind of "type confusion"?

Sucks that Instagram photos also don't work any longer with tumblr-utils.

Although, the picture still appears to be there as before:

E:\Test\Test>curl -s -o "1.jpg" https://68.media.tumblr.com/f1b98d4636c658d2558eaea7e5615ae7/tumblr_oo5392llzc1wn82de_og_1280.jpg
E:\Test\Test>curl -s -o "2.jpg" https://scontent.cdninstagram.com/t51.2885-15/e35/17125916_1832179880333322_5336448482473410560_n.jpg

SHA256 hash of file 1.jpg:
cd 7a d2 11 d8 6e ec 61 be 4f 08 6b ba 09 b1 f6 e2 83 23 a0 b2 14 5e d7 18 ad 56 51 16 d9 93 c9
CertUtil: -hashfile command completed successfully.
SHA256 hash of file 2.jpg:
cd 7a d2 11 d8 6e ec 61 be 4f 08 6b ba 09 b1 f6 e2 83 23 a0 b2 14 5e d7 18 ad 56 51 16 d9 93 c9
CertUtil: -hashfile command completed successfully.
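The same comparison can be sketched in Python with the standard hashlib module, for anyone without CertUtil. The function names are hypothetical, and 1.jpg / 2.jpg are just the two files fetched with curl above.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files hash to the same SHA-256 digest."""
    return sha256_of(path_a) == sha256_of(path_b)
```

Calling same_content('1.jpg', '2.jpg') should return True if the Tumblr copy and the Instagram original are byte-identical, as the two hashes above suggest.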

Don't know about these new multi-page posts, though..

bbolli commented 7 years ago

There's nothing tumblr-backup can do about these kinds of JavaScript-infested posts. The basic problem is that each platform is building silos in which they try to keep their content only to themselves.

I guess your best bet is to write a patch to youtube-dl that can extract the images from the Instagram embeds.