bbolli / tumblr-utils

Utilities for dealing with Tumblr blogs, Tumblr backup
GNU General Public License v3.0
668 stars, 124 forks

Hacky fixes to archive likes #114

Open aspensmonster opened 5 years ago

aspensmonster commented 5 years ago

This doesn't actually have to be merged. It's more for anyone else who's looking to archive their years' worth of likes before Tumblr rm -rfs them into oblivion.

I intended to make it cleaner, but given Tumblr's two-week deadline before they delete all NSFW content --and I'm sure plenty of other content will get swept up in this accidentally-- I figured someone might find it useful in its present state.

The code is updated to archive all likes, with a ten-second pause between API calls to try to avoid hitting API quotas. The previous code would only get the 1000 most recent likes.
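
For the curious, the gist of the change is roughly this (a sketch, not the actual diff; API_KEY and BLOG are placeholders, and liked_posts/liked_timestamp are the fields the /v2/blog/.../likes endpoint returns):

# Sketch of paging likes by timestamp instead of offset (Python 2, like the script).
import json
import time
import urllib2

API_KEY = 'your-app-api-key'        # placeholder: register your own app
BLOG = 'some_blog_name.tumblr.com'  # placeholder
MAX_LIKES = 20                      # documented per-request maximum for /likes

def fetch_likes(before):
    url = ('https://api.tumblr.com/v2/blog/%s/likes?api_key=%s&limit=%d&before=%d'
           % (BLOG, API_KEY, MAX_LIKES, before))
    return json.load(urllib2.urlopen(url))['response']

before = int(time.time())
while True:
    resp = fetch_likes(before)
    posts = resp.get('liked_posts', [])
    if not posts:
        break
    # ... hand each post to the normal backup path here ...
    before = posts[-1]['liked_timestamp']  # page backwards from the oldest like seen
    time.sleep(10)                         # be gentle with the API quota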

Potential remaining problems:

I was going to get around to those problems "eventually", but my philosophy is that I can also download the json payloads and sort it all out... "eventually".

I've been running this script like so:

./tumblr_backup.py --likes --outdir=/home/some/path/tumblr_backup_likes/ --save-video --save-audio -j --exif='' some_blog_name

Edit: Be sure that the outdir is not where you have your own blog's posts backed up, or you'll overwrite the HTML. As far as this script is concerned, a backup of a blog's posts and a backup of a blog's likes both render to the same HTML layout, so if you want to keep both, put them in separate paths.

I've saved 70 GiB worth of historical likes this way. Not everything is saved. Some posts caused youtube-dl to choke (an annoying problem that never goes away, given the frenetic update cadence that most video sites seem to adhere to). Some videos that did download don't seem to want to play. And of course some of the liked content has been deleted over the years. But 70 GiB of mostly-good salvage of 30k+ likes is better than 0.

Obviously the likes have to be public. And if you want to gather everything --including NSFW-tagged posts, or whatever else it was; all I know is my original run didn't grab everything-- then the blog will need to be marked as containing sensitive content, at least while the script runs. I don't remember all of the specifics, just that a second pass after doing that yielded more results.

Also, I'd advise setting up your own app and using your own API key. Likes can only be iterated over 20 at a time, and if you have tens of thousands of likes, then you could conceivably go over hourly/daily limits. I don't know what quotas the public API key that the script is using has (maybe it's not rate limited?), but it'd probably be best to be a good neighbor and get your own API key to use with these tweaks.

vl09 commented 5 years ago

You're a legend!!! Just saved a few thousand posts and likes before everything goes down. You only forgot to define MAX_LIKES there but it works perfectly fine otherwise! Thank you!!

Hrxn commented 5 years ago

Woah, wait a sec..

I've heard about the issues Tumblr recently had with its App on Apple's Walled Garden (App Store), but this is new to me:

[..] but given Tumblr's two week deadline before they delete all NSFW content [..]

Is this official?

aspensmonster commented 5 years ago

You only forgot to define MAX_LIKES there but it works perfectly fine otherwise! Thank you!!

Because of course I did. Didn't pay quite enough attention when git add -p'ing the change (didn't want to accidentally spill my app's API key). It's there now.

aspensmonster commented 5 years ago

@Hrxn

From this link:

https://staff.tumblr.com/post/180758987165/a-better-more-positive-tumblr

So what’s next?

Starting December 17, 2018, we will begin enforcing this new policy. Community members with content that is no longer permitted on Tumblr will get a heads up from us in advance and steps they can take to appeal or preserve their content outside the community if they so choose. All changes won’t happen overnight as something of this complexity takes time.

Another thing, filtering this type of content versus say, a political protest with nudity or the statue of David, is not simple at scale. We’re relying on automated tools to identify adult content and humans to help train and keep our systems in check. We know there will be mistakes, but we’ve done our best to create and enforce a policy that acknowledges the breadth of expression we see in the community.

I'm reading "steps they can take to appeal or preserve their content outside the community" to mean that, eventually --perhaps not on the 17th two weeks from now, but eventually-- the content will be deleted. And yes, this kind of filtering "is not simple at scale." I'd argue it's not possible at scale. There's already plenty that's getting mis-flagged.

Hrxn commented 5 years ago

Damn. Thanks. Agreed, I read it in the same way. A real shame.

Laydmei commented 5 years ago

A little question from a beginner: I tried the command "./tumblr_backup.py --likes --outdir=/home/some/path/tumblr_backup_likes/ --save-video --save-audio -j --exif='' some_blog_name" but it does not work. How can I download all my likes? Because I can not download more than 1000 likes. ?_?

vl09 commented 5 years ago

A little question from a beginner: I tried the command "./tumblr_backup.py --likes --outdir=/home/some/path/tumblr_backup_likes/ --save-video --save-audio -j --exif='' some_blog_name" but it does not work. How can I download all my likes? Because I can not download more than 1000 likes. ?_?

Be sure to make your blog explicit first in the settings. Somehow, the code works better that way.

cebtenzzre commented 5 years ago

It "works better" because AFAIK a non-explicit blog will not publicly show explicit likes.

Laydmei commented 5 years ago

I have checked. I tried again. But it still does not download all my Likes. Maybe I have too many Likes? I really do not understand.

Doty1154 commented 5 years ago

Huh, yeah, I still get rate limited after the first couple hundred likes. I got like 40 gig down so far. Cries

Laydmei commented 5 years ago

Any idea of how to solve the problem? Please?

Hrxn commented 5 years ago

Did you make the code changes (as seen under Files changed)?

cebtenzzre commented 5 years ago

@Hrxn Lol, manually patching? Why not just clone aggroskater/tumblr-utils?

Hrxn commented 5 years ago

Yes, obviously. I meant to ask in order to verify whether you are indeed running the changed version. Did you verify the code changes?

KlfJoat commented 5 years ago

I just added this and it works. Great job!

allefeld commented 5 years ago

@aggroskater I used your version, and it works much better for likes than the original, thank you!

However, I still only successfully downloaded 5593 of 9626 likes, i.e. I'm missing 4033.

While downloading, the program listed 44 times "HTTP Error 403: Forbidden", most of them with URLs referring to the domain vtt.tumblr.com.
It listed 7 times "HTTP Error 404: Not Found", all of them referring to the domain 66.media.tumblr.com.
It listed 6 times "WARNING: Could not send HEAD request" with a URL beginning with https://www.tumblr.com/privacy/consent?redirect=. In these cases, messages followed saying that the program is "Falling back on generic information extractor." and "Unable to download video".

I understand that these 44 + 7 + 6 = 57 likes may simply be inaccessible (I tried some of the URLs manually and could verify that), but that accounts only for a tiny fraction of the 4033 likes that were skipped without warning.

Is there any chance you can fix this? If I can do something to help, please tell me.

adamamyl commented 5 years ago

While downloading, the program listed 44 times "HTTP Error 403: Forbidden", most of them with URLs referring to the domain vtt.tumblr.com.

@allefeld If your behaviour is like what I've seen elsewhere with 403 errors on videos, I think the videos no longer exist / are not accessible anymore – i.e. it's not a problem with this script or youtube-dl.

Hrxn commented 5 years ago

A lot of videos are indeed 'removed', or to be more exact, blocked. That is, 403 is the expected result in those cases. This was the big Tumblr video purge a while ago..

allefeld commented 5 years ago

@adamamyl @Hrxn you're probably right with your comments. However, that doesn't explain why 4033 - 57 = 3976 likes aren't downloaded and don't have a corresponding error message.

Doty1154 commented 5 years ago

If only it were possible to just download all of https://www.tumblr.com/likes while logged in, via the CLI.

allefeld commented 5 years ago

I now think these inconsistencies have nothing to do with @bbolli's or @aggroskater's code, but with the extremely weird and unreliable tumblr API.

I experimented a bit with the API myself. I first went through the list using the query/next value for the next request, and found that it skips over likes. I then used the field liked_timestamp from the last returned post, which worked a little better. I also experimented with the limit parameter, and found that for a small value (resulting in a lot of requests), at some point the API simply starts to return 0 posts, even though I know the requested time point has many likes before it. Mind you, no error message, just an "OK" response containing zero posts.
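
For anyone who wants to reproduce this, my probe was along these lines (not my exact script; BLOG, API_KEY and LIMIT are placeholders, and the field names are from the public API docs):

# Throwaway probe: page the likes by liked_timestamp with a small limit and
# report how many posts each request returns.
import json
import time
import urllib2

BLOG = 'some_blog.tumblr.com'   # placeholder
API_KEY = 'your-api-key'        # placeholder
LIMIT = 5                       # deliberately small, to force many requests

def likes_before(ts):
    url = ('https://api.tumblr.com/v2/blog/%s/likes?api_key=%s&limit=%d&before=%d'
           % (BLOG, API_KEY, LIMIT, ts))
    return json.load(urllib2.urlopen(url))['response']

ts, total = int(time.time()), 0
while True:
    resp = likes_before(ts)
    posts = resp.get('liked_posts', [])
    print('before=%d -> %d posts (liked_count says %s)'
          % (ts, len(posts), resp.get('liked_count')))
    if not posts:
        break
    total += len(posts)
    ts = posts[-1]['liked_timestamp']
print('retrieved %d posts in total' % total)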

I used bbolli's code, aggroskater's fork, https://github.com/javierarce/tumblr-liked-photos-export, and my own code, and I never arrive at the 9000+ likes; I get different numbers of recovered posts on different runs.

I'll be turning my back on tumblr soon, and just wanted to get my stuff out before the impending apocalypse. After both the social and technical blunders they've committed, I have to say: good riddance.

Sorry for venting. Thank you for your work!

cherryband commented 5 years ago

I made the same thing some months ago but never thought of opening a pull request here. My implementation is #165, and it resolves the first two issues @aggroskater has, and more. Hope it is helpful!

aspensmonster commented 5 years ago

I'll take a stab at incorporating @qtwyeuritoiy's work into my fork. Between the fixes for the first two issues, and the fact that a tag index feature is now upstream --I've already rebased onto latest upstream-- my original issues are resolved.

aspensmonster commented 5 years ago

I've got the pieces initially merged. It'll take a few hours to do a full grab and then test after liking some other posts.

Side note: it seems that the "mark as sensitive" feature, or whatever it's called, is... no longer available in the desktop website's settings. I can't find it anywhere. That might also be playing havoc with downloading likes, for all I know. I might break down and try the OAuth approach at some point tonight/tomorrow. But that's its own can of worms that'll entail pulling in some library that supports OAuth 1.0a's HMAC signing mechanism on the requests.
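
If I do go down that road, a library like requests-oauthlib could handle the signing. Roughly (nothing like this is in the fork yet; the keys and tokens below are placeholders from a registered app):

# Rough sketch of an OAuth 1.0a request against the user likes endpoint.
from requests_oauthlib import OAuth1Session

session = OAuth1Session(
    client_key='consumer-key',                 # placeholders from your registered app
    client_secret='consumer-secret',
    resource_owner_key='oauth-token',
    resource_owner_secret='oauth-token-secret',
)  # signs each request with HMAC-SHA1 by default

resp = session.get('https://api.tumblr.com/v2/user/likes', params={'limit': 20})
print(resp.json()['response']['liked_count'])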

aspensmonster commented 5 years ago

Had to adjust some things. Mainly, the incremental backup still wasn't working as I expected. Maybe I misunderstood the code from #165, but from what I can tell, ident_max was still operating on post identities, not the time-of-like. I bit the bullet and figured out a solution that works with the newer code. The UTC date that gets saved inside the individual post HTML files is now the time-of-like for "like" runs with the new code (I added a class attribute on the time element to indicate as much, too). So we can interrogate all of the saved files to find the latest liked-post date (sketched below) and use that as ident_max when supporting the incremental feature. And it looks like it works.

Granted, the API definitely seems to put the "eventual" in "eventual consistency", because I've seen it take upwards of a few hours to actually return the latest likes (i.e., I'd like a few things, run --incremental a few minutes later and get nothing, wait a few more minutes and get nothing, hang my head in defeat and go grab dinner, get back and run again, and yay! the latest likes are there). Not typically, though. More like five minutes usually.
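
The on-disk interrogation amounts to something like this (a sketch only; the 'liked-timestamp' class name and the datetime format are stand-ins for whatever the fork actually writes into the post HTML):

# Find the newest time-of-like among the saved post files.
import calendar
import glob
import os
import re
import time

TIME_RE = re.compile(r'<time[^>]*class="liked-timestamp"[^>]*datetime="([^"]+)"')

def latest_like_timestamp(post_dir):
    latest = 0
    for path in glob.glob(os.path.join(post_dir, '*.html')):
        with open(path) as fh:
            m = TIME_RE.search(fh.read())
        if m:
            t = time.strptime(m.group(1), '%Y-%m-%dT%H:%M:%SZ')
            latest = max(latest, calendar.timegm(t))  # treat the saved date as UTC
    return latest  # feed this into ident_max for incremental like runs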

The tweaks to the post.date property now cause the likes to be rendered to HTML in the order they were liked, as expected. And the additional header piece indicating the original author of the post is also included for like runs (for normal blog backups it's not included, of course).

The tag index feature is upstream, and I have my own approach for solving #144 that I'll probably include in my latest commit (can back it out later if needed).

And that's... all of the original deficiencies I had thought of at least.

Depending on what gets merged if/when, I can adjust this PR more should the maintainer want to merge the changes. I haven't done thorough testing of existing features yet though, so I wouldn't blindly merge.

I've kicked off another full backup for testing. And assuming it runs well, I'll push my latest code.

cherryband commented 5 years ago

@aggroskater Yes, I never touched the incremental backup part, so it makes sense that it doesn't operate on liked_timestamp. Great job on making the dates work. Another thing I'd like to see incorporated here is the original blog name in the header, as in #168.

aspensmonster commented 5 years ago

:+1: @qtwyeuritoiy Yep :D I have that bit in there too in my local code changes.

aspensmonster commented 5 years ago

My code for getting the latest timestamp was off --strftime('%s') passes through to the local C library implementation, which uses the local TZ instead of UTC-- but after patching that up, it looks like both a full backup and an incremental backup are working. This explains my earlier observations of the latest likes "taking forever". Serves me right for blaming the API instead of myself... As of now, I can typically sync the latest likes with --incremental within a couple of minutes of liking, at most.
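
For the record, the gist of the bug (illustrative, not the actual diff):

# strftime('%s') is not documented by Python; it falls through to the C library
# and is interpreted in the local timezone (and doesn't exist at all on some
# platforms). calendar.timegm treats the time tuple as UTC, which is what we want.
import calendar
from datetime import datetime

dt = datetime(2018, 12, 11, 20, 2, 35)   # a UTC wall-clock time from the API

wrong = int(dt.strftime('%s'))           # epoch computed in the local TZ
right = calendar.timegm(dt.timetuple())  # epoch with the tuple taken as UTC

print(wrong - right)  # your local UTC offset in seconds, e.g. -3600 on UTC+1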

I've got some final testing to do, but I'm reasonably confident that things are working. Going to push my latest changes to my fork now. I'll try to rebase on top of upstream's latest changes at some point tomorrow.

Edit: Judging by issue #167, some of the new code that tries to handle NPF stuff might be causing the script to fail. My fork/repo at present doesn't have the NPF handling, but it hasn't crashed/failed on me either. I'll keep an eye out for that issue after rebasing, and I'll try the stop-gap "throw the raw JSON in a <pre/> tag" measure if I run into the issue myself.
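
The stop-gap would be something along these lines (a sketch; nothing committed yet):

# Fallback for NPF posts the renderer doesn't understand: dump the raw payload
# into the page instead of failing. 'npf' stands for whatever JSON the API returned.
import json
from xml.sax.saxutils import escape

def npf_fallback_html(npf):
    return '<pre class="npf-raw">%s</pre>' % escape(json.dumps(npf, indent=2, sort_keys=True))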

allefeld commented 5 years ago

@aggroskater, I downloaded your version of tumblr-backup.py today (2018-12-11) and tried to use it incrementally (first complete download, then add new likes), and got the following error message:

$ tumblr_backup.py --dirs --save-video --save-audio --likes --outdir=likes -i xyz
Traceback (most recent call last):
  File "./code/tumblr_backup.py", line 1274, in <module>
    tb.backup(account)
  File "./code/tumblr_backup.py", line 568, in backup
    fh = open(f,'r')
IOError: [Errno 21] Is a directory: u'/home/ca/Store/lab/byetumblr/likes/posts/721441137749'

(blog name and post number changed for privacy)

To me it looks like incremental mode isn't working anymore?

aspensmonster commented 5 years ago

Thanks for helping test! I'm guessing this has to do with the directory-per-post feature (the --dirs flag). My tests didn't use that flag, and the code operates on the assumption that f is a file. I'll fiddle with it to make it work with the --dirs flag.
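
The fix will probably look something like this (a sketch only; 'post.html' is a guess at the per-directory filename, I'd have to check what --dirs actually writes):

# With --dirs each post is a directory, so open the HTML file inside it
# instead of the path itself.
import os

def open_post_html(path):
    if os.path.isdir(path):
        path = os.path.join(path, 'post.html')  # guessed filename
    return open(path, 'r')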

aspensmonster commented 5 years ago

@allefeld I did a quick and dirty test of the (possible) fix I just pushed to my fork. From what I can see, the incremental backup now works when using the --dirs option. Let me know if it works for you or not.

allefeld commented 5 years ago

@aggroskater yes works for me, thanks!

There's a long delay at the beginning, I guess while it checks the posts that are already present. Not really important, but some kind of progress report might be good.

cebtenzzre commented 5 years ago

So far this has been working fine for me with MAX_LIKES set to 50. EDIT: It looks like it's only getting ~23 posts every iteration... but that's still more than 20.

aspensmonster commented 5 years ago

I'm working on rebasing on upstream, but there are some lingering issues with NPF posts that I'd rather not commingle with this fork yet -- in particular issues #167 and #162. I'm not sure #162 is really fixed yet; I think I've got cases where NPF videos still aren't getting downloaded locally. I might spin off another branch on my fork to try to deal with them. The documentation on NPF is lengthy (https://www.tumblr.com/docs/npf).

As is, the fork is working well enough for backing up likes. I'd also like to figure out an OAuth way to do it, but time is running out, honestly.

I have MAX_LIKES at 20 based on the official API documentation. There are varying reports about just how many likes can make it through in a call, but I figure sticking to the official limit should be (somewhat) safer.

aspensmonster commented 5 years ago

I've got an alternate approach for dealing with the NPF stuff built locally. Rather than try to parse the NPF payloads in the data-npf attribute --various issues (#167, #172, #179) in this repo indicate that its format is not reliable-- I have the code get the video inline the same way that we get images inline for "text" posts.

Incidentally, in the process of building this, I found an existing bug in the inline image handler. The regex locks onto the data-orig-src attribute of an img tag instead of the src attribute. I've included a small fix for that too (though the real fix is to never parse HTML with regex). It doesn't look like it happens super often, but the theory is that the saved pages shouldn't contain any image elements still pointing at remote http links.
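
To illustrate the idea (these regexes are illustrative only, not the actual patch):

# The inline-image rewriter should key on the src attribute (what the browser
# loads), not data-orig-src.
import re

# buggy version: keys on data-orig-src
old_re = re.compile(r'<img[^>]+data-orig-src\s*=\s*["\']([^"\']+)["\']', re.I)
# fixed version: keys on a bare src attribute (the lookbehind rejects data-orig-src)
new_re = re.compile(r'<img[^>]+(?<![-\w])src\s*=\s*["\']([^"\']+)["\']', re.I)

html = '<img src="https://66.media.tumblr.com/abc/tumblr_inline_xyz_500.jpg"/>'
print(new_re.search(html).group(1))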

Assuming I can get a full archive to run to completion, and can get a few other blogs archived that have been mentioned across a couple issues in this repo, I'll push my latest rebased code.

aspensmonster commented 5 years ago

Ok. I've got a full archive to run to completion, no obvious flaws. I'm running once more without all of my messy debug output, but in the meantime, I've pushed my latest changes, rebased on upstream. Hopefully this addresses the various issues mentioned above (videos still not downloading, broken/incomplete tag indices, missing type/subtype in NPF post problems...).

aspensmonster commented 5 years ago

I've merged current upstream/master into my master (trying to preserve commit references now that they're in upstream commits; otherwise I would have rebased). I'm running another backup, but otherwise I think the code is good enough to merge at this point. At least, my original hang-ups have all been tackled in the past few days:

Besides adding support for archiving likes (both full and incremental, with and without the --dirs option), it also uses a sha256 hash of the tag to make the folder name for the tag (nothing but letters and numbers, and not too long a filename), which should address the issues in #140 and #183. The trade-off is that the folder name isn't "friendly", but I think trying to make friendly representations of some of the crazy tags on Tumblr --that also conform to the varying filename restrictions put in place by different OSes and filesystems-- is an uphill battle, and if you're only using the rendered HTML tag index, the sha256 folder name isn't a concern.
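
Concretely, the folder name is derived roughly like this (the exact call in my code may differ slightly, but this is the idea):

# A hex sha256 of the tag: short, ASCII-only, and valid on any filesystem.
import hashlib

def tag_dir_name(tag):
    return hashlib.sha256(tag.encode('utf-8')).hexdigest()

print(tag_dir_name(u'some ridiculously long tag / with "unsafe" characters'))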

@bbolli Let me know what you think.

aspensmonster commented 5 years ago

Hmm. Might have found a problem, might not. I'll update in a bit.

aspensmonster commented 5 years ago

Alright. Two small commits later and it looks like I'm back in business, at least from a cursory glance at the posts/ directory as I'm running another full backup.

beret commented 5 years ago

Be sure to make your blog explicit first in the settings. Somehow, the code works better that way.

Side note: it seems that the "mark as sensitive" feature, or whatever it's called, is... no longer available in the desktop website's settings. I can't find it anywhere. That might also be playing havoc with downloading likes, for all I know. I might break down and try the OAuth approach at some point tonight/tomorrow. But that's its own can of worms that'll entail pulling in some library that supports OAuth 1.0a's HMAC signing mechanism on the requests.

Edit: this appears to raise the number of likes listed in the topline count, but does not actually allow those likes to be seen and saved.

I found a way to work around the removal of this from the web UI.

Go to your blog settings, e.g. https://www.tumblr.com/settings/blog/YOURUSERNAME

Just inspect one of the form toggles that is already off (I used "show only on dashboard") and replace the for value with tumblelog_flagged_as_nsfw, as in the picture:

[inspector screenshot, 2018-12-15]

Then go back to the dash, and toggle the control.

[dashboard controls screenshot]

It saved (I could see a 200 response from the API) and I was then able to see my full number of posts when querying.

<3

ann4belle commented 5 years ago

@beret for whatever reason, your hack doesn't seem to be working anymore. Tried it in incognito mode (to unload all extensions, etc) to no avail.

aspensmonster commented 5 years ago

The toggle option is visible in the Android app at least, though it appears non-functional. Toggling "Account > Visibility > $blog_name is explicit" to on doesn't cause the blog to become masked with the explicit screen for non-logged-in users. Exiting the visibility page of the Android app and returning to it shows that the setting is back off.

¯\_(ツ)_/¯

ann4belle commented 5 years ago

Yeah, I think they may have disabled it server-side, meaning nothing we do can change it. For some reason though using the base tumblr_backup gets me far more likes than using this version. Not sure if it uses a different way of accessing the API that lets it download more likes or what, but honestly I think the best bet is to swap to OAuth and use that.

aspensmonster commented 5 years ago

For some reason though using the base tumblr_backup gets me far more likes than using this version

By "base tumblr_backup", do you mean the tumblr_backup.py upstream? Or my fork? Because my fork was originally inspired by the fact that, as is, the upstream tumblr_backup.py stops at ~1000 likes (due to a tumblr API limitation when using offset parameters on the /likes endpoint).

ann4belle commented 5 years ago

The upstream one downloads 1471 of my likes, but yours only gets 366 for some reason. The total is 1484 according to both versions and Tumblr's website.

aspensmonster commented 5 years ago

Strange. On my own blog with 36k+ likes I only get around 1000 with upstream, but with my fork I get 28k. I figure the difference comes from deleted posts and explicit ones. I can look into it, but I'd need the full command that was run along with the blog name.

beret commented 5 years ago

@Code-You-Fools Yup, it changes the headline number, but the same number of posts gets gathered in the end. (Edited my post above.)

ann4belle commented 5 years ago

The command I ran was "python2 tumblr_backup.py --save-audio --save-video -l -S -O ~/test-backup just-bunbun-things"

It isn't marked explicit, which is probably why yours isn't getting all of the likes, but I'm not sure why the upstream version is grabbing more. It's not just a fluke either; they're actually downloaded and everything. I used -S because it sometimes returns an issue with SSL verification.

aspensmonster commented 5 years ago

The command I ran was "python2 tumblr_backup.py --save-audio --save-video -l -S -O ~/test-backup just-bunbun-things"

Ok. I'm able to replicate that behaviour at least:

$ ls posts/ | wc -l
366

I'll try to dig in and see what's going on. My first guess is either more wonkiness with API responses like @allefeld was running into, or some other fault condition that's causing backups to cease and go straight to indexing the stuff that's on-disk.

aspensmonster commented 5 years ago

Alright. I've run with current upstream master (well, commit 85b4ce3). The output says that it backed up 1500 or so likes, but what actually gets archived is less:

$ python2 tumblr_backup.py --save-audio --save-video -l -S -O ~test just-bunbun-things
just-bunbun-things: 1471 posts backed up 
$ ls ~test/posts/ | wc -l
290