bbolli / tumblr-utils

Utilities for dealing with Tumblr blogs, Tumblr backup
GNU General Public License v3.0

Only 50 posts are backed up using --incremental #223

Open jorong1 opened 3 years ago

jorong1 commented 3 years ago

Version in use: https://github.com/bbolli/tumblr-utils/commit/f8ae83d4c53b1b84f1fbbead4dc51ca117414aad

Command: python2.7 tumblr-utils/tumblr_backup.py --incremental --save-video-tumblr --no-ssl-verify --save-audio --json BLOGNAME

Output: 50 posts backed up

I am backing up into a folder that already contains post, JSON, media, archive, and index files. Without --incremental it grabs posts no problem, but it will probably want to grab all the posts, which I don't want. I do want the archive and index.html files generated, but cancelling a mass run doesn't produce them.

My last successful --incremental backup was at the end of September, using https://github.com/bbolli/tumblr-utils/commit/08cbe4427e6b0d3735433d68c3fb651380bec31e with my own API key.

I use --incremental to avoid overwriting existing older posts and media, and to avoid going too far back. If I don't use it, will the script just stop once it finds existing files? I could use --count this time to update, but I'm sure it'll still only grab 50 posts with --incremental in the future.

I have my own API key and I'm using the latest git version of this project. I also tried https://github.com/Cebtenzzre/tumblr-utils/commit/e5537c02bbe0c5c93a561de16d4f7ea65d5bf436 and got the same 50-post output. I don't know what to do here, thanks!

cebtenzzre commented 3 years ago

At the time you ran the incremental backup, were there really more than 50 new posts? --incremental stops reading posts from the API as soon as it sees a post that has already been backed up.
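
Roughly, the stopping rule looks like this (a simplified sketch of the idea, not the actual tumblr_backup.py code; save_post and ident_max are illustrative names):

def backup_incremental(api_posts, ident_max, save_post):
    # api_posts yields posts newest-first, as the Tumblr API returns them;
    # ident_max is the highest post ID already present in the backup folder
    for post in api_posts:
        if int(post['id']) <= ident_max:
            break  # reached an already-backed-up post: stop reading the API
        save_post(post)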

Without --incremental, you'll have to wait for the script to read the entire blog, but it will download all missing posts regardless of what is already backed up. It will overwrite all post HTML unless you use --no-post-clobber on my fork. bbolli's version will never redownload media; on my fork, --no-post-clobber skips downloading media for existing posts and --timestamping prevents redownloading media if the file on disk matches the server.
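
The difference with --no-post-clobber, sketched in the same illustrative style (the posts/<id>.html layout and the function names are assumptions, not the real code):

import os

def backup_no_clobber(api_posts, backup_dir, save_post):
    # reads the entire blog, newest-first, and skips any post whose HTML
    # already exists on disk -- so it can fill gaps anywhere in the history
    for post in api_posts:
        path = os.path.join(backup_dir, 'posts', str(post['id']) + '.html')
        if os.path.exists(path):
            continue  # existing post: skip it (and its media), keep scanning
        save_post(post)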

If you at any point just want to regenerate the archive and index.html, use my fork and pass --count 0.
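
For example (BLOGNAME standing in for your blog, as above):

$ python2 tumblr_backup.py --count 0 BLOGNAME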

jorong1 commented 3 years ago

I got sidetracked, sorry, but this is still not working as it did before. I can confirm there are more than 50 new posts available. The last update I did was 2020-09-03, and since then there have been a few thousand new posts on this specific blog.

--incremental works as you describe: it only does 50 posts, and if I try again, it does nothing. The backup on 2020-09-03 was done with bbolli's tools, but then I tried an incremental run and I think it may have stopped badly. I cleaned up the files as well, though, so that should have been before whatever update I did.

I just saw the update https://github.com/Cebtenzzre/tumblr-utils/commit/8417f946a44220b0a5b7fd6e11606ce760c89536

And --no-post-clobber works: I am backing up beyond 50 posts now. I made a backup of the folder beforehand, but hopefully it doesn't overwrite the media, because it's over 50 GB.

So now I guess I'd need to run without --incremental but with --no-post-clobber and --timestamping, which would basically go through the entire blog again (around 70,000 posts) so that the index and everything get updated, and then I guess I can do incremental runs beyond 50 posts?

So where do I go from here for the future? I don't want to download the whole blog again every time, and there are always more than 100 posts.

Can I grab only what is missing and regenerate the index and everything somehow? It's weird that incremental stopped working like it did before; it used to grab hundreds of posts. I have my own API key.

cebtenzzre commented 3 years ago

I forgot to mention that --no-post-clobber was only available on my fork's experimental branch. I just moved it to the main branch, so a fresh clone of my fork will have that option now. In this case you need it to make a quick non-incremental backup: --incremental stops when it sees familiar posts, while --no-post-clobber just skips them (and should go far enough to discover posts that were never backed up). In theory, --no-post-clobber has no effect if you are already using --incremental, and it won't change the behavior of future incremental backups, but it will find posts that were missed by previous ones.
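
Concretely, the one-time catch-up followed by routine updates would look like this (BLOGNAME as a placeholder, using only the flags discussed above):

$ python2 tumblr_backup.py --no-post-clobber --timestamping BLOGNAME
$ python2 tumblr_backup.py -i BLOGNAME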

Assuming none of your previous incremental backups were interrupted, it sounds like --incremental could actually be malfunctioning. But it relies on such a simple and seemingly guaranteed property of blogs (older posts have lower IDs, and the API returns the newest posts first) that I would need solid evidence to believe that. It would help to have the name of the blog you're backing up, but if you'd prefer not to share it, I could write a script that lists the post IDs and dates from a backup; you could run it after a backup without --incremental to see whether the ordering is broken in a way that confuses --incremental.
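
Something along these lines would do it (assuming the backup was made with --json, so there is a json/ folder with one file per post containing the API's 'id' and 'timestamp' fields):

import json, os, sys, time

# usage: python list_posts.py BACKUP_FOLDER
json_dir = os.path.join(sys.argv[1], 'json')
posts = []
for name in os.listdir(json_dir):
    if name.endswith('.json'):
        with open(os.path.join(json_dir, name)) as f:
            data = json.load(f)
        posts.append((int(data['id']), int(data['timestamp'])))

# if IDs and dates are consistent, sorting by ID (descending) should also
# give a descending date order -- any inversion is what would confuse -i
for ident, ts in sorted(posts, reverse=True):
    print(ident, time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(ts)))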

slowcar commented 3 years ago

We have reproduced the issue twice: the -i parameter takes only the 50 most recent posts into account. We try to run the script daily now, but a fix would be welcome just in case.

cebtenzzre commented 3 years ago

@jorong1 @slowcar If I run these commands, I get the expected result:

$ python2 tumblr_backup.py -s 175 -n 20 just-art
just-art: 20 posts backed up
$ python2 tumblr_backup.py -i just-art
just-art: 175 posts backed up

The first command makes a small backup that is 175 posts out of date, and the second command backs up the missing posts using --incremental; more than 50 posts are backed up. This is probably a blog-specific issue, so it would be helpful if someone could provide an example of a blog where, after running these two commands in a clean directory, the second command reports "50 posts backed up".

jorong1 commented 3 years ago

Hi. I finally did another blog backup with https://github.com/Cebtenzzre/tumblr-utils/commit/ce10f296a20fea59d053fd66089901a7049673f1 and it seems incremental is working properly now.

(py3venv-cebtenzzre) $ python tumblr-utils-cebtenzzre/tumblr_backup.py --incremental --no-ssl-verify --save-audio --save-video-tumblr --json --no-post-clobber --timestamping myblog
myblog: 2851 posts backed up

My process was to do a full non-incremental backup like @Cebtenzzre recommended, followed by an incremental one. This "reset" whatever state I was stuck in; the new incremental run then completed without issue.

I don't think the issue should be closed, because @slowcar is still having a problem with it, so something else must be going on; but I am good now, so I don't mind if it gets closed. I can't provide an example blog because the issue happened on my personal blog, which I don't feel comfortable sharing.

A possible solution here would be to use Cebtenzzre's fork if you're having issues; I don't know if that's appropriate to recommend.

jorong1 commented 3 years ago

I was hoping I wouldn't have to post this, but I am now getting the same error again. I'll try a full redo, which sounds like the proper solution here, but I got the same 50-post error and I know there's a gap:

Newest post from last run: 01/24/2021 11:32:24 AM
Oldest post from current run: 02/05/2021 08:55:41 AM

That's a good couple of weeks of content missing there. I know this can't be reproduced; I'm just noting that it's still happening. Sorry to raise hopes.