bbolli / tumblr-utils

Utilities for dealing with Tumblr blogs, Tumblr backup
GNU General Public License v3.0
667 stars 124 forks

Backup of likes doesn't stop at expected count #217

Open arete06 opened 3 years ago

arete06 commented 3 years ago

I just installed tumblr-utils and tried to download all my likes, but it just keeps running even after passing the expected number of posts. Example:

blogname: Getting posts 1050 to 1099 (of 1410 expected)

blogname: Getting posts 1100 to 1149 (of 1410 expected)         

blogname: Getting posts 1150 to 1199 (of 1410 expected)        

blogname: Getting posts 1200 to 1249 (of 1410 expected)       

blogname: Getting posts 1250 to 1299 (of 1410 expected)        

blogname: Getting posts 1300 to 1349 (of 1410 expected)        

blogname: Getting posts 1350 to 1399 (of 1410 expected)           

blogname: Getting posts 1400 to 1449 (of 1410 expected)              

blogname: Getting posts 1450 to 1499 (of 1410 expected)           

blogname: Getting posts 1500 to 1549 (of 1410 expected)           

blogname: Getting posts 1550 to 1599 (of 1410 expected)              

blogname: Getting posts 1600 to 1649 (of 1410 expected)             

blogname: Getting posts 1650 to 1699 (of 1410 expected)

cebtenzzre commented 3 years ago

If it's caused by a regression, I suspect commit 4961a2f64a8fe17fae2afb412b4222d1202a45a2 (script here) would work better for you. But chances are https://github.com/aggroskater/tumblr-utils still has better support for likes.

cebtenzzre commented 3 years ago

It's possible that this condition is failing: https://github.com/bbolli/tumblr-utils/blob/da3370c157cc40449c852240fdb2fd43cd848020/tumblr_backup.py#L223

You could try replacing that line with this more verbose code:

if doc.get('meta', {}).get('status', 0) != 200:
    sys.stderr.write('API response has non-200 status:\n{}\n'.format(doc))
    return None
return doc
bbolli commented 3 years ago

I don't see what the more verbose code does differently, except for printing the document when the response code is not 200. How would this help in this situation?

arete06 commented 3 years ago

> If it's caused by a regression, I suspect commit 4961a2f (script here) would work better for you. But chances are https://github.com/aggroskater/tumblr-utils still has better support for likes.

I tried this commit and it did indeed finish and created the html file. However, it clearly did not save all my likes so it's still not working for me.

arete06 commented 3 years ago

Actually, it now seems more likely to me that this condition is failing:

https://github.com/bbolli/tumblr-utils/blob/da3370c157cc40449c852240fdb2fd43cd848020/tumblr_backup.py#L223

~~Chances are your likes are private (not yet supported, see #128), but maybe something else is going on.~~ (edit: this would yield an HTTP Error 401, so it wouldn't explain this). You could try replacing that line with this more verbose code:

if doc.get('meta', {}).get('status', 0) != 200:
    sys.stderr.write('API response has non-200 status:\n{}\n'.format(doc))
    return None
return doc

I actually have some code like that in my own version of the script.

This did not work either.

cebtenzzre commented 3 years ago

@bbolli If that None return was causing the backup loop to "try the next batch", it would be good to know what the response status was, for debugging reasons (even if the HTTP status was 200). But I forgot that the loop skips single posts now, so that wouldn't cause this.

cebtenzzre commented 3 years ago

@sldx12 If the older commit worked for you then try the latest version of tumblr-utils, which has a potential fix for this issue. If neither gets all your likes then aggroskater's might -- it walks them by timestamp instead of offset.

arete06 commented 3 years ago

@Cebtenzzre none of those worked. The older commit still has the same bug, and aggroskater's fork doesn't get all likes.

cebtenzzre commented 3 years ago

Does the latest version (download it fresh from GitHub or update your clone if you made one) still try to download more likes than expected? If that's fixed, we can close this issue and open a new one for not downloading all of the likes.

arete06 commented 3 years ago

Yes, the latest version still tries to download more likes than expected.

cebtenzzre commented 3 years ago

I can't reproduce the issue on a test blog with ~30 likes - I thought I could at one point, but I realized I didn't have enough likes to prove my theory.

@bbolli I have a suspicion this is because of 29e4c8497df2cf2c7bc4c06e60a080a836b84af3 being effectively reverted by dd40a8894cda36444154670fcc913887127bb6bb. Did you ever find that commit to be necessary? If fewer than MAX_POSTS likes are being backed up at a time (maybe they started selectively enforcing the 20 post limit?), the backed-up count could read as high as 3525 before it stops.

@sldx12 You could test this by adding this line before i += MAX_POSTS on the latest version:

print len(posts)
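For context, the overshoot in the log can be reproduced with a toy model (Python 3; `fake_api` and the stop condition are assumptions about the script's control flow, not its actual code). If the stop condition counts backed-up posts against the expected total while the offset still advances by the full MAX_POSTS, short pages make the offset run well past the expected count:

```python
MAX_POSTS = 50
EXPECTED = 1410

def fake_api(offset, limit=MAX_POSTS):
    # Stand-in for the likes endpoint: always returns fewer posts
    # than requested (43, matching the debug output above).
    return [None] * 43

i = backed_up = 0
while backed_up < EXPECTED:   # stop once we *think* we have them all
    posts = fake_api(i)
    backed_up += len(posts)
    i += MAX_POSTS            # offset still advances by the full 50

print(i, backed_up)  # → 1650 1419
```

The final offset of 1650 matches the last "Getting posts 1650 to 1699" line in the original report, which fits the short-page theory.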
arete06 commented 3 years ago

@Cebtenzzre I'm not sure if this is what you asked for, but here's the result I got:

blogname: Getting posts 0 to 49 (of 1410 expected) 41
blogname: Getting posts 50 to 99 (of 1410 expected) 35
blogname: Getting posts 100 to 149 (of 1410 expected) 36
blogname: Getting posts 150 to 199 (of 1410 expected) 39
blogname: Getting posts 200 to 249 (of 1410 expected) 42
blogname: Getting posts 250 to 299 (of 1410 expected) 37
blogname: Getting posts 300 to 349 (of 1410 expected) 38
blogname: Getting posts 350 to 399 (of 1410 expected) 39
blogname: Getting posts 400 to 449 (of 1410 expected) 37
blogname: Getting posts 450 to 499 (of 1410 expected) 38
blogname: Getting posts 500 to 549 (of 1410 expected) 39
blogname: Getting posts 550 to 599 (of 1410 expected) 39
blogname: Getting posts 600 to 649 (of 1410 expected) 30
blogname: Getting posts 650 to 699 (of 1410 expected) 36
blogname: Getting posts 700 to 749 (of 1410 expected) 36
blogname: Getting posts 750 to 799 (of 1410 expected) 42
blogname: Getting posts 800 to 849 (of 1410 expected) 40
blogname: Getting posts 850 to 899 (of 1410 expected) 41
blogname: Getting posts 900 to 949 (of 1410 expected) 46
blogname: Getting posts 950 to 999 (of 1410 expected) 31
blogname: Getting posts 1000 to 1049 (of 1410 expected) 43
blogname: Getting posts 1050 to 1099 (of 1410 expected) 43
blogname: Getting posts 1100 to 1149 (of 1410 expected) 43
blogname: Getting posts 1150 to 1199 (of 1410 expected) 43
blogname: Getting posts 1200 to 1249 (of 1410 expected) 43
blogname: Getting posts 1250 to 1299 (of 1410 expected) 43
blogname: Getting posts 1300 to 1349 (of 1410 expected) 43
blogname: Getting posts 1350 to 1399 (of 1410 expected) 43
blogname: Getting posts 1400 to 1449 (of 1410 expected) 43
blogname: Getting posts 1450 to 1499 (of 1410 expected) 43
blogname: Getting posts 1500 to 1549 (of 1410 expected) 43

cebtenzzre commented 3 years ago

Yeah, that's what I wanted to see. I see two problems:

  1. You are not retrieving all of your likes, either because Tumblr's API won't give them to you, or because the script is skipping them. Fewer than 50 posts per response could explain either.
    • Potential fix (try this): Replace i += MAX_POSTS with i += len(posts) and see if more posts are backed up this way (by number of files in the posts folder). This would align with the older commit's behavior but is not how the API is supposed to work.
  2. len(posts) gets stuck at 43. Maybe this indicates no new posts? If you have only ~805 files in your posts folder and not ~1235, that's more evidence of this. The older commit compared the total len(posts) against the expected count, but if the API gets stuck that limit can no longer break the cycle.
    • Potential fix (try this): Track (or even use) _links since it works for the aggroskater fork. Put some code before posts = _get_content(soup) so it looks like this:
      try:
          print '\nnext before is {}'.format(soup['response']['_links']['next']['query_params']['before'])
      except KeyError:
          print '\nno next before, should probably stop'
      posts = _get_content(soup)
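The idea behind the second fix can be sketched as a self-contained toy (Python 3; `fetch_likes` and its response shape are made up here, loosely modeled on the v2 API's `_links.next.query_params.before` field): walk the likes by timestamp instead of offset, and stop when the API no longer supplies a "next before" pointer.

```python
# Toy data: each like has a timestamp; pages are served newest-first.
LIKES = [{'id': n, 'liked_timestamp': 2000 - n} for n in range(130)]

def fetch_likes(before=None, limit=50):
    """Hypothetical endpoint: returns up to `limit` likes older than
    `before`, plus a `_links.next` pointer while more remain."""
    pool = [p for p in LIKES if before is None or p['liked_timestamp'] < before]
    page = pool[:limit]
    resp = {'liked_posts': page}
    if len(pool) > limit:
        resp['_links'] = {'next': {'query_params':
            {'before': page[-1]['liked_timestamp']}}}
    return resp

# Walk by timestamp instead of offset.
before, got = None, []
while True:
    resp = fetch_likes(before)
    got.extend(resp['liked_posts'])
    try:
        before = resp['_links']['next']['query_params']['before']
    except KeyError:
        break  # no "next before": we really are done

print(len(got))  # → 130
```

Because each request asks for likes older than a timestamp rather than at a numeric offset, this scheme is unaffected by any server-side offset cap.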
arete06 commented 3 years ago
  1. Potential fix gives the following: https://pastebin.com/izfAibgV
  2. Potential fix gives the following: https://pastebin.com/FKkwcc7u
cebtenzzre commented 3 years ago

@sldx12 Try the script from PR #219 and see if it stops on its own. Also, sorry if I wasn't clear: The first potential fix exists because you said "it clearly did not save all my likes" - I want to know if that change will allow either the older commit or my PR to save more (or all) of your likes.

arete06 commented 3 years ago

@Cebtenzzre Oh, sorry. I don't think the first potential fix saved all my likes. It's hard to tell because I have to stop the script manually, so it doesn't generate the HTML file. However, I looked at the media folder and it didn't look like it had all the likes.

The script from PR #219 stopped on its own, gave the following output and did not save all my likes: https://pastebin.com/ktJPmL89

cebtenzzre commented 3 years ago

Leave this issue open so it will be closed when (if?) the PR is merged. The remaining problem of not all likes being downloaded, even with the PR, is probably issue #118, so discuss that there. According to that issue, the offset parameter is limited to 1,000 for likes, which explains why anything past offset=1000 is the same as offset=1000; I had forgotten about this, as I use aggroskater's fork for likes anyway.
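That cap can be illustrated with a toy server (Python 3; the clamping behavior is an assumption based on issue #118, not documented API semantics). Any likes past offset 1000 become unreachable by offset paging, and a walker can at least detect the cap by noticing a repeated page:

```python
LIKES = list(range(1300))  # pretend post IDs, 1300 total likes

def fake_likes_endpoint(offset, limit=50):
    # Assumed behavior per issue #118: offsets beyond 1000 are
    # clamped, so every later request returns the same page.
    start = min(offset, 1000)
    return LIKES[start:start + limit]

# Past the cap, every page is identical:
assert fake_likes_endpoint(1000) == fake_likes_endpoint(1450)

# An offset-based walker can stop by noticing the repeat:
seen, offset = set(), 0
while True:
    page = fake_likes_endpoint(offset)
    if not page or page[0] in seen:
        break  # empty page or repeated page: stop
    seen.update(page)
    offset += 50

print(len(seen))  # → 1050: everything past offset 1000 is unreachable
```

Detecting the repeat only stops the runaway loop; actually retrieving the older likes still requires timestamp-based paging, as in aggroskater's fork.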

If not even aggroskater's fork backs up all of your likes, feel free to open a new issue.