TumblThreeApp / TumblThree

A Tumblr and Twitter Blog Backup Application
https://TumblThreeApp.github.io
MIT License
624 stars · 75 forks

Twitter Rate Limit is still an issue #212

Closed · T-prog3 closed this issue 2 years ago

T-prog3 commented 2 years ago

Describe the bug
I am still being rate limited on the Twitter API, with a suggestion to lower the connections in Settings. This, however, makes no difference at all. I tried as low as 10 for "Number of connections in 60s" with only 1 concurrent connection. To my understanding of the Twitter API rate limits (https://developer.twitter.com/en/docs/twitter-api/rate-limits), this shouldn't be an issue.

This also raises the question of whether the settings only affect the Tumblr API. Should Tumblr and Twitter really be treated under the same settings and name?

And shouldn't there also be a way to authenticate a Twitter account? That would allow you to crawl users who only allow followers.


thomas694 commented 2 years ago

Today, right at this moment, it's working (here). Had you already downloaded a bit when you got the error? Then it would be a real "limit exceeded". Or your current IP may be blocked for some reason. Or they are rolling out a change that you already see and other regions will see soon.

The settings affect the crawlers for both. The default settings are, in absolute terms, a bit too high for Twitter's API limits, but they work for normal crawling/downloading because of the time spent between requests. There are no separate settings yet; maybe they will be needed in the future.

There is room for improvements. Contributions are welcome.
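For readers unfamiliar with the "Number of connections in 60s" setting: it behaves like a sliding-window rate limiter. Here is a minimal Python sketch of that idea; the class and parameter names are illustrative, not TumblThree's actual (C#) implementation:

```python
import collections
import time


class WindowRateLimiter:
    """Allow at most `max_requests` calls per `window_seconds` sliding window."""

    def __init__(self, max_requests: int, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.timestamps = collections.deque()  # times of recent requests

    def acquire(self) -> None:
        """Block until another request is allowed, then record it."""
        now = time.monotonic()
        # Drop timestamps that have left the window.
        while self.timestamps and now - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_requests:
            # Sleep until the oldest request falls out of the window, re-check.
            time.sleep(self.window_seconds - (now - self.timestamps[0]))
            return self.acquire()
        self.timestamps.append(time.monotonic())


# e.g. the reporter's setting: at most 10 requests per 60 seconds
limiter = WindowRateLimiter(max_requests=10)
```

Note that such a client-side limiter only helps if its window matches the server's; Twitter enforces its own per-endpoint windows regardless of what the client is configured to do.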

T-prog3 commented 2 years ago

I have actually been trying to update one user that was already downloaded once, two weeks ago. The error happens within the first minute of running, during "Evaluated N tumblr posts out of N total posts". It doesn't download anything new, and then I get: Error 1: Limit exceeded: username You should lower the connections to tumblr API in the Settings -> Connection pane.

Then I get the message "waiting until date/time", but at that time it only pushes the date/time forward and doesn't make any progress, even after 1 hour. So there appears to be no way to make a complete update of already downloaded users (as of today). My Last Complete Crawl will stay stuck at 2022-01-20.

thomas694 commented 2 years ago

Please open this blog in the browser and tell me when the first two posts were posted. Do you have "force rescan" enabled in this blog's settings? What is the value of LastId in this blog's index file?

T-prog3 commented 2 years ago
  1. The two latest tweets were both posted on Feb 3. There are around 60 tweets since the Last Complete Crawl, and the user has a total of 8,796 tweets.
  2. Force rescan is not enabled. However, I still think the software acts as if this setting were enabled. It always shows "Evaluated 3500 tumblr posts out of 8,796 total posts" when the limit is exceeded.
  3. 1483991637554614277

thomas694 commented 2 years ago

At the moment I don't have a clue why it's crawling that much on this blog. Do you have a value in the blog's "download pages" setting?

T-prog3 commented 2 years ago

No, I have almost everything on default settings. The only things I have changed in the software are:

  • General: Active portable mode: Enabled
  • Connection: Concurrent connections: 1; Concurrent video connections: 1
  • Limit Tumblr API connections: Number of connections: 30; Limit Tumblr SVC connections: Number of connections: 30
  • Blog: Download reblogged posts: Disabled; Image size (category): Large; Video size (category): Large

thomas694 commented 2 years ago

It seems some error occurs during the crawl process that keeps it from updating LastId to the newest post. You could have a look at the TumblThree.log file to see whether there is a hint/error there.

T-prog3 commented 2 years ago

This is the error in TumblThree.log:

```
You should lower the connections to the tumblr api in the Settings->Connection pane., System.Net.WebException: The remote server returned an error: (429) Too Many Requests.
   at System.Net.HttpWebRequest.EndGetResponse(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at TumblThree.Applications.Extensions.TaskTimeoutExtension.d__0`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at TumblThree.Applications.Services.WebRequestFactory.d__12.MoveNext() in C:\projects\Tumblthree\src\TumblThree\TumblThree.Applications\Services\WebRequestFactory.cs:line 129
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at TumblThree.Applications.Crawler.TwitterCrawler.d__25.MoveNext() in C:\projects\Tumblthree\src\TumblThree\TumblThree.Applications\Crawler\TwitterCrawler.cs:line 257
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at TumblThree.Applications.Crawler.TwitterCrawler.d__24.MoveNext() in C:\projects\Tumblthree\src\TumblThree\TumblThree.Applications\Crawler\TwitterCrawler.cs:line 236
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at TumblThree.Applications.Crawler.TwitterCrawler.d__28.MoveNext() in C:\projects\Tumblthree\src\TumblThree\TumblThree.Applications\Crawler\TwitterCrawler.cs:line 339
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at TumblThree.Applications.Crawler.TwitterCrawler.d__30.MoveNext() in C:\projects\Tumblthree\src\TumblThree\TumblThree.Applications\Crawler\TwitterCrawler.cs:line 364
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at TumblThree.Applications.Crawler.TwitterCrawler.d__33.MoveNext() in C:\projects\Tumblthree\src\TumblThree\TumblThree.Applications\Crawler\TwitterCrawler.cs:line 456
```
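(Editorial note: 429 is the standard HTTP "Too Many Requests" status. A generic client typically handles it by waiting and retrying, which is essentially what TumblThree's "waiting until date/time" message does. A minimal Python sketch of that pattern, with a placeholder exception standing in for a real HTTP response; this is not TumblThree's actual code:)

```python
import time


class TooManyRequests(Exception):
    """Stands in for an HTTP 429 response from the server."""

    def __init__(self, retry_after: float = 1.0):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after  # seconds suggested by the server


def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Call fetch(); on a 429, wait and retry with exponential backoff.

    Waits the larger of the server-suggested Retry-After and the current
    backoff delay, doubling the delay after each failed attempt.
    """
    delay = base_delay
    for attempt in range(max_retries):
        try:
            return fetch()
        except TooManyRequests as exc:
            if attempt == max_retries - 1:
                raise  # give up after max_retries attempts
            time.sleep(max(exc.retry_after, delay))
            delay *= 2
```

The catch, as described in this thread, is that waiting only helps when the limit window actually resets; if the server keeps pushing the reset time forward, no amount of client-side backoff makes progress.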

thomas694 commented 2 years ago

This blog downloads without problems here. Even if I try to emulate your situation by adapting the settings and blog file accordingly, it downloads the posts until the one from last time and then stops. I don't know what the difference to your system could be.

You could back up the blog's download folder and its two blog files. Then you can add the blog again and see whether it works again and downloads the missing new posts. Later you can close the app and merge the backed-up files and the already-downloaded entries in "blog"_files.twitter from the copy into the current one (just all entries; a few duplicates are OK).

T-prog3 commented 2 years ago

Report from start to end:

  1. I just downloaded the latest TumblThree-v2.5.1-x64-Application.zip.
  2. Unzipped it and opened TumblThree.exe.
  3. Without changing any default settings at all, I added some random users with a large number of tweets (5000+).
  4. Enqueued all added users and pressed Crawl.
  5. It started to download files from the first user.
  6. Got 4151 files (3944 videos/images + texts.txt).
  7. Then the error occurred: Error 1: Limit exceeded: username You should lower the connections to tumblr API in the Settings -> Connection pane.
  8. Apparently this Twitter user had 50,864 posts, so it was nowhere near completion, and there were still 3 other users to go.
  9. Waited until the "waiting until date/time" passed.
  10. Got a new "waiting until date/time".
  11. I pressed Stop.
  12. Got a new status saying "Calculating unique downloads, removing duplicates ...".
  13. This took forever, and 20 minutes later I terminated the software.
  14. Started the software again.
  15. Enqueued all users again and pressed Crawl.
  16. It started with the same first user again, but this time showed something about "File already downloaded... Skipping".
  17. It got to the point where it started to download some new files.
  18. Now I have 4174 files downloaded (3964 videos/images + texts.txt).
  19. After these 23 (!) new files (20 videos/images) were downloaded, the error occurred again.
  20. Error 1: Limit exceeded: username You should lower the connections to tumblr API in the Settings -> Connection pane.
  21. Terminated the software.

Conclusions:

  1. The Twitter part of the software works up to a certain limit, but it takes forever to get any files beyond that limit. With only 20 new files the second time around, it will take days to complete the first user, if it ever reaches the finish line.

  2. All skipped files seem to be counted as requests that add to the limit counter.

Log: No TumblThree.log to be found in the TumblThree-v2.5.1-x64-Application folder.

thomas694 commented 2 years ago

OK, but now we are talking about a different thing, aren't we? It's no longer about downloading a few dozen recent posts, but about downloading historic posts (resp. complete blogs). Twitter doesn't want more posts than a certain limit to be downloaded. Obviously they changed something. We have to see whether we can find a solution or not.

The download of the "post lists" counts towards the limit, whether a post's media is downloaded or skipped.
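(Editorial note: this point is worth spelling out. Crawling has two phases: paging through the post list, where each page costs one rate-limited API request, and then downloading media, where skipping an already-present file saves bandwidth but no API requests. A hypothetical sketch; the function names are illustrative, not TumblThree's code:)

```python
def crawl_blog(fetch_page, download, already_have, page_size=100, total_posts=1000):
    """Enumerate all posts page by page, then download only missing media.

    fetch_page(offset) stands in for one rate-limited API request returning
    up to page_size posts. Skipping a file in the download phase does not
    reduce the number of fetch_page calls.
    """
    api_requests = 0
    downloaded = skipped = 0
    for offset in range(0, total_posts, page_size):
        posts = fetch_page(offset)  # counts toward the rate limit
        api_requests += 1
        for post in posts:
            if already_have(post):
                skipped += 1        # skipping saves no API request
            else:
                download(post)
                downloaded += 1
    return api_requests, downloaded, skipped
```

With the 50,864-post user from the report above and (say) 100 posts per list request, a full enumeration alone is 500+ API requests, no matter how many files are skipped, which matches the observation that re-runs hit the limit after only a handful of new downloads.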

T-prog3 commented 2 years ago

To my understanding:

I see no difference between updating an already downloaded blog and a completely new download. Both have the same "Number of posts" for the active user in the download queue.

In other words, you will never be able to update/download the second blog in the download queue if the first user has a large "Number of posts". The problem is that the software makes a request for each and every post the user has, no matter whether you do an update or download a new user. So you don't just get the recent ~100 posts you haven't downloaded yet; you get the full blog in the queue no matter what.

The problem with updating a blog would not be a problem if you only got the recent posts between Now and Last Complete Crawl in the queue.


thomas694 commented 2 years ago

First, you experience resp. describe something that I don't see here. It looks like most other users can update their existing blogs too.

> The problem with updating a blog would not be a problem if you only got the recent posts between Now and Last Complete Crawl in the queue.

That's exactly what we're doing, using precisely LastId (after a successful complete crawl).
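
(Editorial note: the LastId mechanism works like since_id-style pagination: only posts with ids newer than the stored high-water mark are fetched, so an up-to-date blog costs only a handful of requests. A simplified sketch; the names are assumptions, not TumblThree's actual index format:)

```python
def incremental_crawl(post_ids_newest_first, last_id):
    """Return only posts newer than last_id, plus the new high-water mark.

    post_ids_newest_first is a list of numeric post ids sorted newest first,
    standing in for paged API responses. Tweet ids increase over time, so
    "newer than" is simply "greater than".
    """
    new_posts = [pid for pid in post_ids_newest_first if pid > last_id]
    new_last_id = new_posts[0] if new_posts else last_id
    return new_posts, new_last_id
```

The catch, consistent with the reports above, is that the high-water mark is only advanced after a *successful complete* crawl; if the crawl aborts on a 429 first, the next run starts evaluating from the same LastId again.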

> In other words, you will never be able to update/download the second blog in the download queue

Right, not automatically and unattended. But you can, for example, remove this blog from the download queue, which stops its crawler and continues with the next one.

Let me summarize what I get (and probably others too):

  • Small blogs can be downloaded and updated without problems.
  • Any reasonably up-to-date blog can be updated without problems.
  • Only big blogs can no longer be downloaded completely and thus updated later. Experienced users could at least update them with a little tweaking (LastId).

The last point needs to be fixed, so that all posts up to the limit are downloaded and then the blog is marked as completely downloaded. This limit exists...[#161] That a workaround will not work forever should be clear and understandable.

> Obviously they changed something. We have to see, whether we can find a [workaround] solution or not.

If you know how to fix it, you are welcome to do so (or share it).

@Hrxn @desbest @cr1zydog I hope you don't mind. Can you still update your existing twitter blogs?

Hrxn commented 2 years ago

> @Hrxn @desbest @cr1zydog I hope you don't mind. Can you still update your existing twitter blogs?

I've never used Twitter with this App before, so my own experience here is a little limited.

That said, what you state here is obviously true:

> • Small blogs can be downloaded and updated without problems.
> • Any reasonably up-to-date blog can be updated without problems.
> • Only big blogs can no longer be downloaded completely and thus updated later. Experienced users could at least update them with a little tweaking (LastId).
>
> The last point needs to be fixed, so that all posts to the limit are downloaded and then the blog is marked as completely downloaded. This limit exists...[#161] That a workaround will not work forever should be clear and understandable.
>
> Obviously they changed something. We have to see, whether we can find a [workaround] solution or not.

The third point is the real issue, as I understand it, and yes, this is a limitation due to how Twitter works.

desbest commented 2 years ago

I can't download any blogs, new or old, few posts or large.

Someonemustnothavethis commented 2 years ago

I had this problem several months ago, but it hasn't bothered me since, and I didn't change anything other than the routine TumblThree updates. I catch up with all my Tumblr blogs once a month and add any newly discovered ones. I'm now following 257 Tumblr blogs (I know, I'm hooked!), and the last catch-up on the first of the month was 147 GB and 404,000 files. It took almost 24 hours to harvest everything, but it ran perfectly.
I'm using all default settings.