TumblThreeApp / TumblThree

A Tumblr and Twitter Blog Backup Application
https://TumblThreeApp.github.io
MIT License
624 stars 75 forks

Tumblr v1 API has new day limits #46

Closed Klimax closed 4 years ago

Klimax commented 5 years ago

Tumblr has imposed new daily limits on the v1 APIs. Each version of the API (XML vs. JSON) has its own limit. Precise information is not available, so it is not clear how many requests can be made or when the limit resets.

Note: I have rewritten the original bug report.

willemijns commented 5 years ago

I thought the author had the idea of (or allowed) downloading data by simulating a regular browser connection...

Klimax commented 5 years ago

After some observation, it seems the limit is around 200 000 posts. (That is insufficient for some giant blogs, which can easily be double or triple that number.)

One way around this would be resumable scanning, where the application pauses when the error is received and continues scanning the next day after the limit resets.
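The resumable scanning suggested above could be sketched roughly as follows. This is only an illustration in Python, not TumblThree's code: `fetch_page`, `LimitExceeded`, and the JSON state file are hypothetical names chosen for the example, and the page size of 50 is taken from the posts-per-request figure mentioned later in this thread.

```python
import json
import os


class LimitExceeded(Exception):
    """Raised by the (hypothetical) fetcher when the daily limit is hit."""


def scan_resumable(fetch_page, total_posts, state_path, page_size=50):
    """Scan a blog page by page, starting from the last saved offset.

    Returns True when the scan finished and False when it was paused
    because the limit was hit; in that case the offset is saved so a
    later run (e.g. the next day) can continue where this one stopped.
    """
    offset = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            offset = json.load(f)["offset"]

    while offset < total_posts:
        try:
            fetch_page(offset, page_size)  # assumed to raise LimitExceeded
        except LimitExceeded:
            # Remember the page that failed so it is retried on resume.
            with open(state_path, "w") as f:
                json.dump({"offset": offset}, f)
            return False
        offset += page_size

    # Finished: a stale state file must not affect the next full scan.
    if os.path.exists(state_path):
        os.remove(state_path)
    return True
```

Calling `scan_resumable` again with the same `state_path` after the limit has reset would pick up at the page that failed, so no page is skipped and none before it is fetched twice.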

willemijns commented 5 years ago

I have only 4434 posts and programs stops at 4328...

Klimax commented 5 years ago

Bump. Can we get this fixed? The following blog cannot be downloaded (it has more than 400 000 posts): https://nyenke.tumblr.com/

It cannot even be properly scanned.

johanneszab commented 4 years ago

Tumbex.com seems to be using the same interface for scraping as we do (the Tumblr v1 API). I guess that's why they lowered the API limits again. There is no fix for that.

But you can already lower the API access limits via Settings -> Connection tab -> Limit Connection to Tumblr API. There you have to try a lower value pair than the 90 / 60 values.

I've noticed that hidden (non-public) Tumblr blogs no longer exist as they used to. Everything can now be accessed via the Tumblr v1 API (hence Tumbex using it). Thus, I've added code which allows you to change the scraping implementation in the details panel. For Tumblr blogs, you can now select between the

You can access regular blogs using the SVC as well. I've never tested if this service is rate limited, or if it allows higher rates (i.e. more connections per time) than the regular Tumblr V1 API. Thus, maybe it's worth a try for your large blogs.

willemijns commented 4 years ago

I will try in a couple of months, because I will now need a backup around June 2020 ;)

Klimax commented 4 years ago

I'm pretty sure Limit Connection has no effect on the v1 limits. At least there was no observable difference when I tried lowering the values. I will try SVC. I have some very large blogs to download, so any rate limits should be discovered quickly.

johanneszab commented 4 years ago

If you use tumbex.com for browsing new blogs, displaying any content on that site will also use the Tumblr API V1, hence reducing your rate limit. It's probably not much, but I just wanted to mention it.

You can see the requests made in the Developer Tools of your browser (Ctrl-Shift-I -> Network tab).

156420591 commented 4 years ago

First, thanks to all of you for your work. On March 23rd 2020 I downloaded a giant blog of about 30,000 posts. Since March 24th 2020 I have been getting "Limit exceeded: yourTumblrBlogName. You should lower the connections to the tumblr api in the Settings->Connection pane." Today, March 25th 2020, I still get this error, even after changing the concurrent threads to 1. So can't I download blogs from Tumblr any more? (Screenshots attached: connectionsSetting, problems, versions)

johanneszab commented 4 years ago

Hi there!

First of all, sorry if the error message was not understandable. As a developer, it's sometimes hard to put yourself in the position of someone who has never used the application before.

The error message "Limit exceeded: {0}. You should lower the connections to the tumblr api in the Settings->Connection pane." tells you that you should change the values marked in this image: (screenshot: connection_settings). It looks like you've adjusted the connection values at the top. Those values are purely for TumblThree's downloader, and most of the time you can leave them as they are. I tend to download only one blog at a time (Concurrent blogs: 1). Tuning those values a little might increase your bandwidth utilization, however.

The values I've marked define how often the TumblThree crawler connects to the API or SVC service. The API is rate limited, meaning your IP/host can only connect a certain number of times per timeframe to Tumblr's API service. Currently TumblThree is set to make 3 connections per 2 seconds (90 connections in 60 seconds) to this service. The API service provides JSON/XML data containing the blog's posts. By default, TumblThree gets 50 posts per API request (connection). Thus, after 10 seconds and 15 requests to the API, TumblThree could potentially download 750 posts/images (3/2 × 10 × 50 = 750). Most of the time the download of the actual post content (e.g. the images, videos, ...) will take longer than that, however, so you could decrease those API connection values to, say, 60 connections in 60 seconds instead of 90 connections in 60 seconds.

This is what the error message tells you. If the "Limit exceeded" message pops up, the Tumblr API has denied TumblThree's request to fetch information about some 50 blog posts, because we tried to get too much data (from Tumblr's point of view).
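The arithmetic above, and the idea of capping connections per timeframe, can be illustrated with a small Python sketch. It is not TumblThree's actual implementation: `posts_per_window` and `SlidingWindowLimiter` are hypothetical names, and a sliding-window limiter is just one common way to enforce "N connections in T seconds".

```python
import time
from collections import deque


def posts_per_window(connections, per_seconds, window_seconds,
                     posts_per_request=50):
    """Upper bound on posts whose data can be requested in window_seconds."""
    requests = connections * window_seconds // per_seconds
    return requests * posts_per_request


class SlidingWindowLimiter:
    """Allow at most max_calls within any span of `period` seconds."""

    def __init__(self, max_calls, period,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock      # injectable so the limiter can be tested
        self.sleep = sleep
        self.stamps = deque()   # timestamps of the most recent calls

    def acquire(self):
        """Block until one more call fits into the current window."""
        while True:
            now = self.clock()
            # Forget calls that have left the window.
            while self.stamps and now - self.stamps[0] >= self.period:
                self.stamps.popleft()
            if len(self.stamps) < self.max_calls:
                self.stamps.append(now)
                return
            # Wait until the oldest remembered call leaves the window.
            self.sleep(self.period - (now - self.stamps[0]))


# 90 connections per 60 s, 10 s of crawling, 50 posts per request:
print(posts_per_window(90, 60, 10))  # 750
# The lowered setting suggested above, 60 connections per 60 s:
print(posts_per_window(60, 60, 10))  # 500
```

A crawler built on this sketch would call `limiter.acquire()` before each API request. Note that a limiter created as `SlidingWindowLimiter(90, 60)` permits a burst of 90 requests at once, whereas `SlidingWindowLimiter(3, 2)` spreads the same average rate more evenly.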

willemijns commented 4 years ago

Hello,

If the default values can be too high, maybe you could offer a menu option like "low speed but more results". Using a multithreaded scheme often amounts to crawling anyway ;)

johanneszab commented 4 years ago

However, I've just tried it myself and I could crawl 63'000 blog posts using the Tumblr API with the default limits. Hence, I think they didn't actually lower the values.

There are several ways how the "Limit exceeded" could still get triggered:

Hence, if you browse Tumbex while downloading with TumblThree, Tumbex will perform JavaScript requests using your browser (i.e. your IP) and therefore consume from the same limit as TumblThree does. In that case you might want to lower the Tumblr API connection settings as mentioned above.

I'll test some more blogs this night in parallel. Maybe I'm wrong after all.

johanneszab commented 4 years ago

And lastly, as mentioned in the latest release notes, you can now change the Tumblr blog crawler implementation and switch it for all blogs from the Tumblr API to the Tumblr SVC. To do so, select all blogs (Ctrl-A) and change this value from Tumblr API to Tumblr SVC: (screenshot: crawler_selection)

The Tumblr SVC (probably stands for service) was used to display hidden (login required) blogs before the NSFW ban came. This service might not be rate limited at all. I've never tested it as I'm not using TumblThree (or Tumblr) myself anymore that much.

willemijns commented 4 years ago

It just worked with the API... I don't know whether they changed the API to allow crawling?

thomas694 commented 4 years ago

The issue has been closed. You can still comment. Feel free to ask for reopening the issue if needed.