Room-11 / CVBacklogUI

Stack Overflow Close-Vote BacklogUI
http://php.cv-pls.com/
MIT License

API Throttling Being Tripped Causing IP To Be Banned For Hours #15

Open ghost opened 11 years ago

ghost commented 11 years ago

I noticed the cache was not being updated, so I checked the files and saw that the question ids file was empty. I ran the script via CLI and got the following error.

Warning: file_get_contents(): HTTP request failed! HTTP/1.1 400 Bad Request
 in /usr/local/www/sites/cv-pls.com/dev/application/classes/Backlog.php on line 90

Warning: file_get_contents(compress.zlib://https://api.stackexchange.com/2.1/search/advanced?filter=%21wQ0g-ul-W8LDT0w&key=pMxerkFG8E257Xblt5BUHA%28%28&order=desc&pagesize=100&site=stackoverflow&sort=creation&tagged=php&page=1): failed to open stream: operation failed in /usr/local/www/sites/cv-pls.com/dev/application/classes/Backlog.php on line 90

What is weird is that accessing the API request path directly shows the API data. I ran the script a few times and then it correctly got the data. Nothing changed between it working and not working.

Anyone have an idea what could be happening? I don't think it is the API key since the requests are small and spread out and fall well within the throttling limit, unless it's an API bug.

DaveRandom commented 11 years ago

@KyraD The body of the 400 response quite possibly contains a more useful error message. You can have it return the content so you can inspect it by setting $ctx['http']['ignore_errors'] = true; in the stream context, but inspecting the response code to determine whether you got an error becomes messy when you do this.
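
Something along these lines should work (untested sketch; $apiUrl just stands in for whatever Backlog.php builds on line 90):

<?php
$ctx = stream_context_create([
    'http' => [
        'ignore_errors' => true, // return the body even on a 4xx/5xx response
    ],
]);

$body = file_get_contents('compress.zlib://' . $apiUrl, false, $ctx);

// With the plain http(s) wrapper the status line is available as
// $http_response_header[0] (e.g. "HTTP/1.1 400 Bad Request"); I'm not sure
// that survives the compress.zlib:// wrapper, which is part of the messiness.
var_dump($body);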

I'd recommend switching to cURL or (preferably from my point of view) Artax, as this will give much more granular control over the process.

I'll put together a PR for switching to cURL later today; I guess introducing a dependency on a 3rd-party lib needs a little more discussion before we jump in at the deep end.
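
For reference, the cURL version would look something like this (just a sketch with a made-up apiGet() helper, not the actual PR):

<?php
function apiGet($url)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of echoing it
        CURLOPT_ENCODING       => '',   // let cURL handle the gzip decoding
        CURLOPT_TIMEOUT        => 10,
    ]);

    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $error  = curl_error($ch);
    curl_close($ch);

    if ($body === false) {
        throw new RuntimeException('cURL error: ' . $error);
    }
    if ($status >= 400) {
        // the error body still comes back, so the API's details are inspectable
        throw new RuntimeException('API returned HTTP ' . $status . ': ' . $body);
    }

    return json_decode($body, true);
}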

ghost commented 11 years ago

@DaveRandom I'd like to use Artax for handling all the HTTP requests. We can always fork the code if it ends up no longer being maintained.

It turns out it was a throttling issue, so I'm going to reduce the API page requests for gathering question ids from 15 to 10. That should still return 200-300 ids, which is neither too much nor too little.

hakre commented 11 years ago

As written in chat, I've experienced 400 errors and the patterns are similar: first a request gives a 400, and then the same request works later on.

From my experience this happened when hitting the site too quickly. I checked for throttling headers but there were none. I have not found out which time-span triggers this result, but I did find out that it depends on how "expensive" the request is.

E.g. when searching one specific chatroom sorted by newest, it was not getting 400s as quickly as when I was searching all chatrooms. Sorting by stars even produced 400s on nearly every second request.

The script I used when getting the 400s looks like this: https://gist.github.com/hakre/0058ab9d2b9e914e0a49 - It uses file_get_contents under the hood. No magic in there. If you want to see more code, let me know; no problem to share, it's just not online yet.

ghost commented 11 years ago

Ok, my IP was unbanned by the API. I reduced the pages requested from 15 to 10 and also put sleep(1) right before the file_get_contents() call, which reduces the requests to 1 per second; according to the throttling documentation, any IP doing 30+ requests a second gets banned.

This is what was happening before:

So overall there were ~25 API requests, and that was only for the production site; the dev site was pulling the same amount at the same time, which would be ~50 combined.

Now it should be down to ~36 combined requests made when cron runs. If it does trip again even with the code changes and the request reduction, I will have to stagger the cron jobs. But really the sleep(1) call alone should fix it, since it reduces the calls to a max of 4 per second.
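
Roughly what the fetch loop does now (simplified sketch; the real code lives in Backlog.php):

<?php
$maxPages = 10; // down from 15

for ($page = 1; $page <= $maxPages; $page++) {
    sleep(1); // throttle ourselves to ~1 request per second

    $url = 'compress.zlib://https://api.stackexchange.com/2.1/search/advanced'
         . '?site=stackoverflow&tagged=php&sort=creation&order=desc'
         . '&pagesize=100&page=' . $page;

    $json = file_get_contents($url);
    // ... decode and collect question ids as before ...
}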

DaveRandom commented 11 years ago

Has @Shog9 been notified of the reason for the error? (I noticed he was hanging around a bit last night)

This should be reported to StackExchange, because they should be returning a 429 response when throttling is in force. That would be much more descriptive of the actual problem, and had they done this we wouldn't have spent 3 days debugging it.

Let me know and I will create a meta question and go and badger Shog and balpha about it (or you are welcome to do it yourself :-P)

ghost commented 11 years ago

@DaveRandom according to this http://api.stackexchange.com/docs/error-handling it should have sent a 502 error, considering it banned my IP for going over the throttle limit. A 400 code seems like the wrong code to send, as the request was perfectly valid.

However, it says this:

The HTTP code will be 400 (Bad Request) for all errors unless the method was called via JSONP, in which case even an error will be returned as a 200 (OK). This is necessary, as a 400 code will generally prevent a client side app from reading the remaining error details if the call was via JSONP. In rare cases (typically dealing with network wide maintenance or hardware failure), errors may occur in processing a request before the API determines whether a request is via JSONP; in these cases a 400 (Bad Request) is returned.

Not sure if that would apply to the BacklogUI code's method of retrieval.

Edit/Update:

Ah ok, so those error codes are not HTTP error codes, rather error codes specific to the API. So it seems 400 will pretty much be returned in all cases. I will need to perform error checking then. Also, the throttling adjustment did not work; the server is once again banned. I'm not understanding why the sleep() would not work. Should it be increased?
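
The error check would be something along these lines, assuming the body actually comes back (via ignore_errors or cURL); the field names are the ones listed in the error-handling docs:

<?php
// $body is the raw JSON string returned by the request above
$data = json_decode($body, true);

if (isset($data['error_id'])) {
    // e.g. error_id 502 (throttle_violation); error_message states the ban length
    throw new RuntimeException(sprintf(
        'API error %d (%s): %s',
        $data['error_id'],
        $data['error_name'],
        $data['error_message']
    ));
}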

DaveRandom commented 11 years ago

The above excerpt does seem to imply that further details of the error are included in the response body when this occurs.

those error codes are not HTTP error codes, rather error codes specific to the API

I don't like this. The API uses HTTP as a transport mechanism; it should use HTTP response codes correctly.

All this is really just moaning about something that we cannot control, so from a more constructive angle I will take a look at teh codez in a bit and see if there is any room for improvement that leaps out at me. I suspect the problem is simply that the API is not designed for retrieving so much data in such a short time frame, so I'll try and come up with a way to maintain the data more "efficiently" - in reality this will decrease accuracy, but we may be able to do it in such a way that it's not noticeable.

ghost commented 11 years ago

I was having the cron call the scripts at the same time, which was rather bad, so I've made it run them one after the other. But I did notice that even after that there was an occasion where it got banned again. The thing I don't understand is why the ban is so long. It says a few minutes in the docs, but I was banned for ~2hrs last time.

PeeHaa commented 11 years ago

I think we should really ping @kevin-montrose or some other dev/maintainer of the stack API about what is going on and how to possibly fix it.

DaveRandom commented 11 years ago

@KyraD

The thing I don't understand is why the ban is so long. It says a few minutes in the docs, but I was banned for ~2hrs last time.

IIRC it's an exponential increase for every request made during the ban period. The response JSON in the last response before you actually get banned has a back-off instruction in it IIRC (which is a time in seconds before the next request will be accepted); I need to go read the docs again to confirm this (I may be thinking of some other API, but I don't think so).
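
If memory serves, honouring it would be something like this (sketch; I believe it's the backoff key in the response wrapper, in seconds):

<?php
// $data is the decoded JSON from the last successful response
if (isset($data['backoff'])) {
    // the API is asking us to wait this many seconds before the next request
    sleep((int) $data['backoff']);
}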

I'm thinking that a daemon might be more appropriate than cron in this instance.

DaveRandom commented 11 years ago

@KyraD Can you indicate where in the repo the code that is being called by the cronjob is? I can't seem to work it out. Edit: never mind, I think I figured it out. But can you show the lines you have in your crontab?

ghost commented 11 years ago

My cron is:

*/2 * * * * bash -eu /usr/local/bin/cv-pls.sh

and cv-pls.sh is this; it calls the API and Chatroom data sources for both the dev and production sites:

#!/bin/bash -eu

/usr/local/php/dev/bin/php -c /usr/local/etc/php/dev /usr/local/www/sites/cv-pls.com/dev/web/backlog.php >/dev/null 2>&1
/usr/local/php/dev/bin/php -c /usr/local/etc/php/dev /usr/local/www/sites/cv-pls.com/prod/web/backlog.php >/dev/null 2>&1
/usr/local/php/dev/bin/php -c /usr/local/etc/php/dev /usr/local/www/sites/cv-pls.com/dev/web/backlog.php chatroom >/dev/null 2>&1
/usr/local/php/dev/bin/php -c /usr/local/etc/php/dev /usr/local/www/sites/cv-pls.com/prod/web/backlog.php chatroom >/dev/null 2>&1

Note: I should probably output to an error log, since the only output that should occur is a PHP error, as it's run using dev php.ini settings. Also, -eu kills the script, so if the first call errors the other 3 never get processed.

With the current BacklogUI cache values:


IIRC it's an exponential increase for every request made during the ban period. The response JSON in the last response before you actually get banned has a back-off instruction in it IIRC (which is a time in seconds before the next request will be accepted)

If that were the case then the ban should never have been lifted, as I made no effort to stop requests from being made once it was banned. Also, when I tested via cURL to see the error response, it told me I had x amount of seconds until the ban is lifted, and that number went down as I made more requests, indicating it was not counting negatively against me. The docs don't say much about how the time-frame is determined.