cluebotng / reviewng

ClueBot NG's NG Review Interface

cluebotng-review Toolforge tool is making lots of internal HTTP requests #12

Open · supertassu opened 2 years ago

supertassu commented 2 years ago

Hello! While working on an unrelated Toolforge problem, I noticed that the cluebotng-review tool is making lots of HTTP requests to the cluebotng tool (cluebotng.toolforge.org). As far as I can see, there are two problems here:

  1. First, the requests have a generic Go-http-client/2.0 user-agent. This violates the user-agent policy and makes it more difficult for us to track down misbehaving clients.
  2. More importantly, the requests are happening at a very high rate; I'm seeing traffic upwards of 300 reqs/second. Due to some misbehaving crawlers we've had to apply per-IP rate limits to all Toolforge tools. These are currently limiting traffic to around 50-100 requests/second. If your tool needs higher rate limits we can figure something out, but if you can live with lower traffic levels, please do consider implementing some rate limiting on your client side.
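
Both fixes are small on the client side. As a rough illustration only (hypothetical names and numbers, not reviewng's actual code), a Go client could stamp a policy-compliant User-Agent on every request and throttle itself with golang.org/x/time/rate:

```go
// A minimal sketch of both fixes: a descriptive User-Agent on every
// outgoing request, plus client-side throttling of the bulk queries.
package throttled

import (
	"context"
	"net/http"

	"golang.org/x/time/rate"
)

// uaTransport wraps another RoundTripper and stamps a User-Agent header.
type uaTransport struct {
	base http.RoundTripper
	ua   string
}

func (t *uaTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	r := req.Clone(req.Context()) // RoundTrippers must not mutate the caller's request
	r.Header.Set("User-Agent", t.ua)
	return t.base.RoundTrip(r)
}

// NewClient returns an *http.Client whose requests always carry the header.
func NewClient() *http.Client {
	return &http.Client{
		Transport: &uaTransport{
			base: http.DefaultTransport,
			// Placeholder value; use the tool's real name and contact details.
			ua: "ClueBotNG-Review (https://cluebotng-review.toolforge.org)",
		},
	}
}

// FetchAll performs the bulk queries serially, blocking between requests
// to respect the configured rate.
func FetchAll(ctx context.Context, client *http.Client, urls []string) error {
	// Cap at 20 req/s (burst 5): an assumed figure, comfortably under
	// the 50-100 req/s per-IP limits mentioned above.
	limiter := rate.NewLimiter(rate.Limit(20), 5)
	for _, u := range urls {
		if err := limiter.Wait(ctx); err != nil {
			return err // context cancelled or deadline exceeded
		}
		resp, err := client.Get(u)
		if err != nil {
			return err
		}
		resp.Body.Close() // real code would read/decode the body first
	}
	return nil
}
```

The 20 req/s cap and the User-Agent string are placeholders; the point is that the transport wrapper guarantees the header on every request, while limiter.Wait paces the serial loop.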

Thanks in advance for any fixes!

DamianZaremba commented 2 years ago

Hi,

> Hello! While working on an unrelated Toolforge problem, I noticed that the cluebotng-review tool is making lots of HTTP requests to the cluebotng tool (cluebotng.toolforge.org). As far as I can see, there are two problems here:

> 1. First, the requests have a generic Go-http-client/2.0 user-agent. This violates the user-agent policy and makes it more difficult for us to track down misbehaving clients.

This should now be corrected to set the standard 'generic cluebot' user-agent header. Unfortunately it was missed when porting the old logic, as https://phabricator.wikimedia.org/T233347 increased the overhead of viewing basic access logs (the endpoint serves the live data for ClueBotNG).

> 2. More importantly, the requests are happening at a very high rate; I'm seeing traffic upwards of 300 reqs/second. Due to some misbehaving crawlers we've had to apply per-IP rate limits to all Toolforge tools. These are currently limiting traffic to around 50-100 requests/second. If your tool needs higher rate limits we can figure something out, but if you can live with lower traffic levels, please do consider implementing some rate limiting on your client side.

Is 300 reqs/second actually causing an issue? All the bulk requests are made in a serial manner; currently the API endpoint is returning an error, so it responds in ~0.044098 seconds, which would make >300 reqs/second not totally unreasonable.

I was in the process of moving this over to a dedicated endpoint for the bulk queries, which is now working and takes ~4 seconds (due to HTTP API queries), so the overall rate should drop.

With that said, are you sure some of these queries are not coming from https://github.com/cluebotng/trainer, which will fall back to the API endpoint if the (cached) data is not returned by cluebotng-review? (Effectively none was cached, due to the previous endpoint returning a 404.)
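
For reference, that fallback behaviour might look roughly like the sketch below; it's a hypothetical illustration (endpoint URLs and names are mine), not the trainer's actual code:

```go
// A rough sketch of cache-then-fallback: prefer the (cached) data served
// by cluebotng-review, and only hit the live API endpoint when that fails.
package trainer

import (
	"fmt"
	"io"
	"net/http"
)

func fetchWithFallback(client *http.Client, reviewURL, apiURL string) ([]byte, error) {
	if resp, err := client.Get(reviewURL); err == nil {
		if resp.StatusCode == http.StatusOK {
			defer resp.Body.Close()
			return io.ReadAll(resp.Body)
		}
		// e.g. the 404s seen when nothing was cached; fall through.
		resp.Body.Close()
	}

	// Fall back to the live API endpoint on the cluebotng tool.
	resp, err := client.Get(apiURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("api fallback failed: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}
```

If the review endpoint 404s for everything, every lookup lands on the API endpoint, which would explain the extra traffic.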


supertassu commented 2 years ago

>> 1. First, the requests have a generic Go-http-client/2.0 user-agent. This violates the user-agent policy and makes it more difficult for us to track down misbehaving clients.

> This should now be corrected to set the standard 'generic cluebot' user-agent header. Unfortunately it was missed when porting the old logic, as https://phabricator.wikimedia.org/T233347 increased the overhead of viewing basic access logs (the endpoint serves the live data for ClueBotNG).

Thank you!

>> 2. More importantly, the requests are happening at a very high rate; I'm seeing traffic upwards of 300 reqs/second. Due to some misbehaving crawlers we've had to apply per-IP rate limits to all Toolforge tools. These are currently limiting traffic to around 50-100 requests/second. If your tool needs higher rate limits we can figure something out, but if you can live with lower traffic levels, please do consider implementing some rate limiting on your client side.

> Is 300 reqs/second actually causing an issue? All the bulk requests are made in a serial manner; currently the API endpoint is returning an error, so it responds in ~0.044098 seconds, which would make >300 reqs/second not totally unreasonable.

Doing the bulk requests in serial sounds fine from my perspective, although a ~0.044098 s response time works out to about 23 requests per second (1 / 0.044098 ≈ 22.7) rather than >300?

Right now your traffic isn't directly causing issues for the shared infrastructure, but it's getting caught in the rate limiters I set up to block other traffic that was causing issues. I can try to tune the rate limiters to focus more on concurrency, so that fast but serial requests aren't affected as much.
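
From the client side, bounding concurrency rather than raw rate would look something like the sketch below; a hypothetical illustration (function and parameter names are mine, not reviewng's) using a simple counting-semaphore pattern:

```go
// A sketch of bounding *concurrency*: at most maxInFlight requests run
// at once, regardless of how quickly each one completes.
package fetcher

import (
	"net/http"
	"sync"
)

func fetchConcurrent(client *http.Client, urls []string, maxInFlight int) {
	sem := make(chan struct{}, maxInFlight) // counting semaphore
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot (blocks once maxInFlight is reached)
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			if resp, err := client.Get(u); err == nil {
				resp.Body.Close() // real code would handle the body and errors
			}
		}(u)
	}
	wg.Wait()
}
```

A fully serial client is just the maxInFlight = 1 case, which is why a concurrency-focused limiter shouldn't penalise it.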

> I was in the process of moving this over to a dedicated endpoint for the bulk queries, which is now working and takes ~4 seconds (due to HTTP API queries), so the overall rate should drop.

> With that said, are you sure some of these queries are not coming from https://github.com/cluebotng/trainer, which will fall back to the API endpoint if the (cached) data is not returned by cluebotng-review? (Effectively none was cached, due to the previous endpoint returning a 404.)

As far as I can see from my logs, the bulk of requests to the cluebotng tool are coming from the cluebotng-review tool. I don't see such heavy traffic to the cluebotng-review tool itself.