Bastian / bstats-metrics

The different bStats Metrics classes
https://bStats.org
MIT License

Make initial delay selection more sane #84

Closed · astei closed this 3 years ago

astei commented 3 years ago

See VelocityPowered/Velocity#476 for justification.

Bastian commented 3 years ago

Thank you for the PR!

The current design with two random delays was fully intentional and hitting ratelimits more often is an accepted compromise.

Let me start by explaining why the random delay even exists. The initial logic was very simple: wait 5 minutes after server startup and then send the data every 30 minutes. What I didn't consider back then was that many servers tend to restart at a fixed time on the full hour (xx:00). This caused a spike in requests every 30 minutes, at xx:05 and xx:35, which resulted in high CPU usage at those times (see the attached CPU usage graph).
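
For concreteness, here is a minimal sketch of that original scheme. It is an illustration only, not the historical bStats code; submitData is just a placeholder for whatever actually collects and sends the report:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class NaiveSchedule {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Placeholder for the code that actually collects and sends the report.
        Runnable submitData = () -> System.out.println("submitting plugin data");
        // Fixed 5-minute startup delay, then a fixed 30-minute period.
        // If many servers restart at xx:00, they all submit at xx:05, xx:35, ...
        scheduler.scheduleAtFixedRate(submitData, 5, 30, TimeUnit.MINUTES);
    }
}
```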

The first idea to solve this issue was very similar to your PR: just make the initial delay a bit random (e.g., MiniDigger suggested 5 min + a random amount between 0 and 15 min). This, however, comes with its own set of issues.

It means that one has to find a good balance and choose a random delay that is neither too short nor too long, or alternatively find another approach.

The current solution is a hybrid that tries to get the best of both worlds: it still uses a (short) initial random delay (between 3 and 6 minutes), but instead of simply sending the data every 30 minutes after that, it inserts a second random delay of 0 to 30 minutes and only then starts the regular 30-minute interval. This makes it very likely that a ratelimit is hit once initially (in nearly 50% of all cases, in fact). Since ratelimits are very (CPU-)inexpensive for the backend to handle, this is no issue at all. At the same time, it ensures that requests are evenly distributed and thus solves the problem quite nicely.
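
A minimal sketch of this hybrid timing, again as an illustration rather than the actual bStats Metrics implementation (submitData and the class name are placeholders):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HybridSchedule {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start(Runnable submitData) {
        // First submission: 3 minutes plus a random 0-3 minutes.
        long initialDelayMs = (long) (1000 * 60 * (3 + Math.random() * 3));
        // Extra random 0-30 minutes before the regular schedule begins.
        long secondDelayMs = (long) (1000 * 60 * 30 * Math.random());

        // One-off early submission...
        scheduler.schedule(submitData, initialDelayMs, TimeUnit.MILLISECONDS);
        // ...and the regular 30-minute interval, anchored at a random point
        // within the following half hour so requests spread out evenly.
        scheduler.scheduleAtFixedRate(
                submitData,
                initialDelayMs + secondDelayMs,
                1000L * 60 * 30,
                TimeUnit.MILLISECONDS);
    }
}
```

The one-off 429 mentioned above presumably happens when the early submission and the start of the fixed-rate schedule fall into the same ratelimit window on the backend.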

That being said, I agree that there is still space for improvements:

  1. It happens quite often that people report bugs because they received a 429 response from the backend. You have to enable the logFailedRequests config option to see it (the option is only intended for debugging), but it does look scary with its huge stack trace and WARN log level. It might be a good idea to handle this case differently and log a less verbose message without a stack trace, something along these lines: "Received 429 ratelimit from bStats. This is expected and not causing issues" (see the sketch after this list).
  2. The fixed 3-minute part of the initial delay (3 min + random between 0 and 3 min) is a remnant of the old bStats Metrics classes, which sent the data for all plugins in a single request and therefore needed to give the other plugins some time to start up (otherwise the first loaded plugin would initially only send its own data). This is not really necessary anymore and can probably be removed.
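
As a rough sketch of what point 1 could look like, assuming a java.util.logging setup; the class, field, and method names below are illustrative, not the real bStats Metrics internals:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative sketch only: class, field, and method names are assumptions,
// not the actual bStats Metrics internals.
public class MetricsResponseHandler {

    private final Logger logger = Logger.getLogger("bStats");
    private final boolean logFailedRequests; // mirrors the existing debug config option

    public MetricsResponseHandler(boolean logFailedRequests) {
        this.logFailedRequests = logFailedRequests;
    }

    public void handleResponse(int statusCode, Throwable error) {
        if (statusCode == 429) {
            // Expected occasionally because of the randomized schedule;
            // log a short, non-scary message without a stack trace.
            logger.info("Received a 429 ratelimit response from bStats. "
                    + "This is expected and not causing issues.");
            return;
        }
        if (error != null && logFailedRequests) {
            // Genuinely unexpected failures keep the verbose WARN + stack trace.
            logger.log(Level.WARNING, "Failed to submit plugin metrics to bStats", error);
        }
    }
}
```
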
astei commented 3 years ago

> Let me start by explaining why the random delay even exists.

I know why it exists - having made the Buycraft plugin, we ran into this issue too (and we had the advantage of near-infinite scaling on AWS!), and I implemented randomization to spread out the load from servers sending requests to pull purchases from Buycraft.

I'll probably implement your ideas here though, just to make the issue much less scary-looking.

astei commented 3 years ago

Decided to not pursue this anymore - more pressing things to do.

Bastian commented 3 years ago

Thanks anyway 🙂

I've created a new issue so it doesn't get forgotten: https://github.com/Bastian/bStats-Metrics/issues/86