battlecode / galaxy

Making scaledown of match / execute servers more gradual and slower #729

Open n8kim1 opened 6 months ago

n8kim1 commented 6 months ago

To alleviate #605

This would slightly increase the cost of year-round runs, especially when someone runs a single match at some random point in the middle of the year. But the increase shouldn't be much anyway.

Happy to tweak params, or just not do this anyways. I mainly made this PR just to close my tabs xp

https://cloud.google.com/compute/docs/autoscaler#scale-in_controls
https://cloud.google.com/compute/docs/autoscaler/understanding-autoscaler-decisions#delays_in_scaling_in
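For anyone skimming, here is a rough toy model of what a scale-in control does, as I read the docs linked above. It is a sketch, not the autoscaler's actual algorithm: `simulate_scale_in`, `max_reduction`, and `window` are made-up names for illustration, and one "tick" stands in for whatever the evaluation interval is. The assumption is that the group may not drop more than `max_reduction` machines below the peak recommended size seen in the trailing window.

```python
def simulate_scale_in(recommended, max_reduction, window):
    """Toy model of a scale-in control (hypothetical helper, not a GCP API).

    recommended   -- recommended group size at each tick, from load alone
    max_reduction -- max machines that may be removed relative to the peak
    window        -- number of trailing ticks (including the current one)
                     over which the peak recommendation is tracked
    Returns the group size actually used at each tick.
    """
    sizes = []
    for t, want in enumerate(recommended):
        # Peak recommendation over the trailing window, current tick included.
        start = max(0, t - window + 1)
        peak = max(recommended[start:t + 1])
        # Assumed rule: never go more than max_reduction below that peak.
        floor = max(0, peak - max_reduction)
        # Scaling out is unrestricted; only scaling in is held back.
        sizes.append(max(want, floor))
    return sizes


# Queue drains at tick 2 and stays empty: the group steps down, then empties.
print(simulate_scale_in([10, 10, 0, 0, 0, 0], max_reduction=3, window=3))
# Short lull followed by new work: most machines are still around.
print(simulate_scale_in([10, 0, 0, 10], max_reduction=3, window=3))
```

In this model a short lull in the queue no longer tears the whole group down, at the price of paying for idle machines until the window passes. That is the cost tradeoff mentioned above.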

j-mao commented 6 months ago

Noting that #720 already alleviates this a lot by reducing the number of machines. If we can come up with a test plan to evaluate this before and after, we can deploy and see what works better.

Noting also that scaling is already delayed by 1-2 minutes because pub/sub metrics take time to propagate. So by the time scaling in happens, the scrim queue is actually far below the assignment ratio.

n8kim1 commented 6 months ago

About #720 -- yes, good catch (I had thought about that but forgot to mention it). For a plan... what if I watch the queue during the next tournament as-is, and we evaluate how often scrimmage servers get interrupted (i.e. how much of the bad behavior is still present)? I can also deploy this version and watch a tournament with it too.

About the delay -- good to know, thanks. That should help a ton for scaling in, and I can make scaling out a bit faster. Unfortunately, the baked-in 1-2 minute delay probably isn't long enough on its own, though.