Closed vcunat closed 5 months ago
I forgot to say that this has been deployed for about 30h now, but let me open it here for discussion (and further tweaks perhaps).
Ideally we'd have some mechanism that prevents further scaling if rhea can't keep up feeding the machines with work. Currently the scaling is based just on the count of jobs ready to be built, but there clearly are some other bottlenecks sometimes.
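To make the idea concrete, here is a minimal sketch of such a guard. It is not the deployed logic: the function name, its parameters, and the 0.7 utilization threshold are all hypothetical, and it assumes we can cheaply tell how many of the running machines currently have work.

```python
def desired_machines(ready_jobs, jobs_per_machine, max_machines,
                     running_machines, busy_machines, min_utilization=0.7):
    """Hypothetical scaling rule (illustration only, not the deployed one)."""
    # Baseline: scale purely on the count of ready jobs (ceiling division),
    # capped at the configured maximum.
    wanted = min(max_machines, -(-ready_jobs // jobs_per_machine))

    # Guard: if the machines we already have are mostly idle, the central
    # machine is probably the bottleneck, so don't grow past the current count.
    if running_machines > 0:
        utilization = busy_machines / running_machines
        if utilization < min_utilization:
            wanted = min(wanted, running_machines)

    return wanted
```

With 100 ready jobs and 10 jobs per machine this asks for 10 machines when the existing ones are busy, but stays at 4 if only 1 of the 4 running machines has work.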
My understanding is that we'll be using the scaling much less by May anyway, so the impact of this issue should become lower than now.
The changes are intentionally harsher for aarch64 than for x86_64, as that seems desirable.
It's true that we shouldn't keep this open/unresolved for too long, as deploying something else to rhea would undo the changes.
It's been commonly happening that we spin up many machines but can't keep them occupied due to bottlenecks in the central machine (probably it's mostly the compression of the copying-results step).
So let's scale less aggressively and thus waste less.