Previously, when enough agents were already running for the existing ensembles,
the remaining agents waited in a sleep loop in case some of the running agents died.
For example, if you run 100 agents and only 8 tests remain,
8 agents perform the tests and the other 92 agents simply sit in a sleep 1 loop.
When those 8 agents finish, the other 92 agents also exit.
The problem arises when another ensemble with max-run=50 is submitted while those 92 agents are sleeping:
50 agents pick up the new tests, but the remaining 42 agents continue to sleep.
This is extremely inefficient, because those 42 agents stay alive without doing any work.
This PR allows agents to exit when enough agents are already running.
If some agents die, the agent scaler will spin up new agents on k8s, or other running agents will pick up the cancelled job when they finish their current jobs.
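
To illustrate the behavior change, here is a minimal sketch of the decision an agent makes when it looks for work. It is not the code in this PR: the names `idle_agent_action`, `simulate`, `busy_agents`, `needed_agents`, and `exit_when_saturated` are hypothetical, and the model assumes each agent only compares the number of busy agents with the number of agents the pending tests can use.

```python
def idle_agent_action(busy_agents: int, needed_agents: int,
                      exit_when_saturated: bool = True) -> str:
    """Decide what an agent without a job does (hypothetical model)."""
    if busy_agents < needed_agents:
        # There is still a test nobody is running: take it.
        return "work"
    # Enough agents are already running.
    # Old behavior: sleep and poll again. New behavior: exit.
    return "exit" if exit_when_saturated else "sleep"


def simulate(total_agents: int, needed_agents: int,
             exit_when_saturated: bool = True) -> dict:
    """Start total_agents one by one and count what each ends up doing."""
    counts = {"work": 0, "sleep": 0, "exit": 0}
    busy = 0
    for _ in range(total_agents):
        action = idle_agent_action(busy, needed_agents, exit_when_saturated)
        if action == "work":
            busy += 1
        counts[action] += 1
    return counts


if __name__ == "__main__":
    # 100 agents, 8 tests remaining, as in the example above.
    print("old:", simulate(100, 8, exit_when_saturated=False))
    print("new:", simulate(100, 8, exit_when_saturated=True))
```

Under these assumptions, the old behavior leaves 92 agents sleeping, while the new behavior lets them exit and relies on the agent scaler to start agents again when a new ensemble needs them.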