FoundationDB / fdb-joshua

FoundationDB Correctness service
Apache License 2.0

Let agents exit when enough agents are serving existing ensembles #52

Closed sfc-gh-kmakino closed 3 years ago

sfc-gh-kmakino commented 3 years ago

When enough agents are already running for the existing ensembles, the remaining agents wait in a sleep loop in case some of the running agents die.

For example, when you run 100 agents and 8 tests remain, 8 agents perform the tests and the other 92 simply sit in a sleep 1 loop. When those 8 agents finish, the other 92 agents also exit.

The problem is that if another ensemble with max-run=50 is submitted while those 92 agents are sleeping, 50 agents pick up the new tests but the remaining 42 continue to sleep. This is extremely inefficient.
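The agent counts in this example can be checked with a small sketch. The function name split_agents is hypothetical and is not part of fdb-joshua; it only models "each open test slot is taken by one available agent, and the rest stay idle".

```python
from typing import Tuple


def split_agents(available: int, demand: int) -> Tuple[int, int]:
    """Return (working, idle) for a pool of agents and a number of open test slots.

    Hypothetical helper for illustration only: one agent takes one slot,
    and any agents beyond the demand stay idle.
    """
    working = min(available, demand)
    return working, available - working


# 100 agents, 8 tests remaining: 8 work, 92 sleep.
working, idle = split_agents(100, 8)

# A new ensemble with max-run=50 arrives while those 92 are sleeping:
# 50 pick up the new tests, 42 keep sleeping.
picked_up, still_idle = split_agents(idle, 50)
```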

This PR allows agents to exit if enough agents are already running. If some agents die, the agent scaler will spin up new agents on k8s, or another running agent will pick up the cancelled job when it finishes its current one.
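A minimal sketch of the exit decision described here, under stated assumptions: the Ensemble class, its remaining and running fields, and should_exit are hypothetical names for illustration and do not reflect the actual fdb-joshua code or the change in this PR.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Ensemble:
    """Hypothetical view of one ensemble as seen by an idle agent."""
    remaining: int  # test runs still needed (assumed field)
    running: int    # agents currently working on this ensemble (assumed field)


def should_exit(ensembles: Iterable[Ensemble]) -> bool:
    """Return True when every ensemble already has enough agents.

    An idle agent that sees no uncovered work exits instead of sleeping.
    If a running agent later dies, recovery is left to the agent scaler
    or to another agent that finishes its current job.
    """
    return all(e.running >= e.remaining for e in ensembles)


# 8 tests remain and 8 agents are on them: an idle agent can exit.
covered = should_exit([Ensemble(remaining=8, running=8)])

# A new ensemble with 50 runs and no agents yet: an idle agent should stay.
uncovered = should_exit([Ensemble(remaining=8, running=8),
                         Ensemble(remaining=50, running=0)])
```

In this sketch an agent with nothing to do exits immediately rather than polling, which trades a small restart cost for not holding idle capacity.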

sfc-gh-kmakino commented 3 years ago

Let me close this. It still has some edge cases. I'll spend more time testing.