FoundationDB / fdb-joshua

FoundationDB Correctness service
Apache License 2.0

Try not to start more than "max_runs" runs #18

Closed: sfc-gh-anoyes closed this pull request 3 years ago

sfc-gh-anoyes commented 3 years ago

Introduce new logic for "should I start a run for this ensemble?". Try not to overshoot max_runs by too much, but still read started at snapshot isolation to avoid serializing the start of all runs. In case another agent dies after incrementing started but before incrementing ended, agents will allow runs to start once they haven't seen ended change for timeout + 10 seconds.
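
For illustration, here is a minimal Python sketch of that decision, assuming a hypothetical key layout (per-ensemble /started and /ended keys holding little-endian 64-bit counters) and agent-local tracking of when ended last changed; this is not the actual fdb-joshua code:

```python
import time
import fdb

fdb.api_version(630)

GRACE_SECONDS = 10  # slack added on top of the ensemble timeout


@fdb.transactional
def should_start_run(tr, ensemble_dir, max_runs, timeout, last_ended_change):
    """Decide whether this agent should start another run for the ensemble.

    ensemble_dir is a bytes key prefix; last_ended_change is agent-local state:
    the wall-clock time at which this agent last observed `ended` move.
    """
    # Read `started` through the snapshot so agents starting runs concurrently
    # do not conflict with (and serialize behind) each other.
    started = _counter(tr.snapshot[ensemble_dir + b"/started"])
    ended = _counter(tr[ensemble_dir + b"/ended"])

    if started < max_runs:
        return True

    # `started` can overcount when an agent dies after incrementing `started`
    # but before incrementing `ended`. If `ended` hasn't moved for
    # timeout + 10 seconds, assume those runs are lost and allow a new start.
    if started > ended and time.time() - last_ended_change > timeout + GRACE_SECONDS:
        return True

    return False


def _counter(value):
    # Counters are assumed to be little-endian 64-bit integers (atomic-add format).
    return int.from_bytes(bytes(value), "little") if value.present() else 0
```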

We may also want to change the "which ensemble should I try next?" logic to decrease the probability of choosing an ensemble that already has started >= max_runs.
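
A rough sketch of how that selection bias could look (the candidate tuple shape and the small pick probability are made up for the example):

```python
import random


def choose_ensemble(candidates, full_pick_probability=0.05):
    """Pick an ensemble to work on.

    candidates is a list of (ensemble_id, started, max_runs) tuples observed at
    snapshot isolation. Ensembles with started >= max_runs are only picked with
    a small probability, so agents rarely pile onto full ensembles, yet a
    stalled ensemble (dead agents) can still eventually be retried.
    """
    not_full = [c for c in candidates if c[1] < c[2]]
    if not_full and random.random() > full_pick_probability:
        return random.choice(not_full)
    return random.choice(candidates) if candidates else None
```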

Closes #1

sfc-gh-kmakino commented 3 years ago

This PR worked beautifully for me. (cc: @xumengpanda) 20210421-010208-kmakino-01bf6eabcc7e7790 compressed=True data_size=20650850 duration=113 ended=8 fail_fast=10 max_runs=8 pass=8 priority=100 remaining=0 runtime=0:02:11 sanity=False started=8 stopped=20210421-010419 submitted=20210421-010208 timeout=5400 username=kmakino

The only drawback is that when agents die, you need to wait for timeout + 10 seconds. This can potentially happen fairly frequently when running Joshua agents on spot instances. Maybe we should allow <N>% (or <N> agents) of overshoot, where N is a parameter passed by the user?
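
A sketch of how such a knob might look (the parameter names and the max-of-two-budgets rule are hypothetical):

```python
def allowed_runs(max_runs, overshoot_percent=0, overshoot_agents=0):
    # Allow either a percentage-based or an absolute overshoot budget,
    # whichever is larger, on top of max_runs.
    budget = max(max_runs * overshoot_percent // 100, overshoot_agents)
    return max_runs + budget


# e.g. max_runs=8 with a 25% overshoot budget permits up to 10 started runs
assert allowed_runs(8, overshoot_percent=25) == 10
```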

xumengpanda commented 3 years ago

Great!

If we overshoot, we will face the same issue: Joshua returns early without exposing a failed long-running test. In the worst-case scenario, Joshua will take timeout seconds longer to finish. I feel a slower test (say, 30 min slower) is better than an incorrect test.

xumengpanda commented 3 years ago

On second thought, we may need to allow some amount of agent overshoot, as @sfc-gh-kmakino suggested, and also have something similar to https://github.com/FoundationDB/fdb-joshua/pull/19 to make sure any started tests finish before Joshua finishes.

This concern comes from the following scenario: say max_runs=100K and timeout=30min, and 10 agents crash at the end of an ensemble run. We wait 30min + 10s to detect it and spawn 10 new tests on 10 agents. Then 2 of those 10 agents crash, and we wait another 30min + 10s. The Joshua run takes an extra 60min + 20s to finish.

A Joshua run can take multiple extra timeouts to finish if agents crash frequently.
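
A quick back-of-the-envelope version of that arithmetic (a throwaway helper, not part of fdb-joshua):

```python
def extra_delay_seconds(crash_rounds, timeout_s=30 * 60, grace_s=10):
    # Each round of late crashes adds roughly timeout + grace seconds before
    # replacement runs are allowed to start.
    return crash_rounds * (timeout_s + grace_s)


# Two rounds of crashes near the end (10 agents, then 2 of their replacements):
print(extra_delay_seconds(2))  # 3620 seconds, i.e. 60 min + 20 s
```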

sfc-gh-kmakino commented 3 years ago

I do not think it's common to have such a high failure rate even with spot instances, but even 1 failed agent can delay the test by a significant amount of time, which isn't ideal. Another common case is when we deploy a new agent image on k8s: we usually delete all agent jobs to re-spawn them. We may need a way to force-start new tests without waiting for timeout + 10 seconds in that scenario.
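
One hypothetical way to support that (not an existing fdb-joshua command; key names and counter encoding follow the assumptions in the earlier sketch) would be an admin helper that resets started back down to ended when the operator knows the missing agents are gone for good:

```python
import fdb

fdb.api_version(630)


@fdb.transactional
def force_release_lost_runs(tr, ensemble_dir):
    # For deliberate agent teardown (e.g. a k8s redeploy): forget runs that
    # were started but will never end, so replacement runs can start
    # immediately instead of waiting timeout + 10 seconds for stall detection.
    started_key = ensemble_dir + b"/started"
    started = tr[started_key]
    ended = tr[ensemble_dir + b"/ended"]
    started_n = int.from_bytes(bytes(started), "little") if started.present() else 0
    ended_n = int.from_bytes(bytes(ended), "little") if ended.present() else 0
    if started_n > ended_n:
        tr[started_key] = ended_n.to_bytes(8, "little")
```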