Closed sfc-gh-anoyes closed 3 years ago
This PR worked beautifully for me. (cc: @xumengpanda)
20210421-010208-kmakino-01bf6eabcc7e7790 compressed=True data_size=20650850 duration=113 ended=8 fail_fast=10 max_runs=8 pass=8 priority=100 remaining=0 runtime=0:02:11 sanity=False started=8 stopped=20210421-010419 submitted=20210421-010208 timeout=5400 username=kmakino
The only drawback is that when agents die, you need to wait for timeout + 10 seconds. This can potentially happen fairly frequently when running Joshua agents on spot instances. Maybe we should allow <N>% (or <N> agents) of overshoot, where N is a parameter passed by the user?
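As a rough illustration of the overshoot suggestion above, the cap on started could be computed as below. This is only a sketch: the function name `max_started_with_overshoot` and the parameters `overshoot_pct` and `overshoot_agents` are hypothetical and are not part of this PR.

```python
import math


def max_started_with_overshoot(max_runs, overshoot_pct=0.0, overshoot_agents=0):
    """Hypothetical helper: upper bound on `started` when overshoot is allowed.

    overshoot_pct and overshoot_agents stand in for the user-supplied <N>;
    both names are assumptions made for this illustration.
    """
    if overshoot_pct < 0 or overshoot_agents < 0:
        raise ValueError("overshoot must be non-negative")
    # Round the percentage allowance up so a small non-zero N still allows one extra run.
    extra_from_pct = math.ceil(max_runs * overshoot_pct / 100)
    # Use whichever allowance is larger; with both at 0 this is just max_runs.
    return max_runs + max(extra_from_pct, overshoot_agents)
```

With max_runs=8 and a 25% allowance, for example, this would let up to 10 runs start.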
Great!
If we overshoot, we will face the same issue: Joshua returns early without exposing a failed long-running test. In the worst-case scenario, Joshua will take timeout seconds longer to finish. I feel a slower test (say, 30 min slower) is better than an incorrect test.
I thought twice about this issue. We may need to overshoot by some number of agents, as @sfc-gh-kmakino suggested, and also have something similar to https://github.com/FoundationDB/fdb-joshua/pull/19 to make sure any started tests finish before Joshua finishes.
This comes from the following scenario: say max_runs=100K and timeout=30min, and 10 agents crash at the end of an ensemble run. We wait 30min + 10s to detect it and spawn 10 new tests on 10 agents. 2 of those 10 agents crash, so we wait another 30min + 10s. The Joshua run takes an extra 60min + 20s to finish.
A Joshua run can take multiple timeout periods to finish if agents crash frequently.
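The arithmetic in the scenario above can be checked with a small sketch. The helper name `extra_wait_seconds` is hypothetical; the 10-second grace period and the 30-minute timeout are the values quoted in this thread.

```python
def extra_wait_seconds(crash_rounds, timeout_seconds, grace_seconds=10):
    """Extra wall-clock time when agents crash in `crash_rounds` successive rounds.

    Each round costs one full timeout plus the grace period before the
    remaining agents are allowed to start replacement runs.
    """
    return crash_rounds * (timeout_seconds + grace_seconds)


# Two rounds of crashes with a 30-minute timeout, as in the scenario.
extra = extra_wait_seconds(crash_rounds=2, timeout_seconds=30 * 60)
print(divmod(extra, 60))  # (60, 20), i.e. 60min + 20s
```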
I do not think it's common to have such a high failure rate even with spot instances, but even 1 failed agent can delay the test by a significant amount of time, which isn't ideal. Also, another common case is when we deploy a new agent image on k8s. We usually delete all agent jobs to re-spawn them. We may need a way to force starting new tests without waiting for timeout + 10 seconds in that scenario.
Introduce new logic for "should I start a run for this ensemble?". Try not to overshoot max_runs by too much, but still read started at snapshot isolation to avoid serializing the start of all runs. In case another agent dies after incrementing started but before incrementing ended, agents will allow runs to start once they haven't seen ended change for timeout + 10 seconds.
It's possible that we may also want to change the "which ensemble should I try next?" logic to decrease the probability of choosing an ensemble that already has started >= max_runs.
Closes #1
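The decision described above might look roughly like the following. This is a minimal sketch and not the code in this PR: the function name, the `seconds_since_ended_changed` argument, and the early exit when ended has reached max_runs are all assumptions, and the snapshot read of started is not modeled here.

```python
def should_start_run(started, ended, max_runs, seconds_since_ended_changed,
                     timeout_seconds, grace_seconds=10):
    """Hypothetical version of "should I start a run for this ensemble?"."""
    if ended >= max_runs:
        # Assumption: enough runs have finished, so there is nothing left to start.
        return False
    if started < max_runs:
        # Normal case: the ensemble still needs more runs to be started.
        return True
    # started >= max_runs but ended < max_runs: some runs are still in flight,
    # or an agent died between incrementing started and incrementing ended.
    # Allow a new run only once ended has been unchanged for timeout + grace.
    return seconds_since_ended_changed >= timeout_seconds + grace_seconds
```

With timeout=5400, an agent that sees started=8, ended=6, max_runs=8 would hold off until ended has been unchanged for 5410 seconds, then start a replacement run.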