Open sfc-gh-satherton opened 3 years ago
I'm not sure if this happens strictly when there are timed-out tests.
I think this is what's happening:
`try_starting_test` can return `True` to multiple agents if they ask concurrently. This will cause `started` to overshoot. (In this case, if `started=9999` and 2 agents call `try_starting_test`, they can both start and `started` becomes `10001`.)
Then, when one of them finishes and `ended` reaches `max_runs`, it stops the ensemble. As a result, when the other agent finishes its test, it won't find the ensemble and won't record the result.
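To make the suspected interleaving concrete, here is a minimal sketch in Python. It is not the project's actual code: the `Ensemble` class, `racy_check`, `racy_commit`, and `try_starting_test_atomic` are hypothetical names, and the counters live in memory rather than in the database. It only shows how a check and an increment done as separate steps let `started` go past `max_runs`, and how doing both in one atomic step avoids that.

```python
import threading


class Ensemble:
    """Hypothetical in-memory stand-in for an ensemble's counters."""

    def __init__(self, started, max_runs):
        self.started = started
        self.max_runs = max_runs
        self._lock = threading.Lock()

    def racy_check(self):
        # First half of a non-atomic check-then-increment: read only.
        return self.started < self.max_runs

    def racy_commit(self):
        # Second half: increment without re-checking the limit.
        self.started += 1

    def try_starting_test_atomic(self):
        # Check and increment under one lock, so only one caller
        # can claim the last remaining run.
        with self._lock:
            if self.started >= self.max_runs:
                return False
            self.started += 1
            return True


# Interleaving described above: both agents check before either increments.
racy = Ensemble(started=9999, max_runs=10000)
agent_a_ok = racy.racy_check()
agent_b_ok = racy.racy_check()
if agent_a_ok:
    racy.racy_commit()
if agent_b_ok:
    racy.racy_commit()
print(racy.started)  # 10001

# With the atomic variant the second agent is refused.
safe = Ensemble(started=9999, max_runs=10000)
results = [safe.try_starting_test_atomic() for _ in range(2)]
print(results, safe.started)  # [True, False] 10000
```

The lock here is only an illustration of "check and increment as one step"; in a database-backed implementation the equivalent would be doing both inside a single conflict-checked transaction.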
It seems to be the case that if a bundle is in an ended state then at the very least tests which are still running and later end with failure will not record their failures or log events into the database. I think tests still running which later end in success might also not be recorded correctly but I'm not sure.
This leads to final job states where
While over-run (starting more than `max-runs` tests) has been significantly reduced, resolving #1, it is still the case that sometimes a few extra tests are launched. This is problematic because if any test is going to end with a Timeout failure, it will take the longest to run and will not complete until long after the first `max-runs` tests have completed with success.

To give a concrete example, with a limit of 10000 runs, if the 500th run is going to run forever and end with Timeout, and the over-run is just 1 test, then the job state will reach
started=10001 pass=10000 ended=10000
and be stopped before the failing test completes, after which the failing test will not be recorded. Running the same correctness package with a larger run limit such as 100000 would expose the failure because the bundle will still be active when the timeout failure occurs so it will be recorded.
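The scenario above can be reproduced with a small simulation. This is a sketch under stated assumptions, not the real scheduler: `Job`, `record`, and `simulate` are hypothetical names, tests are assumed to finish one at a time, and the timing-out test is assumed to report only after 50000 passing tests would have finished. Results that arrive after the job has stopped are counted as `dropped` instead of being recorded.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job state; field names mirror the counters in the issue."""
    max_runs: int
    started: int = 0
    passed: int = 0
    failed: int = 0
    ended: int = 0
    active: bool = True
    dropped: int = 0

    def record(self, ok):
        # Results arriving after the job has stopped are lost.
        if not self.active:
            self.dropped += 1
            return
        self.ended += 1
        if ok:
            self.passed += 1
        else:
            self.failed += 1
        # The job stops as soon as `ended` reaches the limit.
        if self.ended >= self.max_runs:
            self.active = False


def simulate(max_runs, over_run, timeout_reports_after):
    """Start max_runs + over_run tests; exactly one of them times out and
    reports only after `timeout_reports_after` passing tests have finished."""
    job = Job(max_runs=max_runs, started=max_runs + over_run)
    passing = job.started - 1
    for i in range(passing):
        if i == timeout_reports_after:
            job.record(ok=False)
        job.record(ok=True)
    if timeout_reports_after >= passing:
        # The timeout outlasts every passing test, so it reports last.
        job.record(ok=False)
    return job


small = simulate(max_runs=10000, over_run=1, timeout_reports_after=50000)
print(f"started={small.started} pass={small.passed} ended={small.ended}")
# started=10001 pass=10000 ended=10000
print(small.failed, small.dropped)  # 0 1

large = simulate(max_runs=100000, over_run=1, timeout_reports_after=50000)
print(large.failed)  # 1
```

With the smaller limit the only failure is the dropped result, so the job looks fully successful; with the larger limit the same timeout lands while the job is still active and is recorded.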