FoundationDB / fdb-joshua

FoundationDB Correctness service
Apache License 2.0
28 stars 25 forks source link

Once the fail_fast or runs limit is hit, any tests still running will not log or count their results. #36

Open sfc-gh-satherton opened 3 years ago

sfc-gh-satherton commented 3 years ago

It seems to be the case that if a bundle is in an ended state then at the very least tests which are still running and later end with failure will not record their failures or log events into the database. I think tests still running which later end in success might also not be recorded correctly but I'm not sure.

This leads to final job states where

started > ended
started > pass + fail
ended != pass + fail

While over-run (starting more than max-runs tests) been significantly reduced, resolving #1, it is still the case that sometimes a few extra tests are launched. This is problematic because if any test is going to end with a Timeout failure, it will take the longest to run and will not complete until long after the first max-runs tests have completed with success.

To give a concrete example, with a limit of 10000 runs if the 500th run is going to run forever and end with Timeout, and the over-run is just 1 test, then the job state will reach started=10001 pass=10000 ended=10000 and be stopped before the failing test completes, after which the failing test will not be recorded. Running the same correctness package with a larger run limit such as 100000 would expose the failure because the bundle will still be active when the timeout failure occurs so it will be recorded.

sfc-gh-kmakino commented 3 years ago

I'm not sure if this happens strictly when there are time-outed tests. I think this is what's happening: try_starting_test can return True to multiple agents if they ask concurrently. This will result started to overshoot. (In this case, if started=9999 and 2 agents calls try_starting_test, they both can start and started becomes 10001. Then, when one of them finishes and ended reaches max_runs, it stops the ensemble. This will result that when the other agent finishes its test, it won't find the ensemble and won't record the result.