FoundationDB / fdb-joshua

FoundationDB Correctness service
Apache License 2.0

"Max runs" should mean "start this many tests and wait for all to complete or fail". It does not. #1

Closed: sfc-gh-satherton closed this issue 3 years ago

sfc-gh-satherton commented 3 years ago

Currently, a correctness job is stopped when its observed completion count reaches max_runs. This logic means there are usually other tests still running at this time which were launched before the completion count hit the limit. Successes after the job is in a "stopped" state are still tallied later, though I'm not sure if errors are.
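
To make the difference concrete, here is a minimal sketch of the two stop rules. It is a toy model, not joshua's actual scheduler: the `simulate` function, the gate functions, and the fixed pool of `workers` are all hypothetical names introduced for illustration. The only thing taken from the description above is the rule itself: launching is gated on the completion count rather than on the started count.

```python
def gate_on_ended(started, ended, max_runs):
    # Behavior described above: keep launching until enough tests have ENDED.
    return ended < max_runs


def gate_on_started(started, ended, max_runs):
    # Behavior the title asks for: launch exactly max_runs tests, then wait.
    return started < max_runs


def simulate(max_runs, workers, gate):
    """Toy model: `workers` agents each run one test at a time.

    Returns (started, ended) once every in-flight test has drained.
    """
    started = ended = in_flight = 0
    while True:
        # Every idle worker picks up a new test while the gate allows it.
        while in_flight < workers and gate(started, ended, max_runs):
            started += 1
            in_flight += 1
        if in_flight == 0:
            break
        # One in-flight test finishes (pass or fail does not matter here).
        in_flight -= 1
        ended += 1
    return started, ended


if __name__ == "__main__":
    for gate in (gate_on_ended, gate_on_started):
        print(gate.__name__, simulate(max_runs=10, workers=4, gate=gate))
```

In this model, gating on the completion count always overshoots by however many tests were in flight when the limit was reached, so the overrun scales with the size of the worker pool rather than with `max_runs`. Gating on the started count ends with exactly `max_runs` results, and the job's stats are final once `ended == started`.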

Automation around joshua usually assumes that a stopped job's result stats are final. A common pattern is to use joshua tail to detect when the bundle has completed and then query and report stats for the job id at that moment, in which case the reported counts do not include the results of tests that are still running.

Running joshua list --stopped later will provide more up-to-date results, which include the counts of at least the successful tests that completed after the bundle was placed in a stopped state. I am not certain whether errors are tallied if the job is already in a stopped state when the error is detected.

Overrun is as much as 15% for a 10k run limit, or as much as 90% for a 1k run limit. Some recent examples:

20210226-235629-tclinkenbeard-c49be74f2fda67b6 ended=11489 max_runs=10000
20210227-043444-nightly_valgrind_master-7662b214e378d1a3 ended=1408 max_runs=1000
20210227-043450-nightly_valgrind_release-6.3-19de4482b07a36ad ended=1285 max_runs=1000
20210228-031514-nightly_valgrind_master-7662b214e378d1a3 ended=1856 max_runs=1000
20210301-041319-nightly_valgrind_master-ed06c77214d969f8 ended=1902 max_runs=1000
20210301-193115-anoyes-82f6ff9c9a65e4ca ended=10754 max_runs=10000
20210301-193149-nwijetunga-f521b0a3e36b70e9 ended=10751 max_runs=10000
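
For reference, the overrun figures can be recomputed from the listing above with a short script. This is only a sketch: the `overrun_percent` helper is a hypothetical name, and it assumes each line carries `ended=` and `max_runs=` fields in the `key=value` form shown.

```python
import re

JOBS = """\
20210226-235629-tclinkenbeard-c49be74f2fda67b6 ended=11489 max_runs=10000
20210227-043444-nightly_valgrind_master-7662b214e378d1a3 ended=1408 max_runs=1000
20210227-043450-nightly_valgrind_release-6.3-19de4482b07a36ad ended=1285 max_runs=1000
20210228-031514-nightly_valgrind_master-7662b214e378d1a3 ended=1856 max_runs=1000
20210301-041319-nightly_valgrind_master-ed06c77214d969f8 ended=1902 max_runs=1000
20210301-193115-anoyes-82f6ff9c9a65e4ca ended=10754 max_runs=10000
20210301-193149-nwijetunga-f521b0a3e36b70e9 ended=10751 max_runs=10000
"""


def overrun_percent(line):
    """Return how far `ended` exceeds `max_runs`, as a percentage of max_runs."""
    fields = dict(re.findall(r"(\w+)=(\S+)", line))
    ended = int(fields["ended"])
    max_runs = int(fields["max_runs"])
    return 100.0 * (ended - max_runs) / max_runs


if __name__ == "__main__":
    for line in JOBS.splitlines():
        job_id = line.split()[0]
        print(f"{overrun_percent(line):6.2f}%  {job_id}")
```

The worst cases in this sample are 14.89% over for a 10000-run limit (ended=11489) and 90.2% over for a 1000-run limit (ended=1902).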

sfc-gh-almiller commented 3 years ago

20210311-095708-almiller-c56e059102618f3e compressed=True data_size=166925824 fail_fast=10 max_runs=1 priority=100 remaining=not_started runtime=0:02:43 sanity=False started=1075 submitted=20210311-095708 timeout=5400 username=almiller

max_runs=1 started=1075

This is a wee bit overkill