Longer simulations keep failing because the manager and workers disconnect at some point, after which goose reports no metrics, even though the Prometheus metrics show that the run would still have passed. We now rely on goose for the run time but keep a backup time measurement (it will be slightly longer and therefore won't bias us toward a better result), and we use the Prometheus metrics exclusively.
I also deleted the old recon keys-only scenario, as it no longer applies and is just noise.
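As a rough illustration of the run-time fallback described above, here is a minimal sketch. The function names and the rate calculation are hypothetical and are not taken from this change; it only assumes that the goose-reported duration may be missing after a disconnect and that a locally measured wall-clock duration is available as a backup.

```rust
use std::time::Duration;

/// Hypothetical helper: choose the duration used to evaluate the run.
/// `goose_duration` is `None` when goose reported no metrics
/// (for example after a manager/worker disconnect).
/// `wall_clock` is the backup measurement taken around the whole run,
/// so it is expected to be slightly longer than goose's own figure.
fn effective_duration(goose_duration: Option<Duration>, wall_clock: Duration) -> Duration {
    goose_duration.unwrap_or(wall_clock)
}

/// Hypothetical helper: turn a Prometheus counter total into a rate.
/// A longer duration yields a lower rate, so falling back to the
/// wall-clock time can only make the result look worse, never better.
fn requests_per_second(total_requests: u64, duration: Duration) -> f64 {
    let secs = duration.as_secs_f64();
    if secs == 0.0 {
        // Avoid dividing by zero for a run that never started.
        return 0.0;
    }
    total_requests as f64 / secs
}
```

With a goose duration present, `effective_duration` returns it unchanged; without one, the slightly longer backup is used and the computed rate drops accordingly.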
Results from a 3-minute local test: