cockroachdb / cockroach

roachtest: poll for preemptions instead of querying after test failures #131759

Open renatolabs opened 2 days ago

renatolabs commented 2 days ago

Currently, roachtest checks whether a VM was preempted only after a test failure. If at least one VM is found to be preempted, the test is automatically marked as a flake.

However, timeouts behave differently: they are always reported to the team that owns the test, even in the event of preemptions. Being able to bail quickly after a test failure (i.e., context cancelation) is important to avoid wastefully consuming resources for a long time (especially important for tests that need a large cluster and/or have a long timeout).
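To make the behavior described so far concrete, here is a minimal sketch of the classification logic. It is not roachtest's actual code: the `VM` and `Result` types and the `classify` function are hypothetical names chosen for illustration.

```go
package main

// VM is a hypothetical stand-in for a cluster node (not a roachtest type).
type VM struct {
	Name      string
	Preempted bool
}

// Result is a hypothetical classification of a finished test run.
type Result int

const (
	Passed          Result = iota // test completed without error
	ReportedToOwner               // failure or timeout routed to the owning team
	Flake                         // infrastructure flake caused by a preemption
)

// classify mirrors the behavior described in this issue: preemptions are
// only consulted after a test failure, while timeouts are always reported
// to the owning team, even if a VM was preempted.
func classify(failed, timedOut bool, vms []VM) Result {
	switch {
	case timedOut:
		return ReportedToOwner
	case !failed:
		return Passed
	}
	for _, vm := range vms {
		if vm.Preempted {
			return Flake
		}
	}
	return ReportedToOwner
}
```

Under these assumptions, a timed-out test on a cluster with a preempted VM still lands in `ReportedToOwner`, which is the gap this issue is about.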

That said, it's entirely possible that a test does not error out when a VM disappears. In those cases, the context never gets canceled, and the test timing out is a "reasonable" consequence of the preemption.

One improvement to this situation is having roachtest proactively poll cloud APIs to learn about preemptions as soon as they happen. In those cases, the context can be canceled and the test is reliably marked as a flake. Everyone wins.

Jira issue: CRDB-42692

blathers-crl[bot] commented 2 days ago

cc @cockroachdb/test-eng