Currently, roachtest will only check if a VM was preempted after a test failure. If at least one VM is found to be preempted, the test is marked as a flake automatically.
Timeouts behave differently: they are always reported to the team that owns the test, even in the event of preemptions. Being able to bail quickly after a test failure (i.e., context cancelation) is important to avoid consuming resources for a long time, especially for tests that need a large cluster and/or have a long timeout.
It's also entirely possible that the test does not error out when a VM disappears. In those cases, the context never gets canceled and a test timeout is the "reasonable" outcome.
One improvement to this situation is having roachtest proactively poll cloud APIs to learn about preemptions as soon as they happen. When a preemption is detected, the context can be canceled and the test is reliably marked as a flake. Everyone wins.
Jira issue: CRDB-42692