cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.11k stars 3.81k forks source link

roachtest: flake detection does not work when used with `require` API #131094

Closed renatolabs closed 2 weeks ago

renatolabs commented 1 month ago

Roachtest has automatic failure detection that relies on annotated errors to identify the types of flake, which team to assign the issue to, whether to report it or not, etc. See github.go for more details.

However, this does not work when the test uses the require API to ensure the absence of errors. For instance, if test code runs something like the following:

require.NoError(t, c.SomeClusterFunction())

then, if SomeClusterFunction fails due to a flake (registry.ErrorWithOwner where InfraFlake is true), the failure will not be handled as a flake. This has to do with how require reports errors and our implementation not being able to see the underlying error when reporting the failure.

I'm pretty sure there should be a way to fix this and support require in roachtests, but this will require (pun intended) some investigation.

Jira issue: CRDB-42369

blathers-crl[bot] commented 1 month ago

cc @cockroachdb/test-eng

srosenberg commented 1 month ago

The issue is essentially this Sprintf [1], so we end up losing the err object. Short of instrumenting require, that leaves the only option of matching on strings of the form "TRANSIENT_ERROR(ssh_problem): exit status ..." in order to recover the err object expected by failuresMatchingError.

[1] https://github.com/stretchr/testify/blob/c4b8421a1fb1a1eed787101978a3bcf7ae8a5d0f/assert/assertions.go#L1584