cockroachdb / cockroach

roachtest: better reporting of infrastructure flakes #121696

Open renatolabs opened 6 months ago

renatolabs commented 6 months ago

We have made several improvements to infrastructure flake detection: issues like SSH flakes, timeouts, DNS issues, VM preemptions, public cloud errors, or general blips rarely bubble up to teams. Instead, an issue is still created (or a comment is added to an existing issue), and that issue is owned by us (examples: cluster creation, ssh_problem, dns_problem, etc.).

The reporting still relies on creating an issue on GitHub just like regular failures. This approach has a few downsides:

A better approach to reporting these errors would be to expose them in a way that can be consumed by Snowflake. This would eliminate the daily noise, allow us to analyze behaviour over time, and perhaps even set up alerts if something is outside "expected" ratios.

Jira issue: CRDB-37456

blathers-crl[bot] commented 6 months ago

cc @cockroachdb/test-eng