We have made several improvements to infrastructure flake detection: issues like SSH flakes, timeouts, DNS issues, VM preemptions, public cloud errors, or general blips rarely bubble up to teams. Instead, an issue is still created (or a comment is added to an existing issue), but that issue is owned by us (examples: cluster creation, ssh_problem, dns_problem, etc.).
The reporting still relies on creating an issue on GitHub just like regular failures. This approach has a few downsides:
- Failures due to infrastructure flakes are marked as test failures on TeamCity. Our existing data pipeline ingestion (data consumed by other teams and execs) cannot distinguish legitimate failures from infrastructure flakes, leading to a distorted view of the daily failure ratio in these nightly runs.
- It's very noisy for Test Eng: every night, Test Eng gets dozens of emails due to these flakes (primarily VM preemptions).
- It's hard to analyze the frequency of these errors or monitor patterns: we would need to analyze GitHub data "manually", which is not ideal.
A better approach to reporting these errors would be to expose them in a way that can be consumed by Snowflake. This would eliminate the daily noise, allow us to analyze behaviour over time, and perhaps even set up alerts if something is outside "expected" ratios.
Jira issue: CRDB-37456