Visibility into failures while working remotely

clarafu commented 4 years ago

Things we need to monitor

[ ] Failing builds
[ ] Hush-house web/workers/db (Datadog?)
[ ] CI web/workers/db (Datadog?)
[ ] Tracing nodes
[ ] SLIs

One idea I had was a Slack bot that notifies us when any of these are "down", for example when the hush-house nodes are not running or the SLIs are failing. Currently, our Concourse pipelines send a Slack message every time a build fails. This gets quite chatty, and it is often hard to tie together multiple related failures when each one has its own Slack message. If we use "downtime" as the measurement rather than each individual build failure, that would condense the messages sent to the channel.
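
A rough sketch of the "downtime" idea, assuming build results are available as (job, finish time, success) records. The `BuildResult` type and the message wording are made up for illustration and are not part of any existing bot. The point is that only state changes produce a message, so a run of consecutive failures yields one "down" and one "recovered" notice.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class BuildResult:
    # Hypothetical record shape; a real bot would build these from Concourse build data.
    job: str
    finished_at: int  # Unix timestamp in seconds
    succeeded: bool


def downtime_notifications(results: Iterable[BuildResult]) -> List[str]:
    """Return one message per up->down or down->up transition of each job."""
    down_since: Dict[str, int] = {}  # job name -> time of the first failure in the current outage
    messages: List[str] = []
    for result in sorted(results, key=lambda r: r.finished_at):
        is_down = result.job in down_since
        if not result.succeeded and not is_down:
            # First failure after a healthy period: announce the outage once.
            down_since[result.job] = result.finished_at
            messages.append(f"{result.job} is down")
        elif result.succeeded and is_down:
            # First success after an outage: announce recovery with the outage length.
            minutes = (result.finished_at - down_since.pop(result.job)) // 60
            messages.append(f"{result.job} recovered after {minutes} min")
        # Repeated failures (or repeated successes) produce no message.
    return messages
```

With three consecutive failures of one job followed by a success, this produces two messages instead of three failure alerts.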

Another idea: rather than using Slack messages to notify us when something is down, we could use CCMenu (http://ccmenu.org/). Every machine would have it installed, and it would show whenever something is failing or not up. It already integrates with Concourse, so our Concourse jobs can feed the CCMenu app directly.
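
For context, CCMenu reads the cctray XML feed format, where each job is a `Project` element with a `lastBuildStatus` attribute. Below is a minimal sketch of picking the failing jobs out of such a feed, assuming we have the XML text in hand. The sample feed and its job names are made up, and fetching the feed is left out on purpose.

```python
import xml.etree.ElementTree as ET
from typing import List

# Made-up sample in the cctray format; job names are illustrative only.
SAMPLE_FEED = """
<Projects>
  <Project name="main/unit" activity="Sleeping" lastBuildStatus="Success" webUrl="" />
  <Project name="main/testflight" activity="Sleeping" lastBuildStatus="Failure" webUrl="" />
  <Project name="main/watsjs" activity="Building" lastBuildStatus="Exception" webUrl="" />
</Projects>
"""


def failing_projects(feed_xml: str) -> List[str]:
    """Return names of projects whose last build status is Failure or Exception."""
    root = ET.fromstring(feed_xml)
    return [
        project.get("name", "")
        for project in root.iter("Project")
        if project.get("lastBuildStatus") in ("Failure", "Exception")
    ]


if __name__ == "__main__":
    print(failing_projects(SAMPLE_FEED))
```

The same list could drive a single condensed Slack message if we end up combining both ideas.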

One thing I wanted to note is that it would be nice to have a separate Slack channel to keep context for these interrupts. I was thinking of Slack so that each discussion about a failure can have its own thread. I also prefer Slack over GitHub issues because creating and managing issues is a lot more work than sending a message. Plus, I don't think we will need to keep the context of solved failures.

EDIT: We would also like a weekly report of the current state of everything we want to monitor. It might include things like the uptime percentage of all the VM nodes (hush-house, CI, tracing), the SLI success rate, the success rate of each job within our Concourse pipelines, and the number of flakes per test.
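
To make the report contents concrete, here is a small sketch of the arithmetic, assuming the raw data has already been collected somehow. The input shapes (evenly spaced health-check booleans, `(job, succeeded)` pairs, a list of flaked test names) are assumptions for illustration; where the data actually comes from is still open.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple


def uptime_percent(checks: List[bool]) -> Optional[float]:
    """Percentage of evenly spaced health checks that passed, or None without data."""
    if not checks:
        return None
    return round(100 * sum(checks) / len(checks), 1)


def success_rates(builds: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Per-job success rate in percent, from (job name, succeeded) pairs."""
    totals: Counter = Counter()
    passes: Counter = Counter()
    for job, succeeded in builds:
        totals[job] += 1
        if succeeded:
            passes[job] += 1
    return {job: round(100 * passes[job] / totals[job], 1) for job in totals}


def flakes_per_test(flaked_tests: Iterable[str]) -> Dict[str, int]:
    """Count how many times each test name was reported as flaky."""
    return dict(Counter(flaked_tests))
```

For example, three passing checks out of four give 75.0 percent uptime, and a job with two successes in three builds has a 66.7 percent success rate.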

clarafu commented 4 years ago

@jamieklassen mentioned having flake attempts be part of a weekly report to our Slack channel, which I think is a really good idea. The weekly report can include things like the percentage of downtime for jobs that had a lot of downtime within the week, and the tests that flaked (which should be reported in the build output). I'm not sure yet exactly how we can fetch that data, but I think it is important to monitor because we don't want to ignore any test failures.

jamieklassen commented 4 years ago

Here's a bit of a prototype for tracking flakes in a metrics store (hopefully it works equally well with Grafana or Datadog or Wavefront or whatever, thanks to OpenTelemetry... 🤞): https://github.com/jamieklassen/flake-reporting

github-actions[bot] commented 4 years ago

This issue has had no activity for a while. It has been labeled stale, and will be closed in one week.

github-actions[bot] commented 3 years ago

This issue has had no activity for a while. It has been labeled stale, and will be closed in one week.

concourse / ci

Visibility into failures while working remotely #366