There are places in the checklist where we're not living up to the monitoring, alerting, and issue-creating standards set by the ZSF. Please create a ticket for each of these and add the zero-silent-failures label.
The gaps include improving tagging, setting up a workflow to ensure the notifications channel is monitored and someone responds when they are looking into an issue, and creating issues from all Datadog alerts.
Jobs related to our submissions are regularly reported among the top sources of silent failures in the weekly status report on this effort. Some of these jobs may automatically trigger backup paths or veteran notifications, in which case they should not be considered silent failures.
@kylesoskin to add details here
Resources