Working on #140. As per that discussion, we need to add monitoring at multiple levels, since there are multiple places where we can run into trouble. @ranchodeluxe already created the schedule-alert-failed-jobs workflow, which effectively addresses the case where a DPS job fails.
Unfortunately, GitHub only sends an email to the originator of the workflow and doesn't have an easy way to add more recipients (e.g., me). We could use the GitHub-Slack integration to send notifications for the workflow to a dedicated Slack channel, but that integration provides no way to filter to only failures. So even when a run does fail, we would still be getting spammed with 23 false positives/day, and the 1 actual problem would be buried in that stream.
So, I set up a simple Slack app and webhook that will send a message to a dedicated channel in our Slack workspace ONLY when the action fails. This PR just adds the logic for calling that webhook. Luckily, we aren't the first team to have this problem, and most of the work has already been done for us.
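For anyone who wants to see the shape of the idea, here is a minimal Python sketch of "notify Slack only on failure". It is an illustration, not the code in this PR: the function names, the `conclusion` string, and the message wording are all hypothetical, and the only thing it assumes about Slack is that an incoming webhook accepts a JSON body with a `text` field.

```python
import json
import urllib.request


def build_failure_payload(workflow: str, run_url: str) -> dict:
    """Build the JSON body for a Slack incoming-webhook message (hypothetical wording)."""
    return {"text": f"Workflow `{workflow}` failed. Run: {run_url}"}


def notify_slack(webhook_url: str, payload: dict, timeout: float = 10.0) -> int:
    """POST the payload to the webhook URL and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def maybe_notify(conclusion: str, webhook_url: str, workflow: str, run_url: str,
                 send=notify_slack) -> bool:
    """Send a message only when the run failed; successful runs stay silent."""
    if conclusion != "failure":
        return False
    send(webhook_url, build_failure_payload(workflow, run_url))
    return True
```

The `send` parameter exists only so the filtering logic can be exercised without making a network call; in practice the webhook URL would come from a repository secret rather than being hard-coded.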