This repository is used to track all work for the BCGov Platform Services Team (This includes work for: 1. Platform Experience, 2. Developer Experience 3. Platform Operations/OCP 3)
The sysdig alerts promQL for the Vault test app have been updated to better capture a failure of either the patroni stateful set or test app deployment.
The downtime for triggering an alert has been shortened - 15 minutes for prod and 30 minutes for lab
Because a failure of the Vault service would not trigger an immediate failure in the test app (the sidecar container restarts, but the pod remains "Ready"), a CronJob has been added to periodically restart the test app pods. A failure to connect to Vault causes the init container to prevent the normal startup of the pod, thus triggering the alert.
Updates have been applied to all clusters and tested in the labs.
Describe the issue We only receive RC notification when vault testing pod fail in gold and golddr. Let's have the same setup in all the clusters.
Definition of done