BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team. This includes work for: 1. Platform Experience, 2. Developer Experience, 3. Platform Operations/OCP 3.
Apache License 2.0

Document and/or Automate all Aspects of Checking an Openshift Cluster and Supporting Services #2965

Closed wmhutchison closed 11 months ago

wmhutchison commented 2 years ago

Describe the issue

This is a top-level EPIC that acts as an umbrella for reliably tracking every component of a vendor-supplied OpenShift cluster during maintenance activities. It also ensures that, where automation (Canary applications and other checks) has not yet been created to confirm these additional services, application owners are engaged to QA their services once maintenance is completed.

EPICs nested under this EPIC will be promoted to Sprint Goals as appropriate. Tickets under each EPIC will be created for Platform Operations, and for targeted individuals from Platform Services as needed, so that every involved stakeholder has a separate ticket that accurately reflects their required time Estimate, with Blocker dependencies set up where necessary.

Definition of Done

wmhutchison commented 2 years ago

Apart from DB Canary apps, we need a recurring (manual or automated) method of performing master fail-overs of the DB technology to confirm that fail-over works. Canary monitoring will then detect any issues or failures that result.

For now, this will be documented as a requirement during regular OCP maintenance. We definitely want it automated, though it should probably run during business hours.

wmhutchison commented 2 years ago

Fail-over tests should be as "mean" as possible to the Patroni cluster in question. Both a stand-alone test and integration into Kraken should be considered.
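
As a rough illustration of what an automated pass/fail check for such a fail-over test might look like, here is a minimal Python sketch. It assumes the test captures the JSON returned by Patroni's REST `/cluster` endpoint before and after the disruption. The function names (`find_leader`, `failover_succeeded`) and the set of states treated as healthy are hypothetical choices for this example, not something taken from this ticket or from the team's tooling.

```python
from typing import Optional

# Assumption: member states treated as healthy. Depending on the Patroni
# version, replicas may report "running" or "streaming".
HEALTHY_STATES = {"running", "streaming"}
# Older Patroni releases report the leader role as "master".
LEADER_ROLES = {"leader", "master"}


def find_leader(cluster: dict) -> Optional[str]:
    """Return the name of the leader in a /cluster payload, or None."""
    for member in cluster.get("members", []):
        if member.get("role") in LEADER_ROLES:
            return member.get("name")
    return None


def failover_succeeded(before: dict, after: dict) -> bool:
    """True if leadership moved to another member and all members are healthy."""
    old_leader = find_leader(before)
    new_leader = find_leader(after)
    if old_leader is None or new_leader is None:
        return False  # no leader seen before or after the test
    if new_leader == old_leader:
        return False  # leadership never moved
    return all(
        member.get("state") in HEALTHY_STATES
        for member in after.get("members", [])
    )
```

With this check, a run in which the old leader has not yet rejoined in a healthy state counts as a failure, which is deliberately strict. A real test would still need to decide how long to wait for recovery before taking the "after" snapshot.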

wmhutchison commented 11 months ago

Closing this EPIC as no longer required. Said monitoring is now done via n8n which is managed by the Platform Services Team.