BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team (this includes work for: 1. Platform Experience, 2. Developer Experience, 3. Platform Operations/OCP 3).

Develop a more comprehensive testing plan for future OpenShift upgrades #2276

Closed wmhutchison closed 2 years ago

wmhutchison commented 2 years ago

Describe the issue
The OpenShift 4.8 upgrade revealed that as additional software/components are installed in both the LAB and PROD clusters, we need to ensure that all aspects of these components are being reliably tested. The likely end result going forward is expanding the existing EPIC tickets Platform Ops opens for upgrades to include additional components, and assigning those additional tickets to whoever is responsible for the apps/components in question.

Additional context
Since CCM is the main mechanism for applying new services/components, it would be good to have documentation within CCM that hooks into other docs, so that as other parties add items to CCM, Platform Ops is made aware of these new items and does not have to re-audit CCM each time OpenShift is upgraded.

Definition of done

wmhutchison commented 2 years ago

[1:11 p.m.] Barre, Steven: Suggestion for the testing: check https://console.apps.klab.devops.gov.bc.ca/monitoring/dashboards/grafana-dashboard-api-performance (Kamloops LAB cluster) and see if any of the graphs show a significant change between before and after the upgrade.
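The panels on that dashboard are built from the cluster monitoring stack, so a before/after comparison can also be captured from the command line. A minimal sketch, assuming the standard OCP 4 thanos-querier route in openshift-monitoring, a logged-in user allowed to read cluster metrics, and the stock apiserver request-duration histogram:

```bash
#!/usr/bin/env bash
# Sketch: capture API server p99 read latency over the last hour, for pasting
# into the upgrade ticket before and after the upgrade.
set -euo pipefail

TOKEN="$(oc whoami -t)"
THANOS="https://$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')"

# p99 latency for read-only API verbs; roughly what the api-performance
# dashboard panels show.
QUERY='histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{verb=~"GET|LIST"}[1h])) by (le))'

curl -sk -H "Authorization: Bearer ${TOKEN}" \
  --data-urlencode "query=${QUERY}" \
  "${THANOS}/api/v1/query" | jq -r '.data.result[].value[1]'
```

Running the same query before the upgrade and again afterwards yields a single number per cluster that is easy to compare across upgrades.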

StevenBarre commented 2 years ago

Could we also get someone to create a reference app with full Sysdig/uptime monitoring? Verify the "prod" app survives the upgrade with no intervention and stays up, then post-upgrade roll through a new build and deploy to ensure all those steps still work. The sample app should make use of all the platform services as well: Vault, Artifactory, VPA, Sysdig, RHEL entitlement builds, Patroni, CrunchyDB, etc.
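For the uptime half of that reference app, even a throwaway probe run from a workstation during the maintenance window would show whether the "prod" canary ever dropped. A minimal sketch with a hypothetical route URL; the real canary would presumably be wired into Sysdig/uptime monitoring rather than a shell loop:

```bash
#!/usr/bin/env bash
# Poll the reference app's health endpoint and log any non-200 responses
# with a timestamp, so outages during the upgrade window are visible.
APP_URL="https://reference-app.apps.klab.devops.gov.bc.ca/health"   # hypothetical route
INTERVAL=15

while true; do
  code="$(curl -sk -o /dev/null -w '%{http_code}' --max-time 5 "${APP_URL}" || echo "000")"
  if [ "${code}" != "200" ]; then
    echo "$(date -Is) DOWN (HTTP ${code})"
  else
    echo "$(date -Is) ok"
  fi
  sleep "${INTERVAL}"
done
```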

wmhutchison commented 2 years ago

At present the upgrade methodology goes like this.

  1. The Platform Ops team communicates their intent to start work on an OpenShift upgrade, and proceeds to do so on the LAB clusters.
  2. QA of the LAB clusters is handled solely by Platform Ops, and is mainly driven by whether or not any AlertManager or Nagios alerts are being raised (see the health-check sketch after this list). No other stakeholders are notified or brought into the mix to test other portions of the cluster.
  3. With a passing grade in LAB, Platform Ops creates suitable RFCs for the PROD clusters, obtains approval for said RFCs, sends out comms in advance, and then executes the upgrades as they come up.
  4. QA of the PROD clusters follows the same pattern as the LAB environments.
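A rough sketch of what that "no alerts, nothing degraded" QA pass could look like as a single script, so it can be run the same way in LAB and PROD; the resource names are the standard OCP 4 ones, and the alert query assumes the thanos-querier route mentioned earlier:

```bash
#!/usr/bin/env bash
# Post-upgrade health check: cluster version, operators, nodes, stuck pods,
# and any alerts currently firing (excluding the always-on Watchdog).
set -euo pipefail

echo "== Cluster version / upgrade status =="
oc get clusterversion

echo "== Cluster operators not Available/settled =="
oc get clusteroperators | awk 'NR==1 || $3!="True" || $4!="False" || $5!="False"'

echo "== Nodes not Ready =="
oc get nodes | awk 'NR==1 || $2!="Ready"'

echo "== Pods not Running/Completed =="
oc get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

echo "== Firing alerts =="
TOKEN="$(oc whoami -t)"
THANOS="https://$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')"
curl -sk -H "Authorization: Bearer ${TOKEN}" \
  --data-urlencode 'query=ALERTS{alertstate="firing",alertname!="Watchdog"}' \
  "${THANOS}/api/v1/query" | jq -r '.data.result[].metric.alertname' | sort | uniq -c
```
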
wmhutchison commented 2 years ago

Regardless of whether we're talking about LAB or PROD, we want to strengthen the overall communications and testing processes as follows.

  1. Notify internally in advance of an upcoming OpenShift upgrade and invite specific stakeholders from Platform Services to test their respective components after the upgrade.
  2. Platform Ops executes the upgrade as per the standard process.
  3. In addition to ensuring no errors, Platform Operations may opt to run an automated "build an app on a specific node" check, a bash script we had in OCP3 that should work as-is or with some tweaks for OCP4 (a rough sketch follows this list). That script alone QA's things like "do builds and uploads of new images to the registry work?", since things like "do pods run?" are often QA'd by existing pods.
  4. We invite any additional stakeholders to test, and depending on the complexity involved, completion might be a check-box on an existing ZenHub ticket, or for more complex testing, they get their own ticket associated with an EPIC involving the node in question.
  5. For all of the tests that are automated (i.e. canary apps that just "work" when all is well), we use those when possible. For anything critical, we do recommend that for the first few rounds there also be manual tests, so we're not overly reliant on automation that may not be as well tested as we'd like.
  6. Once all stakeholders for Platform Ops/Services approve their respective QA, the cluster in question is considered complete.
  7. All of the above will be fully documented, both for canary app testing and for manual checks, so that anyone with the required levels of access can be readily dropped into place to run or review the checks; there is no single point of failure here.
  8. While there may be multiple people involved in the maintenance and QA of components, the coordinator or Platform Ops person who initiated the upgrade should remain the single point of contact throughout the RFC for PROD, so that folks in RC don't get confused. Service stakeholders can elect to follow up with more specialized support via the ops channel or the respective RC channel for their app.
  9. For a major upgrade (e.g. 4.8 to 4.9), a run of Kraken post-upgrade in LAB is advised, with results compared against previous runs.
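The "build an app on a specific node" smoke test from item 3 could look roughly like the sketch below on OCP 4. This is not the original OCP3 script; the project name, sample repo, and node pinning via the kubernetes.io/hostname label are placeholders, and it assumes the stock nodejs sample imagestream is available:

```bash
#!/usr/bin/env bash
# Sketch: build a sample app, push it to the internal registry, and force the
# resulting pod onto a chosen node, then clean up.
set -euo pipefail

NODE="${1:?usage: $0 <node-name>}"
PROJECT="upgrade-smoke-$(date +%s)"

oc new-project "${PROJECT}"

# S2I build of a trivial app; exercises builds and image pushes to the registry.
oc new-app nodejs~https://github.com/sclorg/nodejs-ex.git --name=smoke
oc start-build smoke --follow --wait

# Pin the deployment to the node under test and wait for the rollout.
oc patch deployment/smoke -p \
  "{\"spec\":{\"template\":{\"spec\":{\"nodeSelector\":{\"kubernetes.io/hostname\":\"${NODE}\"}}}}}"
oc rollout status deployment/smoke --timeout=5m

# Confirm the pod actually landed on the requested node, then tear everything down.
oc get pods -o wide
oc delete project "${PROJECT}"
```
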
wmhutchison commented 2 years ago

TODO - update this EPIC and move the content into a sub-task.

wmhutchison commented 2 years ago

Closing off this EPIC/Sprint Goal - the work desired involves documentation, development and implementation of Canary applications which will automate most of this on an ongoing basis. Stay tuned for the new Sprint Goal involving Canaries.