BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team. (This includes work for: 1. Platform Experience, 2. Developer Experience, 3. Platform Operations/OCP 3.)
Apache License 2.0

Create suitable workflow for how testing should proceed after maintenance activities #2523

Closed wmhutchison closed 2 months ago

wmhutchison commented 2 years ago

Describe the issue

We need to create a suitable framework that can be re-used for maintenance activities and that ensures all stakeholders of the Openshift environments properly review and evaluate their respective components before, during and after each maintenance activity. This will leverage automation as needed, with manual checks to supplement where appropriate or still needed.

Additional context (coming soon)

Definition of done

wmhutchison commented 2 years ago

At present the upgrade methodology goes like this.

  1. Platform Ops team communicates their intent to start work on an Openshift upgrade, and proceeds to do so on the LAB clusters.
  2. QA of the LAB clusters is handled solely by Platform Ops, and is mainly driven by whether any new AlertManager or Nagios alerts are being generated. No other stakeholders are notified or brought into the mix to test other portions of the cluster.
  3. With a passing grade in LAB, Platform Ops creates suitable RFCs for the PROD clusters, obtains approval for said RFCs, sends out comms in advance, and then executes them as they come up.
  4. QA of the PROD clusters follows suit with the LAB environments.

wmhutchison commented 2 years ago

Regardless of whether we're talking about LAB or PROD, we want to strengthen the overall communications and testing processes as follows.

  1. Notify internally in advance of an upcoming Openshift upgrade and invite specific stakeholders from Platform Services to test their respective components after the upgrade.
  2. Platform Ops executes the upgrade as per standard.
  3. In addition to ensuring no errors, Platform Operations may opt to run an automated "build an app on a specific node" check. This was a bash script we had in OCP3 and should work as-is or with some tweaks for OCP4 (see the first sketch after this list). That script alone QAs things like "do builds and uploads of new images to the registry work?", since things like "do pods run?" are often already QA'd by existing pods.
  4. We invite any additional stakeholders to test. Depending on the complexity involved, completion might be a check-box on an existing ZenHub ticket, or, for more complex testing, a separate ticket associated with an EPIC involving the node in question.
  5. For all of the tests which are automated (i.e. canary apps which just "work" when all is well), we use those when possible (see the second sketch after this list). For anything critical, we do recommend manual tests for the first few rounds so we're not overly reliant on automation that may not be as well tested as we'd like.
  6. Once all stakeholders for Platform Ops/Services approve their respective QA, the cluster in question is considered Complete.
  7. All of the above will be fully documented, both for canary app testing and for manual checks, so that anyone with the required level of access can be readily dropped into place to run or review the checks; there is then no single point of failure for this.
  8. While there may be multiple people involved in the maintenance and QA of components, the coordinator or Platform Ops person who initiated the upgrade should remain the single point of contact throughout the RFC for PROD so that folks in RC don't get confused. Service stakeholders can elect to follow up with more specialized support via the ops channel or the respective RC channel for their app.
  9. For a major upgrade (e.g. 4.8 to 4.9), a run of Kraken post-upgrade in LAB is advised; compare the results against previous runs.
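As a first sketch for step 3, the check below uses the oc CLI to run an S2I build from a public sample repo, push the image to the internal registry, and schedule the resulting pod on a named node. It is an illustrative rewrite under assumptions (a logged-in oc session with project-creation rights, the sclorg/nodejs-ex sample repo, a Deployment rather than a DeploymentConfig), not the original OCP3 script.

```bash
#!/usr/bin/env bash
# Illustrative post-upgrade smoke test (not the original OCP3 script).
# Assumes: oc is logged in with rights to create projects, and $1 names a
# schedulable worker node in the cluster under test.
set -euo pipefail

TEST_NODE="${1:?usage: $0 <node-name>}"
PROJECT="upgrade-smoketest-$(date +%s)"    # throwaway namespace for the check

oc new-project "${PROJECT}" >/dev/null

# S2I build from a public sample repo: exercises "do builds work?" and
# "does the push to the internal registry work?".
oc new-app nodejs~https://github.com/sclorg/nodejs-ex --name smoketest -n "${PROJECT}"
sleep 15                                   # give the first build a moment to be created
oc logs -f bc/smoketest -n "${PROJECT}"

# Pin the pod to the node under test, then wait for the rollout.
# (Newer oc versions create a Deployment; older ones create a DeploymentConfig,
# so adjust the resource type if needed.)
oc patch deployment/smoketest -n "${PROJECT}" --type merge \
  -p "{\"spec\":{\"template\":{\"spec\":{\"nodeSelector\":{\"kubernetes.io/hostname\":\"${TEST_NODE}\"}}}}}"
oc rollout status deployment/smoketest -n "${PROJECT}" --timeout=5m

echo "Build, registry push, and scheduling on ${TEST_NODE} all succeeded."
oc delete project "${PROJECT}"
```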
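As a second sketch for step 5, the canary checks could be as simple as probing each canary app's route after the upgrade and failing if any of them do not answer with HTTP 200. The route URLs below are hypothetical placeholders for whatever canary apps Platform Services actually deploys.

```bash
#!/usr/bin/env bash
# Illustrative canary-route check; the URLs are placeholders, not real routes.
set -euo pipefail

CANARY_ROUTES=(
  "https://canary-one.apps.example-cluster.example.ca/health"
  "https://canary-two.apps.example-cluster.example.ca/health"
)

failed=0
for url in "${CANARY_ROUTES[@]}"; do
  code=$(curl -ks -o /dev/null -w '%{http_code}' --max-time 10 "${url}") || code="000"
  if [[ "${code}" == "200" ]]; then
    echo "OK   ${url}"
  else
    echo "FAIL ${url} (HTTP ${code})"
    failed=1
  fi
done

exit "${failed}"
```

A check like this could run from a maintenance runbook, a cron job, or an N8N workflow, with any failures noted on the RFC before sign-off.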

wmhutchison commented 2 years ago

Moving to Backlog due to me going on vacation. If there's a desire to promote work on this in my absence, it can be moved out of Backlog.

wmhutchison commented 2 years ago

Back from vacation; work on this ticket can resume.

wmhutchison commented 2 years ago

Moving to Backlog for now, as this is not really the same thing, or the same priority, as the work spent on the Canary apps for documentation/testing. We're making the Canary work a priority for now; this ticket will still exist as a reminder for Platform Operations to continue working on it as time permits and to keep it in mind during Openshift upgrades in the future.

wmhutchison commented 2 months ago

Discussed with the team during Backlog grooming. It was felt that this ticket is no longer applicable, since we now track dependencies with automation and docs for creating our ZenHub tickets. Actual handling/tracking of unit tests has now been delegated to the Platform Services team with the use of N8N.