bcgov / digital-journeys

PSA Forms System
https://bcgov.github.io/digital-journeys/
Apache License 2.0

Documentation/Testing/Communication for OpenShift Upgrades #1319

Open warrenchristian1telus opened 1 year ago

warrenchristian1telus commented 1 year ago

This is a placeholder (maybe an Epic?) for tasks relating to the OpenShift upgrades (Oct 16 - Nov 3).

Options for minimizing impacts to users and team include:

1) Testing
2) Messaging
3) Server Enhancements

Testing can be done by disabling various services and identifying opportunities to improve messaging (errors, notifications) and workflow enhancements (block form submissions if database is unavailable, direct users to form drafts on submission failure, etc.).
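As a rough illustration of the workflow enhancements mentioned above (block submissions when the database is unavailable, fall back to a draft on failure), here is a minimal Python sketch. Every name in it (`submit_form`, `database_available`, `save_submission`, `save_draft`, `SubmissionResult`) is hypothetical and not taken from the digital-journeys codebase, and it assumes a draft can still be saved when a submission fails, which may not hold if drafts live in the same database.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class SubmissionResult:
    accepted: bool
    message: str
    saved_as_draft: bool = False


def submit_form(
    payload: Dict[str, Any],
    database_available: Callable[[], bool],          # hypothetical health probe
    save_submission: Callable[[Dict[str, Any]], None],  # hypothetical persistence call
    save_draft: Callable[[Dict[str, Any]], None],       # hypothetical draft store
) -> SubmissionResult:
    """Try to submit a form, degrading gently when a backing service is down."""
    # Block the submission up front if the database is known to be unavailable.
    if not database_available():
        return SubmissionResult(
            accepted=False,
            message="The form service is temporarily unavailable. Please try again later.",
        )
    try:
        save_submission(payload)
    except Exception:
        # Submission failed part-way: try to keep the user's work as a draft.
        try:
            save_draft(payload)
        except Exception:
            return SubmissionResult(
                accepted=False,
                message="Your form could not be submitted or saved. Please try again later.",
            )
        return SubmissionResult(
            accepted=False,
            message="Your form could not be submitted. It was saved to your drafts.",
            saved_as_draft=True,
        )
    return SubmissionResult(accepted=True, message="Your form was submitted.")
```

Passing the probe and storage calls in as functions makes it easy to simulate a disabled service during the test phases without touching a real database.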

Messaging could range from email communications to site upgrade notifications, errors, or support request scripts.

Server enhancements are underway, and are expected to include additional redundancy (x3) for each server / service. This will rely on successful upgrade tasks as well as sufficient namespace resources (CPU, memory) to accommodate the additional pods.
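To get a feel for whether the namespace has room for the extra pods, a back-of-the-envelope calculation along these lines may help. This is only a sketch: the `Service` fields, the function names, and any numbers fed into it are illustrative assumptions, not measurements or quotas from our namespaces.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Service:
    name: str
    cpu_millicores: int  # CPU request per pod (assumed value)
    memory_mib: int      # memory request per pod (assumed value)
    replicas: int        # replicas currently running


def extra_requests(services: Iterable[Service], target_replicas: int = 3) -> Tuple[int, int]:
    """Return the additional (CPU millicores, memory MiB) needed to reach target_replicas."""
    extra_cpu = 0
    extra_mem = 0
    for svc in services:
        # Services already at or above the target need nothing extra.
        added = max(0, target_replicas - svc.replicas)
        extra_cpu += added * svc.cpu_millicores
        extra_mem += added * svc.memory_mib
    return extra_cpu, extra_mem


def fits_in_namespace(
    services: Iterable[Service],
    free_cpu_millicores: int,
    free_memory_mib: int,
    target_replicas: int = 3,
) -> bool:
    """True if the unused request quota covers the additional replicas."""
    cpu, mem = extra_requests(services, target_replicas)
    return cpu <= free_cpu_millicores and mem <= free_memory_mib
```

For example, two single-replica services requesting 100m/256Mi and 500m/1024Mi would need an extra 1200m CPU and 2560Mi memory to reach three replicas each.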

We will likely want to run test phases before and after the upgrade to document improvements, and identify any messaging or support procedure changes. For example, if upgrades are no longer expected to cause outages for users, we may want to disable up-front communications such as emails or site warnings/notifications during upgrades.

MeghanStothers commented 11 months ago

There are several emails about this upgrade attached to this ticket. Sept 29 update:

What is happening?

• CHG0052476 - MCS SILVER - 2023 Q4 Patching - Hard Prune of Image registry, ETCD Defrag and Trident Upgrade
• CHG0052425 - MCS SILVER - 2023 Q4 Patching - Upgrade to OCP 4.13.12
• CHG0052477 - MCS GOLD - 2023 Q4 Patching - Hard Prune of Image registry, ETCD Defrag and Trident Upgrade
• CHG0052479 - MCS GOLD - 2023 Q4 Patching - Upgrade to OCP 4.13.12
• CHG0052478 - MCS GOLDDR - 2023 Q4 Patching - Hard Prune of Image registry, ETCD Defrag and Trident Upgrade
• CHG0052480 - MCS GOLDDR - 2023 Q4 Patching - Upgrade to OCP 4.13.12

These six changes represent the work involved to upgrade the production clusters from OpenShift 4.12 to 4.13.

When?

CHG0052476 - MCS SILVER - 2023 Q4 Patching - Hard Prune of Image registry, ETCD Defrag and Trident Upgrade will commence at 0600 on Sunday October 15th and run as late as 0900.

CHG0052425 - MCS SILVER - 2023 Q4 Patching - Upgrade to OCP 4.13.12 will commence at 9:00am on Monday October 16th and run until November 3rd. Upgrade efforts will be allowed to proceed between 9:00am and 5:00pm, and be paused after-hours to ensure developers are present during node drains to look after their apps if needed.

CHG0052477 - MCS GOLD - 2023 Q4 Patching - Hard Prune of Image registry, ETCD Defrag and Trident Upgrade will commence at 0600 on Sunday November 5th and run as late as 0900.

CHG0052479 - MCS GOLD - 2023 Q4 Patching - Upgrade to OCP 4.13.12 will commence at 9:00am on November 6th and run until November 10th. Upgrade efforts will be allowed to proceed between 9:00am and 5:00pm, and be paused after-hours to ensure developers are present during node drains to look after their apps if needed.

CHG0052478 - MCS GOLDDR - 2023 Q4 Patching - Hard Prune of Image registry, ETCD Defrag and Trident Upgrade will commence at 0600 on Sunday November 19th and run as late as 0900.

CHG0052480 - MCS GOLDDR - 2023 Q4 Patching - Upgrade to OCP 4.13.12 will commence at 9:00am on Monday November 20th and run until November 24th. Upgrade efforts will be allowed to proceed between 9:00am and 5:00pm, and be paused after-hours to ensure developers are present during node drains to look after their apps if needed.

Will there be an impact on the Platform apps?

Changes CHG0052476, CHG0052477 and CHG0052478 will involve orphaned image data cleanup, defragmenting the ETCD database, and upgrading the storage provisioner tool Trident. Defragmenting ETCD could cause some latency and slowness but isn't expected to cause any breakage as each pod will be done individually with time to recover between operations. The hard prune will put the registry in read-only mode briefly while the pruner removes any orphaned images. The Trident upgrade may cause a brief window where new storage cannot be allocated, expected to be under 5 minutes. It will not impact existing storage.

Changes CHG0052425, CHG0052479 and CHG0052480 will involve both an upgrade of the core of the cluster and draining/reboots of all nodes, which will require pods to be moved onto other nodes. New pods may be slow to come up if many of them attempt to start at once on an empty (just patched) node. We will announce each day when this work resumes and when it is suspended at the end of the work day. Depending on the availability of new firmware, we may need to do another node drain for firmware after the OpenShift upgrade.

There are a few items to note about this upgrade that could cause issues for some workloads:

There are three Kubernetes resources affected by this upcoming upgrade, but only one applies to people performing development work for applications hosted inside OpenShift. The applicable resource is the following:

HorizontalPodAutoscaler
• API version changes from v2beta2 to v2.
• No other changes to note.

How to Check Your Application for API Compatibility Issues:

Audit existing manifest content - If any of your application manifests are stored as YAML or JSON, you can search the manifest files yourself for resources of kind HorizontalPodAutoscaler with an API version of v2beta2.
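The manifest audit described above could be scripted roughly as follows. This is a text-based heuristic rather than a real YAML/JSON parser, so treat hits as candidates to review by hand. It assumes the deprecated version appears in manifests as `autoscaling/v2beta2`; the function names and the directory you point it at are illustrative.

```python
import re
from pathlib import Path
from typing import List

# Match `kind: HorizontalPodAutoscaler` in YAML or `"kind": "HorizontalPodAutoscaler"` in JSON.
KIND_RE = re.compile(r'''["']?kind["']?\s*:\s*["']?HorizontalPodAutoscaler\b''')
# Match an apiVersion of autoscaling/v2beta2 in either format.
API_RE = re.compile(r'''["']?apiVersion["']?\s*:\s*["']?autoscaling/v2beta2\b''')

MANIFEST_SUFFIXES = {".yaml", ".yml", ".json"}


def uses_deprecated_hpa(text: str) -> bool:
    """True if any document in the text looks like a v2beta2 HorizontalPodAutoscaler."""
    # Multi-document YAML files are separated by lines containing only '---'.
    documents = re.split(r"^---\s*$", text, flags=re.MULTILINE)
    return any(KIND_RE.search(doc) and API_RE.search(doc) for doc in documents)


def find_affected_manifests(root: str) -> List[Path]:
    """Walk a directory tree and list manifest files that need a closer look."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in MANIFEST_SUFFIXES:
            text = path.read_text(encoding="utf-8", errors="ignore")
            if uses_deprecated_hpa(text):
                hits.append(path)
    return hits


if __name__ == "__main__":
    # Scan the current directory; pass a different root as needed.
    for manifest in find_affected_manifests("."):
        print(manifest)
```

Files that bundle several resources in a single document (for example a template or a List object) can produce false positives, since the check only confirms that both strings occur in the same document.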

Do I need to do anything?

Monitor your applications to ensure they continue working as expected.

Check the Rocket.Chat #devops-alert channel for the announcement that the change is complete, then check the health of your app.

Where do I get help if my app doesn't work after the change is complete?

Each Platform application has an assigned DevOps Specialist within the Ministry, so contact them first. If you don't know who your assigned DevOps Specialist is, check with the app's Product Owner.

The DevOps Specialist will troubleshoot the issue with the app and if they need help, they will reach out to the Platform Services Team and the Developer Community in Rocket.Chat as per these RocketChat Channel Use Guidelines.

MeghanStothers commented 11 months ago

@Stella FYI for now. We'll keep these updates in mind as part of our launch planning, and flag any testing/technical work we'd need to do following these OpenShift upgrades (dependencies for mat/pat launch).

fazil-ey commented 11 months ago

Potential idea -

@warrenchristian1telus - gentle failure message

MeghanStothers commented 11 months ago

Stella and Meghan to work on a comms plan and understand what is possible in terms of active messaging.

fazil-ey commented 11 months ago

#1386 has ideas on improving health checks.

fazil-ey commented 11 months ago

@Stella-Archer to document Comms Plan/Checklist. Once we have that we can close this ticket. Mike to work on this as well.