BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team (This includes work for: 1. Platform Experience, 2. Developer Experience 3. Platform Operations/OCP 3)
Apache License 2.0

SDN - Issues regarding available pod IP ranges on EMERALD #4744

Closed wmhutchison closed 4 months ago

wmhutchison commented 7 months ago

Describe the issue

In recent weeks we were unable to provision new pods on EMERALD. NSX allocates a dedicated, unique IP range to each namespace for use by its pods, and the cause was determined to be IP exhaustion: the assigned /16 block has been entirely consumed by the hundreds of namespaces already in place, with more being added due to new interest in this cluster.
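To make the exhaustion arithmetic concrete, here is a minimal sketch of how many namespaces a /16 pool can serve when each namespace receives its own fixed-size block. The pool address `10.0.0.0/16` is a placeholder and the per-namespace prefix lengths are hypothetical; neither the real EMERALD range nor the actual NSX block size is stated in this issue.

```python
import ipaddress


def namespace_capacity(pool_cidr: str, per_namespace_prefix: int) -> int:
    """Return how many per-namespace blocks of the given prefix length fit in the pool."""
    pool = ipaddress.ip_network(pool_cidr)
    if not pool.prefixlen <= per_namespace_prefix <= pool.max_prefixlen:
        raise ValueError("per-namespace prefix must fit inside the pool")
    return 2 ** (per_namespace_prefix - pool.prefixlen)


# Placeholder pool: the real EMERALD pod range is not given in this issue.
POOL = "10.0.0.0/16"

# Hypothetical per-namespace block sizes, shown for comparison only.
for prefix in (24, 25, 26):
    print(f"/{prefix} per namespace -> {namespace_capacity(POOL, prefix)} namespaces")
```

Under these assumptions a /16 tops out at a few hundred to roughly a thousand namespaces, which is the same order of magnitude as the namespace count described above.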

This ticket will track the effort spent troubleshooting and resolving this issue, using vendor support (VMWare) to help move things along. Work-arounds are also being pursued to unblock current/immediate namespace provisioning needs while troubleshooting proceeds.

Blocked By

The upgrade to Openshift 4.14 has one final technical blocker: a version of ako that supports OCP 4.14, or confirmation from vendor support that the latest ako version will support OCP 4.14 even though the vendor docs state support only up to 4.13.

Additional context

Vendor case with VMWare is SR 24508863403.

How does this benefit the users of our platform?

Ensures a stable platform where users can provision their workloads with confidence.

KLAB2 OCP Upgrade Prep

Definition of done

wmhutchison commented 7 months ago

The vendor case was opened in early April, when the need to add more pod IP ranges to EMERALD arose. After receiving a new network range from the OCIO (we cannot and do not choose IP ranges; OCIO gate-keeps them all to ensure no collisions), we tested adding the newly allocated pod IP range to see whether KLAB2 would take it. The result was no real errors, but also no ability to see or use the new pod IP range, so the vendor case mentioned above was opened.

wmhutchison commented 7 months ago

After ongoing back-and-forth between Platform Operations and VMWare support, asking/answering questions and including a few Zoom screen-share calls to speed up the process, we received an answer from VMWare. We had thought that the change to our NSX configuration file was all that was needed to add the network, with nothing else to worry about, since that is what the vendor docs suggest.

Unfortunately that doesn't apply to the Openshift method of integrating with NSX. The answer given to us was that we can add this new pod IP range if we add it to a specific Openshift network object.

The issue though is that the mentioned Openshift network object is something that normally can only be set during the initial installation of an Openshift cluster and not changed later. This however appears to have changed as of Openshift version 4.14 which does allow the pod IP range changes we want to implement. https://docs.openshift.com/container-platform/4.14/networking/configuring-cluster-network-range.html

The next challenge is to upgrade KLAB2 from Openshift 4.12 to 4.14 and re-test to confirm this addresses our issue, then do the same on EMERALD. Many, but not all, of the NSX compatibility items have already been checked. Other checks are still needed, including whether our current Trident blocker for regular Openshift will also affect the NSX-backed Openshift clusters.

If all goes well, I am hoping to upgrade KLAB2 to OCP 4.14 early next week (April 15th) and, if no issues arise, upgrade EMERALD the following week.

An additional challenge here is that regular Openshift is normally ahead of NSX-backed Openshift version-wise. However, due to the Trident-related upgrade blocks, KLAB2/EMERALD will be the first clusters to upgrade to Openshift 4.14. We won't be able to harvest upgrade "lessons-learned" from regular Openshift like we normally do; instead we'll be dealing with any issues caused by the version change in KLAB2/EMERALD first. Because far fewer license plates are involved with EMERALD, the impact to user workloads is expected to be much less than what we'd deal with for a regular Openshift upgrade.

wmhutchison commented 7 months ago

While all of this work proceeds, we can unblock a small number of license plate requests on EMERALD by deleting non-key internal namespaces, which will be re-created using a smaller network block. Doing this for four existing namespaces frees up three license plates' worth of new product space on EMERALD. We can do this once or twice, but will likely run out of things to delete in this fashion.

As of April 10th there is capacity for three new license plates. Will notify internally so that folks involved with approving license plates are aware of this and can approve once other items like onboarding meetings are handled.
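As a rough illustration of the reclaim arithmetic, the sketch below computes how many license plates' worth of space is freed when namespaces are re-created with a smaller block. All sizes are assumptions chosen for illustration: the original /24 block, the replacement /26 block, and four namespaces per license plate are not stated in this issue. They are simply one combination under which shrinking four namespaces yields room for three license plates.

```python
def block_size(prefix: int) -> int:
    """Number of IPv4 addresses in a block with the given prefix length."""
    return 2 ** (32 - prefix)


def license_plates_freed(
    namespaces_shrunk: int,
    old_prefix: int,
    new_prefix: int,
    namespaces_per_plate: int,
) -> int:
    """Whole license plates that fit in the space freed by shrinking namespaces."""
    freed_addresses = namespaces_shrunk * (block_size(old_prefix) - block_size(new_prefix))
    new_blocks = freed_addresses // block_size(new_prefix)
    return new_blocks // namespaces_per_plate


# Hypothetical values: /24 -> /26, four namespaces per license plate.
print(license_plates_freed(namespaces_shrunk=4, old_prefix=24, new_prefix=26, namespaces_per_plate=4))  # 3
```

With other block sizes the result changes, so this should be read as a way to reason about the trade-off rather than a statement of the actual NSX configuration.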

wmhutchison commented 7 months ago

At present work is underway to upgrade KLAB2 from Openshift 4.12 to 4.14. We are working on the technical dependencies first and, if all goes to plan, will kick off KLAB2's upgrade to 4.14 next week (April 15th). Once upgraded, we can circle back and confirm VMWare's assertion that the Openshift upgrade does in fact resolve our issue. If it does, and no new technical issues arise from the OCP 4.14 upgrade, we can schedule an RFC to upgrade EMERALD as well.

wmhutchison commented 7 months ago

Noted the following when reviewing the Kyverno CCM app manifest to see what's being installed, versus what vendor docs state: https://kyverno.io/docs/installation/#compatibility-matrix

OCP 4.14 uses Kubernetes v1.27 which, according to the link, means we need to run Kyverno v1.11.x to remain compatible. Historically our OCP versions have sometimes been outside what Kyverno officially supported and things were fine, so this might not be a major concern yet; it may be as simple as "need newer K8S to support newer features, but existing features are just fine". We will need to do some core testing on the more unique NSX-based policies, which we do care about.

At present CCM installs Kyverno v1.7 for KLAB2/EMERALD, and v1.9 everywhere else.
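The version check described above can be sketched as a small comparison. This is only an illustration: the table holds the single pairing quoted in this comment (Kubernetes v1.27 with Kyverno v1.11.x), and the function and variable names are hypothetical. The full matrix lives in the Kyverno docs linked above.

```python
from typing import Dict, Tuple

# Only the pairing quoted in this comment; not the full Kyverno matrix.
REQUIRED_KYVERNO: Dict[str, Tuple[int, int]] = {"1.27": (1, 11)}


def major_minor(version: str) -> Tuple[int, int]:
    """Parse 'v1.11.4' or '1.7' into a (major, minor) tuple."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def matches_required(installed: str, kubernetes: str) -> bool:
    """True if the installed Kyverno release line is the one listed for this Kubernetes version."""
    return major_minor(installed) == REQUIRED_KYVERNO[kubernetes]


# Versions currently installed by CCM, as stated in this comment.
for cluster, kyverno in (("KLAB2/EMERALD", "v1.7"), ("everywhere else", "v1.9")):
    print(cluster, kyverno, "matches v1.11.x:", matches_required(kyverno, "1.27"))
```

Both currently installed versions fall outside the v1.11.x line, which is why the Kyverno upgrade is being looked at before the OCP upgrade.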

wmhutchison commented 7 months ago

There may be other compatibility issues that need to be addressed in CCM, but the only one of major interest was Kyverno in KLAB2. For the rest, it'll be a case of observing post-upgrade and working through what else needs to be changed/upgraded.

wmhutchison commented 7 months ago

https://github.com/bcgov-c/platform-ops/pull/489 was created to formalize the changes needed for the Trident upgrade, a prerequisite before upgrading Openshift. Just finalizing additional up-front steps not included in the PR and the playbook used for Trident upgrades, and then we should be good to go on upgrading Trident on KLAB2.

wmhutchison commented 7 months ago

Trident was successfully upgraded on April 17th 2024. The vendor docs stated incorrect image version values, which were quickly identified and fixed during the upgrade process. The PR has been updated so that this issue will not arise in any future OCP 4.14 upgrade work.

Regarding CCM apps which contain critical functionality that must work post-upgrade, Kyverno was identified as a component that needs to be upgraded to v1.11.x before attempting an OCP upgrade. Jason Leach was pinged but was not available, so Ian Watts has volunteered to help in his stead. A preliminary Kyverno upgrade has been put in place on KLAB2. Platform Ops and Ian Watts are currently testing the functionality of this upgrade to ensure all of the issues are worked out before putting in a formal PR to upgrade KLAB2 Kyverno. Once this is successfully done, we will be able to push forward with the upgrade of KLAB2 to Openshift 4.14.

wmhutchison commented 7 months ago

https://github.com/bcgov-c/platform-gitops-gen/pull/846 created by Ian Watts to formalize the Kyverno upgrade in KLAB2 to a version that will support Openshift 4.14. The intent is to finish approval/merge/applying this formal CCM update in KLAB2 before EOD today so that the last technical blocker for upgrading KLAB2 to Openshift 4.14 is removed.

wmhutchison commented 7 months ago

Another technical blocker, and a support ticket for VMWare, has been identified as well: the ako operator. Working on upgrading ako in KLAB2 to the very latest, but the vendor docs state that the latest version only supports up to OCP 4.13, and we need OCP 4.14. Will see what they have to say about this. They do support Kubernetes versions newer than what will be present on OCP 4.14, so we want to know if there are other considerations involved here, or whether ako will perhaps work on OCP 4.14 after all.

wmhutchison commented 6 months ago

VMWare support has responded. A version of ako that will support Openshift 4.14 (ako v1.12) is due to be released in the next few weeks, no precise date other than that given. Will follow up in a week to determine if there's a more precise release date in mind.

Having gone through an ako upgrade just recently, barring any major internal changes this additional upgrade should be fairly fast (one business day or less), and then we're off to upgrade KLAB2 to Openshift 4.14 and re-attempt pod IP additions.

wmhutchison commented 6 months ago

A recent migration from VMWare web portal to the new owner (Broadcom) web portal has taken place. Re-checked to confirm current docs and downloads links post-change.

Docs: https://docs.vmware.com/en/VMware-NSX-Advanced-Load-Balancer/index.html
Downloads: https://support.broadcom.com/group/ecx/productdownloads?subfamily=VMware%20Avi%20Load%20Balancer

While both should in theory be updated at the same time when a new version is released, we have seen occasions where the downloads page gets updated first while the docs lag behind.

wmhutchison commented 6 months ago

https://github.com/bcgov-c/platform-ops/pull/490 had its README updated since it was the last upgrade of AKO on KLAB2 (not yet applied to EMERALD), and following said README would be sufficient as documentation when ako drops.

wmhutchison commented 6 months ago

Since we consume ako by generating manifests from helm, running helm on EMERALD-UTIL is the best way of proactively checking whether ako 1.12 has dropped yet, since we are not sure how quickly the official docs are updated on a new release.

EMERALD/openshift-config ~ $ helm show chart oci://projects.registry.vmware.com/ako/helm-charts/ako --version 1.12.0
Error: failed to download "oci://projects.registry.vmware.com/ako/helm-charts/ako" at version "1.12.0"
EMERALD/openshift-config ~ $ helm show chart oci://projects.registry.vmware.com/ako/helm-charts/ako --version 1.12
Error: failed to download "oci://projects.registry.vmware.com/ako/helm-charts/ako" at version "1.12"
EMERALD/openshift-config ~ $ helm show chart oci://projects.registry.vmware.com/ako/helm-charts/ako --version 1.11.3
apiVersion: v2
appVersion: 1.11.3
description: A helm chart for Avi Kubernetes Operator
name: ako
type: application
version: 1.11.3
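The manual probing above could be scripted. The following is a minimal sketch that wraps the same `helm show chart ... --version` command and reports the first candidate version that resolves. It assumes `helm` is on the PATH and exits non-zero on the "failed to download" error shown above; the function names and the injectable `run` parameter are hypothetical conveniences, not part of helm.

```python
import subprocess
from typing import Callable, Iterable, Optional

# Chart location taken from the commands above.
CHART = "oci://projects.registry.vmware.com/ako/helm-charts/ako"


def chart_available(version: str, run: Callable = subprocess.run) -> bool:
    """Return True if `helm show chart` succeeds for this chart version."""
    result = run(
        ["helm", "show", "chart", CHART, "--version", version],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def first_available(versions: Iterable[str], run: Callable = subprocess.run) -> Optional[str]:
    """Return the first version in the list that helm can resolve, or None."""
    for version in versions:
        if chart_available(version, run=run):
            return version
    return None
```

Calling `first_available(["1.12.0", "1.12", "1.11.3"])` would repeat the three checks from the session above, newest candidate first. The `run` parameter exists only so the helper can be exercised without a registry connection.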

wmhutchison commented 6 months ago

VMWare just announced that ako 1.12 (1.12.1 according to docs) dropped today. Proceeding with upgrading ako in KLAB2 and waiting for some bake-in time (over the weekend should be fine). If all is well post-upgrade, then KLAB2 will be officially unblocked for an Openshift 4.14 upgrade.

wmhutchison commented 6 months ago

https://github.com/bcgov-c/platform-ops/pull/491 has updated content for ako 1.12.1, but has yet to be applied to KLAB2.

https://access.redhat.com/articles/7050846 is a stand-alone doc for handling EUS upgrades from 4.12 to 4.14, which has not yet been reviewed in great detail.

https://docs.openshift.com/container-platform/4.14/updating/updating_a_cluster/eus-eus-update.html is the general document on how an Openshift EUS upgrade works. It may or may not 100% match what was documented for the 4.10 -> 4.12 upgrade.

StevenBarre commented 4 months ago

Emerald upgraded to 4.14. Waiting for a network change tonight to allow routing of our new subnet and will test.

StevenBarre commented 4 months ago

Opened INC0096893 as traffic to the FP still isn't flowing

StevenBarre commented 4 months ago

Change RFC N-C02123317 has been created for implementation Monday June 24, 2024.

StevenBarre commented 4 months ago

Tested again and reached out to network for next troubleshooting steps

StevenBarre commented 4 months ago

Tested and working!