OCP 4.12 Upgrade ncp image to latest supported version in KLAB2

wmhutchison commented 1 year ago

Describe the issue

The ncp operator and its related software are the glue that integrates OpenShift with NSX. Previously, a technical blocker prevented us from upgrading to OpenShift 4.12 because ncp did not yet support OCP 4.12. That is no longer the case.

Additional context

This ticket can be executed at any time, as long as it is done before any attempt to upgrade OpenShift past version 4.10.

Definition of done

[x] ncp upgrade to latest supported version.

wmhutchison commented 1 year ago

Asked Cailey for a new Artifactory service account and repo. These will be used to store this ncp image, and can also be used for other images the Platform Ops team maintains.

wmhutchison commented 1 year ago

First time reviewing the zip download from VMware for ncp. It includes some confusing YAML (20 files' worth), but that seems to be just an internal audit of all resources created/managed, since many of those resources are created/managed directly by the operator.

https://github.com/vmware/nsx-container-plugin-operator/tree/main/deploy/openshift4 is the place to be for final audits before we attempt an ncp upgrade in KLAB2. Seems the only thing we need to really check over is the configmap resource. Everything else is more or less the same as before.
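One way to do that configmap check is to dump the live resource and diff it against the copy shipped with the new release. This is a rough sketch only: the ConfigMap name `nsx-ncp-operator-config` and the local file paths are assumptions, and the namespace is taken from the `oc logs` command used later in this ticket.

```shell
# Dump the live operator ConfigMap so it can be compared with the new release's copy.
# Assumption: the ConfigMap is named "nsx-ncp-operator-config"; verify in KLAB2 first.
dump_operator_configmap() {
  ns="${1:-nsx-system-operator}"
  name="${2:-nsx-ncp-operator-config}"
  oc get configmap "$name" -n "$ns" -o yaml
}

# Usage (new-release/configmap.yaml is a hypothetical local path to the upstream file):
#   dump_operator_configmap > live-configmap.yaml
#   diff -u live-configmap.yaml new-release/configmap.yaml
```

Expect some noise in the diff from server-side metadata (resourceVersion, uid and so on); the interesting part is the data section.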

wmhutchison commented 1 year ago

Also need to remember to add the new Artifactory service account's credentials to the global pull secret in KLAB2.
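For reference, a sketch of the usual fetch/merge/write-back sequence for the global pull secret, wrapped in small functions. The registry hostname and credentials in the usage comment are placeholders, not the real Artifactory values.

```shell
# Fetch the cluster-wide pull secret into a local file.
fetch_global_pull_secret() {
  out="${1:-pull-secret.json}"
  oc get secret/pull-secret -n openshift-config \
    --template='{{index .data ".dockerconfigjson" | base64decode}}' > "$out"
}

# Add credentials for one registry to that local file.
add_registry_auth() {
  registry="$1"
  creds="$2"   # "user:password" or "user:token"
  file="${3:-pull-secret.json}"
  oc registry login --registry="$registry" --auth-basic="$creds" --to="$file"
}

# Write the merged file back to the cluster.
push_global_pull_secret() {
  file="${1:-pull-secret.json}"
  oc set data secret/pull-secret -n openshift-config \
    --from-file=.dockerconfigjson="$file"
}

# Usage (hostname and credentials are hypothetical placeholders):
#   fetch_global_pull_secret
#   add_registry_auth artifacts.example.com 'svc-account:token'
#   push_global_pull_secret
```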

wmhutchison commented 1 year ago

Old image values in KLAB2:

Image: nsx-container-plugin-operator:v4.0.1
NCP_IMAGE: ncp3x:401ubi

New image values:

Image: nsx-container-plugin-operator:v4.1.1
NCP_IMAGE: nsx-ncp-ubi:4.1.1.0
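If the change is applied directly to the running deployment rather than through the manifest, it could look roughly like the sketch below. Assumptions to check first: the deployment is named `nsx-ncp-operator` (inferred from the operator pod name), `NCP_IMAGE` is an environment variable on that deployment, and `$REGISTRY` is a placeholder for whatever registry prefix the images are pulled from.

```shell
# Point the operator deployment at the new operator image and new NCP image.
# Assumption: deployment "nsx-ncp-operator" with an NCP_IMAGE env var.
update_operator_images() {
  operator_image="$1"
  ncp_image="$2"
  ns="${3:-nsx-system-operator}"
  # "*" applies the image to every container in the pod template.
  oc set image deployment/nsx-ncp-operator "*=$operator_image" -n "$ns"
  oc set env deployment/nsx-ncp-operator "NCP_IMAGE=$ncp_image" -n "$ns"
}

# Usage (REGISTRY is a hypothetical placeholder):
#   update_operator_images "$REGISTRY/nsx-container-plugin-operator:v4.1.1" \
#                          "$REGISTRY/nsx-ncp-ubi:4.1.1.0"
```

Each of the two commands triggers its own rollout of the operator deployment, so editing the manifest and applying it once may be the tidier option.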

wmhutchison commented 1 year ago

Ran into an issue with the new ncp operator image, which generated the following error when run.

$ oc -n nsx-system-operator logs nsx-ncp-operator-75d7dc4668-qkgld
nsx-ncp-operator: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by nsx-ncp-operator)
nsx-ncp-operator: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by nsx-ncp-operator)

The new image on Docker Hub was published a month ago. There is, however, a newer image there tagged only as "latest". Changing the tag to "latest" resolved this issue.
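This error means the operator binary was linked against a newer glibc than the image's /lib64/libc.so.6 provides. A small helper like the one below (the function name is made up for illustration) pulls the required versions out of the log so they can be compared with what the image actually ships.

```shell
# Read log text on stdin and print each distinct GLIBC version it asks for.
required_glibc_versions() {
  grep -o 'GLIBC_[0-9][0-9.]*' | sort -u
}

# Usage (pod name will differ per rollout):
#   oc logs nsx-ncp-operator-75d7dc4668-qkgld -n nsx-system-operator 2>&1 \
#     | required_glibc_versions
```

Running `ldd --version` inside the image, if it is available there, shows the glibc version to compare against.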

wmhutchison commented 1 year ago

Also ran into some issues with the roll-out of the new nsx agent/bootstrap pods on the various nodes. One of the stuck pods was on a master, so did a node drain/reboot. That did unblock the roll-out, but it caused new problems with API/etcd availability because the roll-out continued on the other masters at the same time, which affected overall stability for a short while.

Lesson learned: for this scenario in the future (especially for EMERALD), do not do a node drain/reboot unless no other option works. If the roll-out is stuck, start by deleting the affected NSX pods instead.
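A sketch of that gentler first step: list the NSX pods on the affected node, then delete only the stuck one so its controller recreates it. The namespace `nsx-system` is an assumption here; adjust it to wherever the agent/bootstrap pods run in the cluster.

```shell
# List NSX pods scheduled on one node.
# Assumption: the agent/bootstrap pods live in the "nsx-system" namespace.
list_nsx_pods_on_node() {
  node="$1"
  ns="${2:-nsx-system}"
  oc get pods -n "$ns" -o wide --field-selector "spec.nodeName=$node"
}

# Delete a single stuck pod; its controller recreates it.
delete_nsx_pod() {
  pod="$1"
  ns="${2:-nsx-system}"
  oc delete pod "$pod" -n "$ns"
}

# Usage (node and pod names are placeholders):
#   list_nsx_pods_on_node <node-name>
#   delete_nsx_pod <stuck-pod-name>
```

Working through one node at a time, and waiting for the replacement pod to become Ready before moving on, avoids touching several masters at once.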

BCDevOps / developer-experience

OCP 4.12 Upgrade ncp image to latest supported version in KLAB2 #4264