IBM / cloud-pak-deployer

Configuration-based installation of OpenShift and Cloud Pak for Data/Integration/Watson AIOps on various private and public cloud infrastructure providers. Deployment attempts to achieve the end-state defined in the configuration. If something fails along the way, you only need to restart the process to continue the deployment.
https://ibm.github.io/cloud-pak-deployer/
Apache License 2.0

cloud-pak-deployer introduces kubelet config conflict on upgrade #487

Closed bhpratt closed 1 year ago

bhpratt commented 1 year ago

Describe the bug

In OpenShift 4.10 on IBM Cloud (ROKS), this feature gate is disabled in the kubelet config: CSIMigrationOpenStack: false. In 4.11, this feature gate is enabled, so the line is removed from the kubelet config.

When attempting to upgrade a ROKS cluster running cp4d and cloud-pak-deployer, the worker node went critical and the kubelet gave this error: hyperkube[25356]: Error: failed to set feature gates from initial flags-based config: cannot set feature gate CSIMigrationOpenStack to false, feature is locked to true

We eventually determined that the kubelet config was overridden by this configmap: cloud-pak-node-fix-config and this daemonset: cloud-pak-crontab-ds. The configmap still had the old disabled feature gate and the daemonset was applying this configmap to every node which caused the kubelet error.

This appears to come from this code: https://github.com/IBM/cloud-pak-deployer/blob/63b7bc1227c06fad790768e31588815931750605/scripts/cp4d/cp4d-apply-non-mco-cluster-settings.sh#L51 - where the script copies the kubelet config, adds some additional lines, and then updates the configmap. However, when there are upstream changes to the kubelet config, the script has no way of seeing those changes. In this case, it continued to carry the old 4.10 config over into 4.11.
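To illustrate the drift described above, here is a minimal Python sketch that compares a stored copy of the kubelet config with the config a node currently ships. It assumes both configs have already been parsed into dictionaries with a `featureGates` mapping; the function name `stale_feature_gates` and the gate `ExampleGate` are hypothetical and are not part of the deployer.

```python
def stale_feature_gates(stored_config: dict, live_config: dict) -> dict:
    """Return gates set in the stored copy that the live config no longer sets."""
    stored = stored_config.get("featureGates", {})
    live = live_config.get("featureGates", {})
    return {name: value for name, value in stored.items() if name not in live}


# Stored copy taken at 4.10 install time (ExampleGate is a made-up placeholder).
stored_410 = {"featureGates": {"CSIMigrationOpenStack": False, "ExampleGate": True}}
# Config shipped with 4.11, where the CSIMigrationOpenStack line has been removed.
live_411 = {"featureGates": {"ExampleGate": True}}

stale = stale_feature_gates(stored_410, live_411)
print(stale)  # {'CSIMigrationOpenStack': False}
```

Any gate reported here would be written back to the node by a process that reapplies the stored copy, which matches the failure seen in this issue.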

We fixed this by removing CSIMigrationOpenStack: false from cloud-pak-node-fix-config configmap and rebooting the node. After that we could upgrade all nodes in the cluster successfully.

To Reproduce

Steps to reproduce the behavior:

  1. Create a 4.10 OpenShift cluster in IBM Cloud
  2. Install cp4d + cloud-pak-deployer
  3. Upgrade api server to 4.11
  4. Upgrade worker node to 4.11
  5. See kubelet error: hyperkube[25356]: Error: failed to set feature gates from initial flags-based config: cannot set feature gate CSIMigrationOpenStack to false, feature is locked to true

Expected behavior

Worker node is upgraded with the correct kubelet config.

neelashah commented 1 year ago

We have recently discovered two additional aspects of the original problem.

  1. There are more feature gates going from 4.10 to 4.11 causing these issues.
  2. There are feature gates going from 4.11 to 4.12 that are also causing issues for the same reason.

This is going to be an ongoing issue. If the intent is to continue using the deployer tool for cp4d, these types of changes in the community need to be accounted for in every release, and the deployer needs to be kept in sync with them. Otherwise, this will continue to cause major headaches for customers (for both ROKS and CP4D).

neelashah commented 1 year ago

4.10 --> 4.11: CSIMigrationOpenStack, ServiceLBNodePortControl, CSIMigrationAzureDisk

Additional feature flags in the cloud-pak-node-fix-config configmap became GA in 4.12 and are not compatible with 4.12: CSIMigrationAWS: False, CSIMigrationGCE: False, CRIContainerLogRotation: true
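As a rough pre-upgrade check, the gates reported in this thread can be kept in a lookup table and compared against the `featureGates` held in the configmap. This is only a sketch: the table contains just the names mentioned here and is not a complete list for either release, and `REPORTED_GATES` and `conflicting_gates` are hypothetical names.

```python
# Gates reported in this issue as causing kubelet failures after each upgrade.
# Not an exhaustive list for either release.
REPORTED_GATES = {
    "4.11": ["CSIMigrationOpenStack", "ServiceLBNodePortControl", "CSIMigrationAzureDisk"],
    "4.12": ["CSIMigrationAWS", "CSIMigrationGCE", "CRIContainerLogRotation"],
}


def conflicting_gates(feature_gates: dict, target_version: str) -> list:
    """List reported gates for target_version that are still set in feature_gates."""
    reported = REPORTED_GATES.get(target_version, [])
    return [name for name in reported if name in feature_gates]


# featureGates as they might appear in the stored configmap before a 4.12 upgrade.
configmap_gates = {
    "CSIMigrationAWS": False,
    "CSIMigrationGCE": False,
    "CRIContainerLogRotation": True,
}
to_remove = conflicting_gates(configmap_gates, "4.12")
```

A non-empty result means the stored config should be cleaned up before the worker nodes are upgraded.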

fketelaars commented 1 year ago

@neelashah I'm working on changing the scripts that update the CRI-O config and Kubelet config. Rather than relying on the configuration at initial deployment time, we will be making the changes through a script on the node. This will just add/insert the config items needed for CP4D.
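The idea can be sketched in Python as follows: start from whatever config the node currently has and layer only the required settings on top, instead of replacing the file with a stored copy. This is not the deployer's actual script. The helper `apply_overrides`, the use of `podPidsLimit` as the example setting, its value, and the gate `ExampleGate` are all assumptions for illustration.

```python
from copy import deepcopy


def apply_overrides(live_config: dict, overrides: dict) -> dict:
    """Return a copy of live_config with overrides merged in, key by key."""
    merged = deepcopy(live_config)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested mappings so sibling keys in the live config survive.
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Config as shipped on a 4.11 node (ExampleGate is a made-up placeholder).
live_411 = {"kind": "KubeletConfiguration", "featureGates": {"ExampleGate": True}}
# Only the settings to add; the key and value here are illustrative.
overrides = {"podPidsLimit": 12288}

result = apply_overrides(live_411, overrides)
```

Because the starting point is always the node's current config, a gate that upstream has removed is never reintroduced.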

neelashah commented 1 year ago

@fketelaars This is a good solution for the longer term. Thanks @fketelaars!

fketelaars commented 1 year ago

Implemented the changes. The deployer now uses scripts to update kubelet.conf and crio.conf, rather than storing these config files at initial deployment. Tested an upgrade to OCP 4.11 and it worked.

fketelaars commented 1 year ago

Fixed in fk-misc branch.

bhpratt commented 1 year ago

@fketelaars Question - is deployer versioned? If so, is there a version of Deployer we can point customers to that has this fix?

Thanks.

neelashah commented 1 year ago

@fketelaars Thanks for addressing this issue.

In addition to the question above, we need to understand what customers who already have deployer in their environment should do. How do they update the deployer version on an existing environment? We need to tell customers to upgrade deployer on the cluster BEFORE they attempt a worker node update.

fketelaars commented 1 year ago

@neelashah When the new version of deployer is run, it will replace the existing method automatically. The customer just needs to update deployer before attempting the OCP upgrade.

fketelaars commented 1 year ago

@bhpratt Deployer uses continuous delivery. If wanted, we can create a version tag you can reference.