IBM / cloud-pak-deployer

Configuration-based installation of OpenShift and Cloud Pak for Data/Integration/Watson AIOps on various private and public cloud infrastructure providers. Deployment attempts to achieve the end-state defined in the configuration. If something fails along the way, you only need to restart the process to continue the deployment.
https://ibm.github.io/cloud-pak-deployer/
Apache License 2.0

Provisioning of OpenShift on vSphere fails #781

Open fketelaars opened 2 months ago

fketelaars commented 2 months ago

Describe the bug
When running the deployer to provision an OpenShift cluster on vSphere, the following error occurs:

TASK [provision-ipi : Make sure the specified VM folder exists] ****************
Tuesday 10 September 2024  05:22:38 +0000 (0:00:00.026)       0:01:08.235 ***** 
fatal: [localhost]: FAILED! => {"msg": "Could not find imported module support code for ansible_collections.community.vmware.plugins.modules.vcenter_folder.  Looked for (['ansible.module_utils.compat.version.StrictVersion', 'ansible.module_utils.compat.version'])"}

PLAY RECAP *********************************************************************

Solution
Remove the dependency on the community.vmware Galaxy collection.
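For anyone hitting the same error before the collection dependency is removed, pre-creating the VM folder outside of Ansible worked around the failing task (see the follow-up comment below). A minimal sketch using pyvmomi, assuming the folder should sit directly under the datacenter's VM folder; the vCenter host, credentials, datacenter and folder names are placeholders, not values taken from the deployer configuration:

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Placeholder connection details -- replace with the values from your vSphere configuration.
VCENTER_HOST = "vcenter.example.com"
VCENTER_USER = "administrator@vsphere.local"
VCENTER_PASS = "changeme"
DATACENTER   = "Datacenter1"
VM_FOLDER    = "arrow-cluster"

# Connect to vCenter (certificate verification disabled here purely for brevity).
ctx = ssl._create_unverified_context()
si = SmartConnect(host=VCENTER_HOST, user=VCENTER_USER, pwd=VCENTER_PASS, sslContext=ctx)
try:
    content = si.RetrieveContent()
    # Locate the datacenter by name under the vCenter root folder.
    dc = next(
        entity
        for entity in content.rootFolder.childEntity
        if isinstance(entity, vim.Datacenter) and entity.name == DATACENTER
    )
    try:
        # Create the folder directly under the datacenter's "vm" folder.
        dc.vmFolder.CreateFolder(VM_FOLDER)
        print(f"Created VM folder {VM_FOLDER}")
    except vim.fault.DuplicateName:
        print(f"VM folder {VM_FOLDER} already exists")
finally:
    Disconnect(si)
```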

fketelaars commented 3 weeks ago

Commenting out the vcenter_folder task and pre-creating the folder in vCenter got past the error. Now hitting an issue while creating the OpenShift cluster with openshift-install. OpenShift installer log file:

level=info msg=Not all ingress controllers are available.
level=error msg=Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1: Some pods are not scheduled: Pod "router-default-7dff78bcd6-5k82m" cannot be scheduled: 0/2 nodes are available: 2 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.. Pod "router-default-7dff78bcd6-9r8zm" cannot be scheduled: 0/2 nodes are available: 2 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.. Make sure you have sufficient worker nodes.), CanaryChecksSucceeding=Unknown (CanaryRouteNotAdmitted: Canary route is not admitted by the default ingress controller)
level=info msg=Cluster operator ingress EvaluationConditionsDetected is False with AsExpected: 
level=info msg=Cluster operator insights ClusterTransferAvailable is Unknown with : 
level=info msg=Cluster operator insights Disabled is False with AsExpected: 
level=info msg=Cluster operator insights SCAAvailable is Unknown with : 
level=error msg=Cluster operator kube-apiserver Degraded is True with GuardController_SyncError::NodeController_MasterNodesReady: GuardControllerDegraded: Missing operand on node arrow-cluster-nng8p-master-1
level=error msg=NodeControllerDegraded: The master nodes not ready: node "arrow-cluster-nng8p-master-0" not ready since 2024-11-05 15:36:45 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
level=info msg=Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 0; 0 nodes have achieved new revision 5
level=error msg=Cluster operator kube-apiserver Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 2 nodes are at revision 0; 0 nodes have achieved new revision 5
level=info msg=Cluster operator kube-apiserver EvaluationConditionsDetected is False with AsExpected: All is well
level=error msg=Cluster operator kube-controller-manager Degraded is True with GuardController_SyncError::NodeController_MasterNodesReady::StaticPods_Error: GuardControllerDegraded: Missing operand on node arrow-cluster-nng8p-master-1
level=error msg=NodeControllerDegraded: The master nodes not ready: node "arrow-cluster-nng8p-master-0" not ready since 2024-11-05 15:36:45 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
level=error msg=StaticPodsDegraded: pod/kube-controller-manager-arrow-cluster-nng8p-master-0 container "cluster-policy-controller" is waiting: ContainerCreating: 
level=error msg=StaticPodsDegraded: pod/kube-controller-manager-arrow-cluster-nng8p-master-0 container "kube-controller-manager" is waiting: ContainerCreating: 
level=error msg=StaticPodsDegraded: pod/kube-controller-manager-arrow-cluster-nng8p-master-0 container "kube-controller-manager-cert-syncer" is waiting: ContainerCreating: 
level=error msg=StaticPodsDegraded: pod/kube-controller-manager-arrow-cluster-nng8p-master-0 container "kube-controller-manager-recovery-controller" is waiting: ContainerCreating: 
level=info msg=Cluster operator kube-controller-manager Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 0; 0 nodes have achieved new revision 7
level=error msg=Cluster operator kube-controller-manager Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 2 nodes are at revision 0; 0 nodes have achieved new revision 7
level=info msg=Cluster operator kube-controller-manager EvaluationConditionsDetected is Unknown with NoData: 
level=error msg=Cluster operator kube-scheduler Degraded is True with GuardController_SyncError::NodeController_MasterNodesReady: GuardControllerDegraded: Missing operand on node arrow-cluster-nng8p-master-1
level=error msg=NodeControllerDegraded: The master nodes not ready: node "arrow-cluster-nng8p-master-0" not ready since 2024-11-05 15:36:45 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
level=info msg=Cluster operator kube-scheduler Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 0; 0 nodes have achieved new revision 7
level=error msg=Cluster operator kube-scheduler Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 2 nodes are at revision 0; 0 nodes have achieved new revision 7
level=info msg=Cluster operator kube-scheduler EvaluationConditionsDetected is Unknown with NoData: 
level=info msg=Cluster operator machine-api Progressing is True with SyncingResources: Progressing towards operator: 4.15.37
level=error msg=Cluster operator machine-api Degraded is True with SyncingFailed: Failed when progressing towards operator: 4.15.37 because error syncing machine-api-controller: Internal error occurred: admission plugin "image.openshift.io/ImagePolicy" failed to complete mutation in 13s
level=error msg=Cluster operator machine-api Available is False with Initializing: Operator is initializing
level=error msg=Cluster operator machine-config Degraded is True with MachineConfigDaemonFailed: Failed to resync 4.15.37 because: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 2, updated: 2, ready: 1, unavailable: 1)]
level=error msg=Cluster operator machine-config Available is False with MachineConfigDaemonFailed: Cluster not available for [{operator 4.15.37}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 2, updated: 2, ready: 1, unavailable: 1)]
level=info msg=Cluster operator machine-config EvaluationConditionsDetected is False with AsExpected: 
level=error msg=Cluster operator monitoring Available is False with UpdatingPrometheusOperatorFailed: UpdatingPrometheusOperator: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: context deadline exceeded
level=error msg=Cluster operator monitoring Degraded is True with UpdatingPrometheusOperatorFailed: UpdatingPrometheusOperator: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: context deadline exceeded
level=info msg=Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
level=info msg=Cluster operator network ManagementStateDegraded is False with : 
level=info msg=Cluster operator network Progressing is True with Deploying: DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
level=info msg=DaemonSet "/openshift-network-node-identity/network-node-identity" is not available (awaiting 1 nodes)
level=info msg=DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
level=info msg=DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes)
level=info msg=DaemonSet "/openshift-multus/multus-additional-cni-plugins" is not available (awaiting 1 nodes)
level=info msg=DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
level=info msg=Deployment "/openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
level=info msg=Deployment "/openshift-ovn-kubernetes/ovnkube-control-plane" is not available (awaiting 1 nodes)
level=info msg=Cluster operator node-tuning Progressing is True with ProfileProgressing: Waiting for 1/2 Profiles to be applied
level=info msg=Cluster operator openshift-apiserver Progressing is True with APIServerDeployment_PodsUpdating: APIServerDeploymentProgressing: deployment/apiserver.openshift-apiserver: 1/2 pods have been updated to the latest generation
level=info msg=Cluster operator openshift-controller-manager Progressing is True with _DesiredStateNotYetAchieved: Progressing: deployment/controller-manager: updated replicas is 1, desired replicas is 2
level=info msg=Progressing: deployment/route-controller-manager: updated replicas is 1, desired replicas is 2
level=error msg=Cluster operator operator-lifecycle-manager-packageserver Available is False with ClusterServiceVersionNotSucceeded: ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver observed in phase Failed with reason: InstallCheckFailed, message: install failed: deployment packageserver not ready before timeout: deployment "packageserver" exceeded its progress deadline
level=info msg=Cluster operator storage Progressing is True with VSphereCSIDriverOperatorCR_VMwareVSphereDriverNodeServiceController_Deploying: VSphereCSIDriverOperatorCRProgressing: VMwareVSphereDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods
level=error msg=Bootstrap failed to complete: timed out waiting for the condition
level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.
level=warning msg=The bootstrap machine is unable to resolve API and/or API-Int Server URLs
level=info msg=    root : PWD=/var/opt/openshift ; USER=root ; ENV=KUBECONFIG=/opt/openshift/auth/kubeconfig ; COMMAND=/bin/oc --request-timeout=5s get events --all-namespaces -o json
level=info msg=    root : PWD=/var/opt/openshift ; USER=root ; ENV=KUBECONFIG=/opt/openshift/auth/kubeconfig ; COMMAND=/bin/oc --request-timeout=5s get machineconfigs -o json
level=info msg=    root : PWD=/var/opt/openshift ; USER=root ; ENV=KUBECONFIG=/opt/openshift/auth/kubeconfig ; COMMAND=/bin/oc --request-timeout=5s get nodes -o json
level=info msg=Bootstrap gather logs captured here "/root/cpd-status/vsphere-ipi/arrow-cluster/log-bundle-20241105155443.tar.gz"
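Since the installer also warns that the bootstrap machine is unable to resolve the API and/or API-Int server URLs, it may be worth verifying DNS from the deployer or bootstrap side before retrying. A small check, assuming the cluster name and base domain shown below (placeholders, substitute the values from the deployer configuration):

```python
import socket

# Placeholders -- use the cluster name and base domain from the deployer configuration.
cluster_name = "arrow-cluster"
base_domain = "example.com"

for host in (f"api.{cluster_name}.{base_domain}", f"api-int.{cluster_name}.{base_domain}"):
    try:
        # Resolve the hostname the same way the bootstrap node would.
        print(f"{host} -> {socket.gethostbyname(host)}")
    except socket.gaierror as exc:
        print(f"{host} does not resolve: {exc}")
```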