IBM / cloud-pak-deployer

Configuration-based installation of OpenShift and Cloud Pak for Data/Integration/Watson AIOps on various private and public cloud infrastructure providers. Deployment attempts to achieve the end-state defined in the configuration. If something fails along the way, you only need to restart the process to continue the deployment.
https://ibm.github.io/cloud-pak-deployer/
Apache License 2.0

cp-deployer failed with multiple issues that went away after re-running the scripts. #831

Open barochiarg opened 1 week ago

barochiarg commented 1 week ago

Describe the bug

Multiple minor issues have been observed.

Issue 1: After the ROSA cluster was created, cp-deployer failed to log in to it and kept retrying until the retries were exhausted. After that, I was able to log in to the ROSA cluster myself. It would be good to increase the number of retries.

FAILED - RETRYING: Login to OpenShift ROSA cluster (2 retries left).
FAILED - RETRYING: Login to OpenShift ROSA cluster (1 retries left).

Issue 2: cp-deployer failed with the error message below. The error went away after re-running the script.

TASK [odf-operator : Retrieve default channel for ocs-operator manifest]
task path: /cloud-pak-deployer/automation-roles/40-configure-infra/odf-operator/tasks/main.yml:26
Friday 08 November 2024 08:51:25 +0000 (0:00:00.497) 0:56:40.263
fatal: [localhost]: FAILED! => changed=true
  cmd: oc get packagemanifest ocs-operator -o jsonpath='{.status.defaultChannel}'
  delta: '0:00:00.222956'
  end: '2024-11-08 08:51:25.780681'
  msg: non-zero return code
  rc: 1
  start: '2024-11-08 08:51:25.557725'
  stderr: 'Error from server (NotFound): packagemanifests.packages.operators.coreos.com "ocs-operator" not found'
  stderr_lines:
  stdout: ''
  stdout_lines:

PLAY RECAP *****
localhost : ok=590 changed=88 unreachable=0 failed=1 skipped=265 rescued=0 ignored=0

Friday 08 November 2024 08:51:25 +0000 (0:00:00.544) 0:56:40.807 ***

provision-aws : Waiting for cluster creation to complete, logs are in /home/ec2-user/cpd-status/log/r-wa-d01-create-cluster.log 2821.15s
/cloud-pak-deployer/automation-roles/30-provision-infra/provision-aws/tasks/provision-rosa.yml:37
openshift-login : Login to OpenShift ROSA cluster --------------------- 281.22s
/cloud-pak-deployer/automation-roles/99-generic/openshift/openshift-login/tasks/aws-login-rosa-ocp.yml:39
provision-aws : Create ROSA cluster, logs can be found in /home/ec2-user/cpd-status/log/r-wa-d01-create-cluster.log -- 33.82s
/cloud-pak-deployer/automation-roles/30-provision-infra/provision-aws/tasks/provision-rosa.yml:27
openshift-download-installer : Unpack OpenShift installer -------------- 17.52s
/cloud-pak-deployer/automation-roles/99-generic/openshift/openshift-download-installer/tasks/main.yml:36
aws-download-cli : Unpack aws-cli client installer --------------------- 17.45s
/cloud-pak-deployer/automation-roles/99-generic/aws/aws-download-cli/tasks/main.yml:33
nfs-storage-class : Wait 15 seconds for the dynamic NFS client to deploy -- 15.04s
/cloud-pak-deployer/automation-roles/40-configure-infra/nfs-storage-class/tasks/create-nfs-storage-class.yml:52
openshift-download-installer : Download OpenShift installer "https://mirror.openshift.com/pub/openshift-v4/clients/ocp/latest-4.15/openshift-install-linux.tar.gz" -- 14.60s
/cloud-pak-deployer/automation-roles/99-generic/openshift/openshift-download-installer/tasks/main.yml:24
cpd-cli-download : Unpack cpd-cli from /home/ec2-user/cpd-status/downloads/cpd-cli-linux-amd64.tar.gz -- 13.03s
/cloud-pak-deployer/automation-roles/99-generic/cpd-cli/cpd-cli-download/tasks/main.yml:23
aws-download-cli : Install aws client ----------------------------------- 5.77s
/cloud-pak-deployer/automation-roles/99-generic/aws/aws-download-cli/tasks/main.yml:39
openshift-download-client : Unpack OpenShift client from /home/ec2-user/cpd-status/downloads/openshift-client-linux.tar.gz-4.15 --- 4.94s
/cloud-pak-deployer/automation-roles/99-generic/openshift/openshift-download-client/tasks/main.yml:38

====================================================================================
Deployer FAILED. Check previous messages. If command line is not returned, press

As discussed on Slack, here is the output of the requested commands:

oc get packagemanifest ocs-operator

NAME           CATALOG             AGE
ocs-operator   Red Hat Operators   35m

oc get pods -n openshift-marketplace

NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-k2rkd               1/1     Running   0          5m48s
community-operators-4b9w4               1/1     Running   0          5m49s
marketplace-operator-6c6dccc45d-nflmh   1/1     Running   0          2m2s
redhat-marketplace-rsfzc                1/1     Running   0          6m8s
redhat-operators-v5svc                  1/1     Running   0          5m45s
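
For anyone who wants to script the same check, here is a minimal sketch that scans `oc get pods` tabular output for pods that are not fully ready. The function name `pods_not_ready` is hypothetical and not part of cp-deployer, and the sketch assumes the default column order (NAME, READY, STATUS) shown above. Note that it would also flag pods in `Completed` status, which may be perfectly healthy.

```python
from typing import List


def pods_not_ready(table: str) -> List[str]:
    """Return names of pods that are not Running or whose READY count is incomplete.

    Hypothetical helper: assumes the first three columns are NAME, READY, STATUS.
    """
    lines = [line for line in table.strip().splitlines() if line.strip()]
    problems = []
    for line in lines[1:]:  # skip the header row
        name, ready, status = line.split()[:3]
        ready_count, total = ready.split("/")
        if status != "Running" or ready_count != total:
            problems.append(name)
    return problems


# Rows copied from the `oc get pods -n openshift-marketplace` output in this issue.
SAMPLE = """\
NAME READY STATUS RESTARTS AGE
certified-operators-k2rkd 1/1 Running 0 5m48s
community-operators-4b9w4 1/1 Running 0 5m49s
marketplace-operator-6c6dccc45d-nflmh 1/1 Running 0 2m2s
redhat-marketplace-rsfzc 1/1 Running 0 6m8s
redhat-operators-v5svc 1/1 Running 0 5m45s
"""

print(pods_not_ready(SAMPLE))  # prints []
```

Against a live cluster, the same function could be fed the captured stdout of `oc get pods -n openshift-marketplace`.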

fketelaars commented 1 week ago

I was able to reproduce issue #2 and added a retry to the task.

TASK [odf-operator : Retrieve default channel for ocs-operator manifest] *******
Sunday 10 November 2024  07:51:56 +0000 (0:00:00.434)       0:48:32.472 *******
FAILED - RETRYING: Retrieve default channel for ocs-operator manifest (30 retries left).
FAILED - RETRYING: Retrieve default channel for ocs-operator manifest (29 retries left).
FAILED - RETRYING: Retrieve default channel for ocs-operator manifest (28 retries left).
FAILED - RETRYING: Retrieve default channel for ocs-operator manifest (27 retries left).
FAILED - RETRYING: Retrieve default channel for ocs-operator manifest (26 retries left).
FAILED - RETRYING: Retrieve default channel for ocs-operator manifest (25 retries left).
FAILED - RETRYING: Retrieve default channel for ocs-operator manifest (24 retries left).
FAILED - RETRYING: Retrieve default channel for ocs-operator manifest (23 retries left).
FAILED - RETRYING: Retrieve default channel for ocs-operator manifest (22 retries left).
changed: [localhost]
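
The retry pattern visible in the log above can be illustrated with a minimal Python sketch. This is not the deployer's actual Ansible task: the helper names `run_command`, `retry_until_success`, and `default_channel` are hypothetical, and the attempt count and delay are assumptions chosen for illustration. Only the `oc get packagemanifest` command itself is taken from this issue.

```python
import subprocess
import time
from typing import Callable, Optional, Sequence


def run_command(cmd: Sequence[str]) -> Optional[str]:
    """Run a command; return its stdout on success, or None on a non-zero exit code."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else None


def retry_until_success(
    attempt: Callable[[], Optional[str]],
    max_attempts: int = 30,       # assumption, not the deployer's setting
    delay_seconds: float = 10.0,  # assumption, not the deployer's setting
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call attempt() until it returns a value, at most max_attempts times."""
    for attempt_number in range(1, max_attempts + 1):
        output = attempt()
        if output is not None:
            return output
        if attempt_number < max_attempts:
            sleep(delay_seconds)  # wait before the next try
    raise RuntimeError(f"still failing after {max_attempts} attempts")


def default_channel() -> str:
    """Retry the query from the failing task until the package manifest exists."""
    cmd = ["oc", "get", "packagemanifest", "ocs-operator",
           "-o", "jsonpath={.status.defaultChannel}"]
    return retry_until_success(lambda: run_command(cmd))
```

Calling `default_channel()` requires the `oc` client on the PATH and an active cluster login. The `sleep` parameter is injectable so the loop can be exercised without real waiting.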

For issue #1, I doubled the number of retries based on input from @barochiarg.

fketelaars commented 1 week ago

Issue fixed