IBM / cloud-pak-deployer

Configuration-based installation of OpenShift and Cloud Pak for Data/Integration/Watson AIOps on various private and public cloud infrastructure providers. Deployment attempts to achieve the end-state defined in the configuration. If something fails along the way, you only need to restart the process to continue the deployment.
https://ibm.github.io/cloud-pak-deployer/
Apache License 2.0

WA and WD installation fails. #577

Open barochiarg opened 10 months ago

barochiarg commented 10 months ago

Describe the bug: Watson Assistant installation fails.

To Reproduce: While installing the watson_assistant cartridge using cloud-pak-deployer in an AWS environment, the installation fails with the message below. The deployer also does not create the "openshift-storage" namespace. The same issue may be present for watson-discovery.

TASK [cp4d-cartridge-install : Set up Multicloud Object Gateway (MCG) secrets for watson_assistant in CP4D project cpd, logs are in /home/ec2-user/cpd-status/log/cpd-watson_assistant-setup-mcg.log] *
Thursday 09 November 2023 07:34:20 +0000 (0:00:00.051) 0:26:42.451 ***
fatal: [localhost]: FAILED! => {"changed": true, "cmd": "set -o pipefail\nsetup-mcg \\n --components=watson_assistant \\n --cpd_instance_ns=cpd \\n --noobaa_account_secret=noobaa-admin \\n --noobaa_cert_secret=noobaa-s3-serving-cert | tee /home/ec2-user/cpd-status/log/cpd-watson_assistant-setup-mcg.log\n", "delta": "0:00:00.147544", "end": "2023-11-09 07:34:21.267867", "msg": "non-zero return code", "rc": 1, "start": "2023-11-09 07:34:21.120323", "stderr": "Error from server (NotFound): secrets \"noobaa-admin\" not found", "stderr_lines": ["Error from server (NotFound): secrets \"noobaa-admin\" not found"], "stdout": "Running the setup for the watson_assistant component using the cpd project.", "stdout_lines": ["Running the setup for the watson_assistant component using the cpd project."]}

PLAY RECAP *****
localhost : ok=1235 changed=148 unreachable=0 failed=1 skipped=575 rescued=0 ignored=0

Thursday 09 November 2023 07:34:21 +0000 (0:00:00.411) 0:26:42.862 *****

cp4d-scheduling-service : Run scheduler installation script, output can be found in /home/ec2-user/cpd-status/log/cpd-apply-scheduler.log - 308.72s
cp4d-cluster : Run script to setup instance topology, output can be found in /home/ec2-user/cpd-status/log/cpd-setup-instance-topology.log - 205.80s
cp4d-subscriptions : Run apply-olm command to install cartridge subscriptions, logs are in /home/ec2-user/cpd-status/log/cpd-apply-olm-cartridge-sub.log - 183.72s
cp-fs-cluster-components : Run shell script to apply cluster components, logs are in /home/ec2-user/cpd-status/log/cpd-apply-cluster-components.log - 176.57s
cp4d-catalog-source : Run apply-olm command to create catalog sources, logs are in /home/ec2-user/cpd-status/log/apply-olm-create-catsrc.log - 173.82s
cp4d-catalog-source : Generate preview script to create catalog sources, logs are in /home/ec2-user/cpd-status/log/apply-olm-create-catsrc.log - 102.04s
cp4d-subscriptions : Generate preview script to install cartridge subscriptions, logs are in /home/ec2-user/cpd-status/log/cpd-apply-olm-cartridge-sub.log -- 30.60s
cp4d-cluster : Run apply-cr command to install Cloud Pak for Data platform, logs are in /home/ec2-user/cpd-status/log/cpd-apply-cr-cpd-platform.log -- 24.82s
cp4d-cluster : Run script to authorize instance, output can be found in /home/ec2-user/cpd-status/log/cpd-authorize-instance.log -- 17.93s
cp4d-cluster : Generate preview script to install Cloud Pak for Data platform, logs are in /home/ec2-user/cpd-status/log/cpd-apply-cr-cpd-platform.log -- 15.52s
openshift-download-installer : Unpack OpenShift installer -------------- 15.39s
cpd-cli-download : Unpack cpd-cli from /home/ec2-user/cpd-status/downloads/cpd-cli-linux-amd64.tar.gz -- 12.66s
aws-download-cli : Unpack aws-cli client installer ---------------------- 7.72s
openshift-download-client : Unpack OpenShift client from /home/ec2-user/cpd-status/downloads/openshift-client-linux.tar.gz-4.12 --- 5.20s
openshift-download-client : Unpack OpenShift client from /home/ec2-user/cpd-status/downloads/openshift-client-linux.tar.gz-4.12 --- 3.38s
ibm-pak-download : Extract ibm-pak from /home/ec2-user/cpd-status/downloads/oc-ibm_pak-linux-amd64.tar.gz --- 3.36s
openshift-download-client : Unpack OpenShift client from /home/ec2-user/cpd-status/downloads/openshift-client-linux.tar.gz-4.12 --- 3.26s
cloudctl-download : Unpack cloudctl from /home/ec2-user/cpd-status/downloads/cloudctl-linux-amd64.tar.gz --- 3.03s
cp4d-cluster : Run apply-entitlement command ---------------------------- 2.62s
cp4d-variables : Add versions details from olm-utils -------------------- 2.60s

====================================================================================
Deployer FAILED. Check previous messages. If command line is not returned, press ^C.
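
A quick check for the two missing prerequisites reported above (a sketch; assumes oc is logged in to the cluster):

oc get namespace openshift-storage
oc get secret noobaa-admin -n openshift-storage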

Expected behavior: WA should install successfully.

Desktop (please complete the following information): AWS environment - self-managed and ROSA OpenShift

barochiarg commented 10 months ago

related issue: https://github.com/IBM/cloud-pak-deployer/issues/493

fketelaars commented 10 months ago

We found that MCG does not get deployed when STS is used for authentication:

time="2023-11-20T08:11:50Z" level=info msg="✅ RPC: system.update_endpoint_group() Response OK: took 0.3ms"
time="2023-11-20T08:11:50Z" level=info msg="✈️  RPC: redirector.register_to_cluster() Request: <nil>"
time="2023-11-20T08:11:50Z" level=info msg="✅ RPC: redirector.register_to_cluster() Response OK: took 0.2ms"
time="2023-11-20T08:11:50Z" level=info msg="❌ Not Found: BackingStore \"noobaa-default-backing-store\"\n"
time="2023-11-20T08:11:50Z" level=info msg="CredentialsRequest \"noobaa-aws-cloud-creds\" created. Creating default backing store on AWS objectstore" func=ReconcileDefaultBackingStore sys=openshift-storage/noobaa
time="2023-11-20T08:11:50Z" level=info msg="❌ Not Found:  \"noobaa-aws-cloud-creds-secret\"\n"
time="2023-11-20T08:11:50Z" level=info msg="Secret \"noobaa-aws-cloud-creds-secret\" was not created yet by cloud-credentials operator. retry on next reconcile.." sys=openshift-storage/noobaa
time="2023-11-20T08:11:50Z" level=info msg="SetPhase: temporary error during phase \"Configuring\"" sys=openshift-storage/noobaa
time="2023-11-20T08:11:50Z" level=warning msg="⏳ Temporary Error: cloud credentials secret \"noobaa-aws-cloud-creds-secret\" is not ready yet" sys=openshift-storage/noobaa
time="2023-11-20T08:11:50Z" level=info msg="UpdateStatus: Done generation 2" sys=openshift-storage/noobaa
fketelaars commented 10 months ago

After some research, this turns out to be the same issue as #310. When trying to provision ODF, the default backing store is not created and the CredentialsRequest does not result in the creation of a secret for the NooBaa operator.
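
The symptom can be confirmed by inspecting the CredentialsRequest and checking whether the expected secret exists (a sketch using the resource names from the logs above):

oc get credentialsrequest noobaa-aws-cloud-creds -n openshift-storage -o yaml
oc get secret noobaa-aws-cloud-creds-secret -n openshift-storage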

fketelaars commented 8 months ago

Steps to reproduce the issue:

Manually provision an OpenShift cluster on AWS with temporary credentials (STS).

Set environment variables

export AWS_REGION=eu-central-1
export AWS_CFG_DIR=~/aws
export OCP_CLUSTER_NAME=aws-sts
export OCP_DOMAIN_NAME=deployer-demo.eu

export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_access_key

Create directories

mkdir -pv $AWS_CFG_DIR

Download installer and client

mkdir -pv $AWS_CFG_DIR/downloads

curl -sLo $AWS_CFG_DIR/downloads/openshift-install-linux.tar.gz https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable-4.12/openshift-install-linux.tar.gz
mkdir -p ~/bin
tar xvzf ${AWS_CFG_DIR}/downloads/openshift-install-linux.tar.gz -C ~/bin/

Prepare permanent credentials

If you want to run the process multiple times, it is best to have a script that resets the AWS credentials to the permanent ones, after which you can generate new temporary credentials.

cat << EOF > $AWS_CFG_DIR/aws-reset-creds.sh
export KUBECONFIG=${AWS_CFG_DIR}/${OCP_CLUSTER_NAME}/auth/kubeconfig
export AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY
unset AWS_SESSION_TOKEN
EOF

Reset environment

rm -rf  $AWS_CFG_DIR/$OCP_CLUSTER_NAME
mkdir -pv  $AWS_CFG_DIR/$OCP_CLUSTER_NAME
source $AWS_CFG_DIR/aws-reset-creds.sh

Generate AWS STS token

printf "\nexport AWS_ACCESS_KEY_ID=%s\nexport AWS_SECRET_ACCESS_KEY=%s\nexport AWS_SESSION_TOKEN=%s\n" $(aws sts assume-role \
--role-arn arn:aws:iam::872255850422:role/fk-sts-role \
--role-session-name OCPInstall \
--query "Credentials.[AccessKeyId,SecretAccessKey,SessionToken]" \
--output text) > /tmp/sts-credentials.sh

source /tmp/sts-credentials.sh
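
To verify that the temporary credentials are active (optional check; the returned Arn should reference the assumed role):

aws sts get-caller-identity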

Create Cloud Credentials Operator resources

RELEASE_IMAGE=$(openshift-install version | awk '/release image/ {print $3}') && echo "Release image: ${RELEASE_IMAGE}"

CCO_IMAGE=$(oc adm release info --image-for='cloud-credential-operator' $RELEASE_IMAGE -a /tmp/ocp_pullsecret.json) && echo $CCO_IMAGE

pushd ~/bin
oc image extract $CCO_IMAGE --file="/usr/bin/ccoctl" -a /tmp/ocp_pullsecret.json
popd
chmod 775 ~/bin/ccoctl

oc adm release extract --credentials-requests --cloud=aws --to=${AWS_CFG_DIR}/credrequests --from=$RELEASE_IMAGE

ccoctl aws create-all --name=${OCP_CLUSTER_NAME} --region=${AWS_REGION} --credentials-requests-dir=${AWS_CFG_DIR}/credrequests --output-dir=${AWS_CFG_DIR}/credoutput
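
The generated artifacts can be inspected before continuing (optional check; ccoctl writes the manifests and TLS material into the output directory, which is copied into the cluster directory later):

ls ${AWS_CFG_DIR}/credoutput/manifests ${AWS_CFG_DIR}/credoutput/tls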

Prepare OpenShift installation

mkdir -p ${AWS_CFG_DIR}/${OCP_CLUSTER_NAME}

cat << EOF > ${AWS_CFG_DIR}/${OCP_CLUSTER_NAME}/install-config.yaml
apiVersion: v1
baseDomain: ${OCP_DOMAIN_NAME}
credentialsMode: Manual
metadata:
  name: ${OCP_CLUSTER_NAME}

controlPlane:   
  hyperthreading: Enabled 
  name: master
  platform:
    aws:
      type: m5.xlarge
      zones:
      - ${AWS_REGION}a
  replicas: 3

compute: 
- hyperthreading: Enabled 
  name: worker
  platform:
    aws:
      type: m5.4xlarge
      zones:
      - ${AWS_REGION}a
  replicas: 3

networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16

platform:
  aws:
    region: ${AWS_REGION}

fips: false
pullSecret: '$(cat /tmp/ocp_pullsecret.json)'
sshKey: $(cat ~/.ssh/id_rsa.pub)
EOF

pushd ${AWS_CFG_DIR}/${OCP_CLUSTER_NAME}
openshift-install create manifests
popd

cp ${AWS_CFG_DIR}/credoutput/manifests/* ${AWS_CFG_DIR}/${OCP_CLUSTER_NAME}/manifests
cp -r ${AWS_CFG_DIR}/credoutput/tls ${AWS_CFG_DIR}/${OCP_CLUSTER_NAME}

Create OpenShift cluster

openshift-install create cluster --dir=${AWS_CFG_DIR}/${OCP_CLUSTER_NAME} --log-level=debug

Connect to OpenShift

export KUBECONFIG=${AWS_CFG_DIR}/${OCP_CLUSTER_NAME}/auth/kubeconfig

Install OpenShift Storage operator

oc create ns openshift-storage

cat << EOF | oc apply -f -
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-storage
  namespace: openshift-storage
spec:
  targetNamespaces:
  - openshift-storage
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/ocs-operator.openshift-storage: ""
  name: odf-operator
  namespace: openshift-storage
spec:
  channel: stable-4.12
  installPlanApproval: Automatic
  name: odf-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

Now wait until the OpenShift Data Foundation operator is ready.

watch "oc get csv -n openshift-storage -l operators.coreos.com/ocs-operator.openshift-storage --no-headers -o custom-columns='name:metadata.name,phase:status.phase'"

Patch OpenShift console

oc patch console.operator cluster \
    -n openshift-storage \
    --type json \
    -p '[{"op": "add", "path": "/spec/plugins", "value": ["odf-console"]}]'

Create storage cluster

cat << EOF | oc apply -f -
---
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  annotations:
    uninstall.ocs.openshift.io/cleanup-policy: delete
    uninstall.ocs.openshift.io/mode: graceful
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  multiCloudGateway:
    dbStorageClassName: gp3-csi
    reconcileStrategy: standalone
EOF

Wait for the StorageCluster to reconcile. It never becomes ready, because it fails to create the backing store.
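
The stalled reconcile can be observed with (a sketch):

watch "oc get storagecluster,noobaa -n openshift-storage"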

Create backingstore manually

cat << EOF | oc apply -f -
apiVersion: noobaa.io/v1alpha1
kind: BackingStore
metadata:
  name: noobaa-default-backing-store
  namespace: openshift-storage
spec:
  pvPool:
    numVolumes: 1
    resources:
      requests:
        storage: 100Gi
    secret: {}
    storageClass: gp3-csi
  type: pv-pool
EOF

Go to OpenShift console

echo "Go to console: https://$(oc get route --no-headers -n openshift-console console -o custom-columns='host:.spec.host')"
echo "Log in as kubeadmin, password $(cat ${AWS_CFG_DIR}/${OCP_CLUSTER_NAME}/auth/kubeadmin-password)"

Destroy OpenShift cluster

openshift-install destroy cluster --dir=${AWS_CFG_DIR}/${OCP_CLUSTER_NAME} --log-level=debug
fketelaars commented 8 months ago

We found a way to work around the current issue by creating the backing store that is expected by the StorageCluster. The backing store is based on a PVC instead of AWS S3. This is not ideal, but it allows the provisioning of MCG to proceed.
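
Whether the workaround has taken effect can be verified by checking the backing store phase (a sketch; the phase should eventually report Ready):

oc get backingstore noobaa-default-backing-store -n openshift-storage -o jsonpath='{.status.phase}{"\n"}'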

fketelaars commented 8 months ago

This has been resolved by using OpenShift 4.14.

fketelaars commented 8 months ago

Issue reopened. The StorageCluster does reach the Ready state in OpenShift 4.14, but the BackingStore stays in the BackingStorePhaseRejected state and no bucket is created for the cluster, meaning that any attempt to access the bucket fails.
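
The rejection reason can be inspected with oc describe (a sketch):

oc describe backingstore noobaa-default-backing-store -n openshift-storage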

Need to make the following changes:

  1. Create the namespace with the correct label:

    apiVersion: v1
    kind: Namespace
    metadata:
      labels:
        openshift.io/cluster-monitoring: "true"
      name: openshift-storage
  2. Update the CredentialsRequest to work with a ServiceAccount:

    oc get credentialsrequest -n openshift-storage noobaa-aws-cloud-creds -o yaml > nooba-credreq.yaml
    NOOBA_BUCKET=$(cat nooba-credreq.yaml | grep arn:aws:s3::: | head -1 | awk -F: '{print $7}')
    # add the following at the end of the spec in nooba-credreq.yaml
    #   serviceAccountNames:
    #   - noobaa
    ccoctl aws create-iam-roles --name="${OCP_CLUSTER_NAME}" --region="${AWS_REGION}" --credentials-requests-dir=. --identity-provider-arn=arn:aws:iam::872255850422:oidc-provider/${OCP_CLUSTER_NAME}-oidc.s3.${AWS_REGION}.amazonaws.com
    aws s3api create-bucket --bucket ${NOOBA_BUCKET} --region ${AWS_REGION} --create-bucket-configuration LocationConstraint=${AWS_REGION}
  3. Create the BackingStore, now of type aws-s3 so that it targets the bucket created in the previous step:

    cat <<EOF | oc apply -f -
    apiVersion: noobaa.io/v1alpha1
    kind: BackingStore
    metadata:
      finalizers:
      - noobaa.io/finalizer
      labels:
        app: noobaa
      name: noobaa-default-backing-store
      namespace: openshift-storage
    spec:
      awsS3:
        awsSTSRoleARN: arn:aws:iam::872255850422:oidc-provider/${OCP_CLUSTER_NAME}-oidc.s3.${AWS_REGION}.amazonaws.com
        targetBucket: ${NOOBA_BUCKET}
        secret:
          name: noobaa-aws-cloud-creds-secret
          namespace: openshift-storage
      type: aws-s3
    EOF
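
After applying these changes, the backing store can be checked for readiness (a sketch; assumes an oc client recent enough to support jsonpath waits):

oc wait backingstore/noobaa-default-backing-store -n openshift-storage --for=jsonpath='{.status.phase}'=Ready --timeout=300s
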
barochiarg commented 3 weeks ago

Hi @fketelaars, is there any update on the MCG issue? There are several watsonx (WA, watsonx.ai, watsonx.data) PoC requests on HCP, and due to the MCG issue we are not able to move forward on these.