IBM / cloud-pak-deployer

Configuration-based installation of OpenShift and Cloud Pak for Data/Integration/Watson AIOps on various private and public cloud infrastructure providers. Deployment attempts to achieve the end-state defined in the configuration. If something fails along the way, you only need to restart the process to continue the deployment.
https://ibm.github.io/cloud-pak-deployer/
Apache License 2.0

AWS EFS Issues #799

Closed Alan111S closed 1 month ago

Alan111S commented 1 month ago

Describe the bug
On AWS EFS, dynamic provisioning is broken: new PVCs stay stuck in the Pending state and the associated PVs are never created.

To Reproduce
The cluster was originally set up using the Deployer. While running a WKC install, I noticed that the wkc-db2u-backup PV was not created and the associated PVC was stuck in the Pending state.

Expected behavior
The PVC should bind to a dynamically provisioned PV.
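
To see which claims are affected, the PVC listing can be filtered by status. This is a minimal sketch, not part of the original report; the helper name pending_pvcs is hypothetical, and it assumes the default column order of `oc get pvc -A --no-headers` (NAMESPACE, NAME, STATUS, ...).

```shell
# Print "namespace/name" for every PVC whose STATUS column is Pending.
# Reads the output of `oc get pvc -A --no-headers` on stdin.
pending_pvcs() {
  awk '$3 == "Pending" { print $1 "/" $2 }'
}

# Example usage against a live cluster (assumes you are logged in with oc):
#   oc get pvc -A --no-headers | pending_pvcs
```

Running `oc describe pvc <name> -n <namespace>` on any claim it reports shows the events that explain why the claim is still waiting.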

Errors in the nfs-client-provisioner log:

E0930 11:13:33.328981 1 reflector.go:380] pkg/mod/k8s.io/client-go@v0.18.0/tools/cache/reflector.go:125: Failed to watch *v1.StorageClass: unknown (get storageclasses.storage.k8s.io)
E0930 11:13:36.726137 1 reflector.go:178] pkg/mod/k8s.io/client-go@v0.18.0/tools/cache/reflector.go:125: Failed to list *v1.PersistentVolume: persistentvolumes is forbidden: User "system:serviceaccount:cp-nfs:nfs-client-provisioner" cannot list resource "persistentvolumes" in API group "" at the cluster scope
E0930 11:13:54.394213 1 reflector.go:178] pkg/mod/k8s.io/client-go@v0.18.0/tools/cache/reflector.go:125: Failed to list *v1.PersistentVolumeClaim: persistentvolumeclaims is forbidden: User "system:serviceaccount:cp-nfs:nfs-client-provisioner" cannot list resource "persistentvolumeclaims" in API group "" at the cluster scope
E0930 11:14:08.784226 1 reflector.go:178] pkg/mod/k8s.io/client-go@v0.18.0/tools/cache/reflector.go:125: Failed to list *v1.PersistentVolume: persistentvolumes is forbidden: User "system:serviceaccount:cp-nfs:nfs-client-provisioner" cannot list resource "persistentvolumes" in API group "" at the cluster scope
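
The "is forbidden" lines above all point at missing cluster-scoped RBAC permissions for the provisioner's service account. A small filter like the following can condense a long log into the distinct verb/resource pairs being denied. It is a sketch only; the function name summarize_forbidden is hypothetical and it relies on the message wording shown in the log above.

```shell
# Summarize RBAC denials from provisioner log text read on stdin.
# Prints "<verb> <resource> <count>" for each distinct denied pair.
summarize_forbidden() {
  grep -o 'cannot [a-z]* resource "[a-z]*"' \
    | sed -E 's/cannot ([a-z]+) resource "([a-z]+)"/\1 \2/' \
    | sort | uniq -c | awk '{ print $2, $3, $1 }'
}

# Example usage (the deployment name is an assumption based on the
# service account name in the log):
#   oc logs deployment/nfs-client-provisioner -n cp-nfs | summarize_forbidden
```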

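Workaround
Deleted the efs-nfs-client storage class and recreated it with cpd-cli, which also deployed a new provisioner in the nfs-provisioner namespace: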

oc delete sc efs-nfs-client --force
export EFS_LOCATION=fs-0cabc99dad34ddf03.efs.ap-southeast-2.amazonaws.com
export EFS_PATH=/
export PROJECT_NFS_PROVISIONER=nfs-provisioner
export EFS_STORAGE_CLASS=efs-nfs-client
export NFS_IMAGE=registry.k8s.io/sig-storage/nfs-subdir-external-provisioner:v4.0.2
cpd-cli manage setup-nfs-provisioner --nfs_server=${EFS_LOCATION} --nfs_path=${EFS_PATH} --nfs_provisioner_ns=${PROJECT_NFS_PROVISIONER} --nfs_storageclass_name=${EFS_STORAGE_CLASS} --nfs_provisioner_image=${NFS_IMAGE}

After this, the new nfs-client-provisioner started up and provisioned volumes successfully, as its log shows:

I0930 10:34:28.593161 1 leaderelection.go:242] attempting to acquire leader lease nfs-provisioner/k8s-sigs.io-nfs-subdir-external-provisioner...
I0930 10:34:46.216229 1 leaderelection.go:252] successfully acquired lease nfs-provisioner/k8s-sigs.io-nfs-subdir-external-provisioner
I0930 10:34:46.216281 1 event.go:278] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"nfs-provisioner", Name:"k8s-sigs.io-nfs-subdir-external-provisioner", UID:"82a3045d-c92e-4958-8c83-8ef093ce82f4", APIVersion:"v1", ResourceVersion:"58507112", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' nfs-client-provisioner-66f9c9b666-mfzbw_1f99018b-10fe-4e9a-a319-b7a636bf510d became leader
I0930 10:34:46.216404 1 controller.go:820] Starting provisioner controller k8s-sigs.io/nfs-subdir-external-provisioner_nfs-client-provisioner-66f9c9b666-mfzbw_1f99018b-10fe-4e9a-a319-b7a636bf510d!
I0930 10:34:46.316668 1 controller.go:869] Started provisioner controller k8s-sigs.io/nfs-subdir-external-provisioner_nfs-client-provisioner-66f9c9b666-mfzbw_1f99018b-10fe-4e9a-a319-b7a636bf510d!
I0930 10:40:56.463145 1 controller.go:1317] provision "dm/datastage-ibm-datastage-ds-migration-pvc" class "efs-nfs-client": started
I0930 10:40:56.463275 1 controller.go:1317] provision "dm/ds-px-default-ibm-datastage-px-storage-pvc" class "efs-nfs-client": started
I0930 10:40:56.464471 1 controller.go:1317] provision "dm/datastage-ibm-datastage-ds-storage-pvc" class "efs-nfs-client": started
I0930 10:40:56.473270 1 event.go:278] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"dm", Name:"datastage-ibm-datastage-ds-migration-pvc", UID:"e367180f-0fac-4241-86dd-b04460b78144", APIVersion:"v1", ResourceVersion:"58512116", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "dm/datastage-ibm-datastage-ds-migration-pvc"
I0930 10:40:56.473335 1 event.go:278] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"dm", Name:"datastage-ibm-datastage-ds-storage-pvc", UID:"9dd87d27-e9e9-44b2-92af-cae22cecac24", APIVersion:"v1", ResourceVersion:"58512121", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "dm/datastage-ibm-datastage-ds-storage-pvc"
I0930 10:40:56.473553 1 event.go:278] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"dm", Name:"ds-px-default-ibm-datastage-px-storage-pvc", UID:"2641a4a2-cadf-4d9a-98b3-7c1a3b74f42d", APIVersion:"v1", ResourceVersion:"58512117", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "dm/ds-px-default-ibm-datastage-px-storage-pvc"
I0930 10:40:56.494499 1 controller.go:1420] provision "dm/ds-px-default-ibm-datastage-px-storage-pvc" class "efs-nfs-client": volume "pvc-2641a4a2-cadf-4d9a-98b3-7c1a3b74f42d" provisioned
I0930 10:40:56.494553 1 controller.go:1437] provision "dm/ds-px-default-ibm-datastage-px-storage-pvc" class "efs-nfs-client": succeeded
I0930 10:40:56.494560 1 volume_store.go:212] Trying to save persistentvolume "pvc-2641a4a2-cadf-4d9a-98b3-7c1a3b74f42d"
I0930 10:40:56.503122 1 volume_store.go:219] persistentvolume "pvc-2641a4a2-cadf-4d9a-98b3-7c1a3b74f42d" saved
I0930 10:40:56.503276 1 event.go:278] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"dm", Name:"ds-px-default-ibm-datastage-px-storage-pvc", UID:"2641a4a2-cadf-4d9a-98b3-7c1a3b74f42d", APIVersion:"v1", ResourceVersion:"58512117", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-2641a4a2-cadf-4d9a-98b3-7c1a3b74f42d
I0930 10:40:56.504922 1 controller.go:1420] provision "dm/datastage-ibm-datastage-ds-storage-pvc" class "efs-nfs-client": volume "pvc-9dd87d27-e9e9-44b2-92af-cae22cecac24" provisioned
I0930 10:40:56.504969 1 controller.go:1437] provision "dm/datastage-ibm-datastage-ds-storage-pvc" class "efs-nfs-client": succeeded
I0930 10:40:56.505019 1 volume_store.go:212] Trying to save persistentvolume "pvc-9dd87d27-e9e9-44b2-92af-cae22cecac24"
I0930 10:40:56.512944 1 controller.go:1420] provision "dm/datastage-ibm-datastage-ds-migration-pvc" class "efs-nfs-client": volume "pvc-e367180f-0fac-4241-86dd-b04460b78144" provisioned
I0930 10:40:56.512985 1 controller.go:1437] provision "dm/datastage-ibm-datastage-ds-migration-pvc" class "efs-nfs-client": succeeded
I0930 10:40:56.512992 1 volume_store.go:212] Trying to save persistentvolume "pvc-e367180f-0fac-4241-86dd-b04460b78144"
I0930 10:40:56.515080 1 volume_store.go:219] persistentvolume "pvc-9dd87d27-e9e9-44b2-92af-cae22cecac24" saved
I0930 10:40:56.515168 1 event.go:278] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"dm", Name:"datastage-ibm-datastage-ds-storage-pvc", UID:"9dd87d27-e9e9-44b2-92af-cae22cecac24", APIVersion:"v1", ResourceVersion:"58512121", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-9dd87d27-e9e9-44b2-92af-cae22cecac24
I0930 10:40:56.521786 1 volume_store.go:219] persistentvolume "pvc-e367180f-0fac-4241-86dd-b04460b78144" saved
I0930 10:40:56.521828 1 event.go:278] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"dm", Name:"datastage-ibm-datastage-ds-migration-pvc", UID:"e367180f-0fac-4241-86dd-b04460b78144", APIVersion:"v1", ResourceVersion:"58512116", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-e367180f-0fac-4241-86dd-b04460b78144
I0930 10:42:05.935488 1 controller.go:1450] delete "pvc-cfdf5de9-4043-4a23-8181-60df8244e827": started
W0930 10:42:05.943061 1 provisioner.go:146] path /persistentvolumes/nfs-provisioner-sc-test-pvc-pvc-cfdf5de9-4043-4a23-8181-60df8244e827 does not exist, deletion skipped
I0930 10:42:05.943208 1 controller.go:1478] delete "pvc-cfdf5de9-4043-4a23-8181-60df8244e827": volume deleted
I0930 10:42:05.954267 1 controller.go:1524] delete "pvc-cfdf5de9-4043-4a23-8181-60df8244e827": persistentvolume deleted
I0930 10:42:05.954310 1 controller.go:1526] delete "pvc-cfdf5de9-4043-4a23-8181-60df8244e827": succeeded
Alan111S commented 1 month ago

I checked another cluster and it has the same issue. My guess is that something changed in OpenShift that stops the storage class from working properly.

Alan111S commented 1 month ago

Old storage class settings:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-nfs-client
  uid: 485f6bf9-c23c-4194-b3c2-331d45b3cc90
  resourceVersion: '31155'
  creationTimestamp: '2024-07-23T04:54:51Z'
  managedFields:
    - manager: kubectl-create
      operation: Update
      apiVersion: storage.k8s.io/v1
      time: '2024-07-23T04:54:51Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:parameters':
          .: {}
          'f:archiveOnDelete': {}
        'f:provisioner': {}
        'f:reclaimPolicy': {}
        'f:volumeBindingMode': {}
provisioner: nfs-storage
parameters:
  archiveOnDelete: 'false'
reclaimPolicy: Delete
volumeBindingMode: Immediate

New storage class settings after recreating it with cpd-cli manage setup-nfs-provisioner:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-nfs-client
  uid: 4b71e37a-382c-4636-8d4b-815596c31146
  resourceVersion: '58512091'
  creationTimestamp: '2024-09-30T10:40:54Z'
  managedFields:
    - manager: OpenAPI-Generator
      operation: Update
      apiVersion: storage.k8s.io/v1
      time: '2024-09-30T10:40:54Z'
      fieldsType: FieldsV1
      fieldsV1:
        'f:parameters':
          .: {}
          'f:archiveOnDelete': {}
        'f:provisioner': {}
        'f:reclaimPolicy': {}
        'f:volumeBindingMode': {}
provisioner: k8s-sigs.io/nfs-subdir-external-provisioner
parameters:
  archiveOnDelete: 'false'
reclaimPolicy: Delete
volumeBindingMode: Immediate

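Apart from generated metadata (uid, resourceVersion, timestamps and the managedFields manager), the only difference is that the provisioner is set to 'k8s-sigs.io/nfs-subdir-external-provisioner' instead of 'nfs-storage'.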
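
A storage class only works if its provisioner field matches the name the running provisioner registers under; for nfs-subdir-external-provisioner that name comes from the PROVISIONER_NAME environment variable on its deployment. The sketch below compares the two. The function names are hypothetical, and the namespace and deployment name passed in are assumptions to adjust for your cluster.

```shell
# Provisioner name recorded on a storage class.
sc_provisioner() {
  oc get storageclass "$1" -o jsonpath='{.provisioner}'
}

# PROVISIONER_NAME env value on the provisioner deployment (namespace, name).
deploy_provisioner_name() {
  oc get deployment "$2" -n "$1" \
    -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="PROVISIONER_NAME")].value}'
}

# Succeeds only when both names are non-empty and identical.
provisioner_matches() {
  sc="$(sc_provisioner "$1")"
  dep="$(deploy_provisioner_name "$2" "$3")"
  echo "storageclass=$sc deployment=$dep"
  [ -n "$sc" ] && [ "$sc" = "$dep" ]
}

# Example usage (namespace and deployment name are assumptions):
#   provisioner_matches efs-nfs-client cp-nfs nfs-client-provisioner
```

A custom name such as 'nfs-storage' is valid as long as the deployment uses the same value, so a matching pair rules the name out as the cause.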

Alan111S commented 1 month ago

! RESOLVED !

I found that I had added another NFS storage namespace, which overwrote the ClusterRoles and ClusterRoleBindings originally set up by the Deployer. As a result, the original service account lost its permissions to create PVs.
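
For anyone hitting the same symptoms, the lost permissions can be confirmed by impersonating the provisioner's service account with `oc auth can-i`. This is a minimal sketch: the function name check_provisioner_rbac is hypothetical, the service account is the one named in the error log above, and the verb/resource list is an illustrative subset rather than the full role the provisioner needs.

```shell
# Report which cluster-scoped permissions a service account holds.
# Returns non-zero if any checked permission is missing.
check_provisioner_rbac() {
  sa="$1"
  missing=0
  # Illustrative subset of permissions, written as verb:resource.
  for pair in list:persistentvolumes create:persistentvolumes \
              list:persistentvolumeclaims list:storageclasses.storage.k8s.io; do
    verb="${pair%%:*}"
    resource="${pair#*:}"
    if [ "$(oc auth can-i "$verb" "$resource" --as="$sa" 2>/dev/null)" = "yes" ]; then
      echo "ok: $verb $resource"
    else
      echo "MISSING: $verb $resource"
      missing=1
    fi
  done
  return "$missing"
}

# Example usage with the service account from the error log:
#   check_provisioner_rbac system:serviceaccount:cp-nfs:nfs-client-provisioner
```

If permissions are reported missing, `oc get clusterrolebinding -o wide` shows which service accounts and namespaces the existing bindings currently point at.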