NetApp / trident

Storage orchestrator for containers
Apache License 2.0

TridentBackendConfig doesn't get deleted #876

Open sontivr opened 7 months ago

sontivr commented 7 months ago

Describe the bug

Hello,

I am trying to test an FSxONTAP filesystem with the iSCSI protocol for persistent volumes, to deploy the Victoria Metrics time-series database into an EKS cluster. I am following Run containerized applications efficiently using Amazon FSx for NetApp ONTAP and Amazon EKS with some support from AWS. At some point, I tried to delete the TridentBackendConfig to start all over again. It seems to get stuck in the Deleting phase forever. The documentation does say that it stays in the Deleting phase while it has dependent objects. I have uninstalled the workload and tried to delete the PVs/PVCs created using this tbc, but it didn't help: the PVCs got deleted, but the PVs got stuck in the Terminating state.

What else is included in the backend components? Should I have to delete the FSxONTAP filesystem itself to be able to clean up the tbc? What if I can't afford to lose my persistent volumes? Is FSxONTAP+iSCSI recommended for workloads like the Victoria Metrics database deployed into EKS clusters?

kt get tbc
NAME                    BACKEND NAME            BACKEND UUID                           PHASE      STATUS
backend-fsx-ontap-san   backend-fsx-ontap-san   949563cb-6717-4455-a778-7fb16c906630   Deleting   Success

kt get tbc backend-fsx-ontap-san -o yaml
apiVersion: trident.netapp.io/v1
kind: TridentBackendConfig
metadata:
  creationTimestamp: "2023-12-03T20:07:27Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2023-12-04T21:17:33Z"
  finalizers:
  - trident.netapp.io
  generation: 2
  name: backend-fsx-ontap-san
  namespace: trident
  resourceVersion: "1049263"
  uid: d835c761-9317-4171-bc22-540b9d5ce864
spec:
  credentials:
    name: backend-fsx-ontap-san-secret
  managementLIF: 198.19.255.172
  storageDriverName: ontap-san
  svm: ekssvm
  version: 1
status:
  backendInfo:
    backendName: backend-fsx-ontap-san
    backendUUID: 949563cb-6717-4455-a778-7fb16c906630
  deletionPolicy: delete
  lastOperationStatus: Success
  message: 'Backend is in a deleting state, cannot proceed with the TridentBackendConfig
    deletion. '
  phase: Deleting

ko get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS        CLAIM                                                                 STORAGECLASS      REASON   AGE
pvc-0462c1f0-0a54-43c5-8b1b-7bd3fa6fb205   50Gi       RWO            Delete           Terminating   observability/mysql-volume                                            fsx-basic-block            3d2h
pvc-d2ea4c54-e23d-4a95-b35e-68fd85989937   50Gi       RWO            Delete           Terminating   observability/vmstorage-volume-victoria-metrics-cluster-vmstorage-0   fsx-basic-block            2d21h
pvc-ef602c9c-4a27-4d6d-a542-55e470d2553f   50Gi       RWO            Delete           Terminating   observability/vmstorage-volume-victoria-metrics-cluster-vmstorage-1   fsx-basic-block            2d21h

Environment

To Reproduce kubectl delete tbc backend-fsx-ontap-san

Expected behavior: backend-fsx-ontap-san should be deleted.


wonderland commented 7 months ago

The PVs stuck in Terminating are probably the dependency that keeps the TridentBackend from deleting. Can you do a kubectl describe on one of them to see if there is anything helpful on why they are stuck?

Besides that, Trident supports multiple backends in parallel. So even while the current one is still in the Deleting state, you can just add a new backend (or several, if you like).
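For what it's worth, a minimal sketch of what a second backend config could look like, reusing the values already visible in this thread (the name backend-fsx-ontap-san-2 is made up, and reusing the existing credentials secret is an assumption):

apiVersion: trident.netapp.io/v1
kind: TridentBackendConfig
metadata:
  name: backend-fsx-ontap-san-2        # hypothetical name, must differ from the stuck TBC
  namespace: trident
spec:
  version: 1
  storageDriverName: ontap-san
  managementLIF: 198.19.255.172        # same FSxONTAP management LIF as the existing backend
  svm: ekssvm
  credentials:
    name: backend-fsx-ontap-san-secret # assumes the existing secret can be reused

Applying it with kubectl apply -f should register a new backend alongside the one stuck in Deleting.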

sontivr commented 7 months ago

Thanks for looking into it @wonderland. I don't see anything popping out from the describe output. I did notice that creating another backend with a different name does work. It is just that leaving some objects in a hung state makes me nervous about the health of the system.

k describe pv pvc-d2ea4c54-e23d-4a95-b35e-68fd85989937 
Name:            pvc-d2ea4c54-e23d-4a95-b35e-68fd85989937
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: csi.trident.netapp.io
                 volume.kubernetes.io/provisioner-deletion-secret-name: 
                 volume.kubernetes.io/provisioner-deletion-secret-namespace: 
Finalizers:      [external-attacher/csi-trident-netapp-io]
StorageClass:    fsx-basic-block
Status:          Terminating (lasts 3d5h)
Claim:           observability/vmstorage-volume-victoria-metrics-cluster-vmstorage-0
Reclaim Policy:  Delete
Access Modes:    RWO
VolumeMode:      Filesystem
Capacity:        50Gi
Node Affinity:   <none>
Message:         
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            csi.trident.netapp.io
    FSType:            ext4
    VolumeHandle:      pvc-d2ea4c54-e23d-4a95-b35e-68fd85989937
    ReadOnly:          false
    VolumeAttributes:      backendUUID=949563cb-6717-4455-a778-7fb16c906630
                           internalName=trident_pvc_d2ea4c54_e23d_4a95_b35e_68fd85989937
                           name=pvc-d2ea4c54-e23d-4a95-b35e-68fd85989937
                           protocol=block
                           storage.kubernetes.io/csiProvisionerIdentity=1701633261584-3547-csi.trident.netapp.io
Events:                <none>

amej commented 6 months ago

Every Kubernetes PV has an associated "tvol" (TridentVolume) custom resource created in the trident namespace. "oc describe tvol ..." could give you hints. Another place to look is the trident-controller logs.

jamessevener commented 6 months ago

That finalizer (external-attacher/csi-trident-netapp-io) would be what's holding it up.
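That finalizer is normally cleared by the CSI external-attacher once the volume is fully detached, so a quick check (a sketch, using the PV name from the describe output above) is whether a stale VolumeAttachment still references it:

kubectl get volumeattachments | grep pvc-d2ea4c54-e23d-4a95-b35e-68fd85989937
kubectl describe volumeattachment <name-from-previous-command>

If one is still hanging around, cleaning it up is often enough for the attacher to drop the PV finalizer on its own.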

balaramesh commented 5 months ago

@sontivr it looks like for some reason a PV was stranded, and that is holding you back from deleting the TBC. As @wonderland mentioned, you could just go ahead and create a new TBC to get around this. To clean the old TBC up, you will need to remove the finalizer (@jamessevener, thank you :)). Before you do that, please make sure that this PV is not associated with a PVC and is not being used by a workload. That should help you resolve your issue.
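For completeness, a sketch of what removing that finalizer typically looks like, once you have confirmed the PV is no longer claimed or mounted anywhere (patching finalizers bypasses normal cleanup, so treat it as a last resort):

# strip the finalizers from the stuck PV (only after confirming nothing uses it)
kubectl patch pv pvc-d2ea4c54-e23d-4a95-b35e-68fd85989937 --type=merge -p '{"metadata":{"finalizers":null}}'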