dell / csm

Dell Container Storage Modules (CSM)
Apache License 2.0

[BUG]: CSI-PowerFlex CSM object stays stuck in failed state when driver deployment succeeds #1137

Closed jooseppi-luna closed 9 months ago

jooseppi-luna commented 9 months ago

Bug Description

While running the csm-operator e2e tests for the CSM v1.9.2 release, we found that the CSM object associated with the PowerFlex driver intermittently stays stuck in the Failed state even after all driver pods reach the Running state.

Logs

`[root@master-1-Zaglt7mQUY8Wg e2e]# k describe csm -n test-vxflexos test-vxflexos
Name: test-vxflexos
Namespace: test-vxflexos
Labels:
Annotations: storage.dell.com/CSMOperatorConfigVersion: v2.9.1
             storage.dell.com/CSMVersion: v1.9.2
             storage.dell.com/PreviouslyAppliedConfiguration: {"kind":"ContainerStorageModule","apiVersion":"storage.dell.com/v1","metadata":{"name":"test-vxflexos","namespace":"test-vxflexos","uid":"...
API Version: storage.dell.com/v1
Kind: ContainerStorageModule
Metadata:
  Creation Timestamp: 2024-02-09T15:18:37Z
  Finalizers:
    finalizer.dell.emc.com
  Generation: 2
  Resource Version: 30157378
  UID: ccfc9562-404b-45a3-b624-bb8edcd08438
Spec:
  Driver:
    Common:
      Envs:
        Name: X_CSI_VXFLEXOS_ENABLELISTVOLUMESNAPSHOT
        Value: false
        Name: X_CSI_VXFLEXOS_ENABLESNAPSHOTCGDELETE
        Value: false
        Name: X_CSI_DEBUG
        Value: true
        Name: X_CSI_ALLOW_RWO_MULTI_POD_ACCESS
        Value: false
        Name: KUBELET_CONFIG_DIR
        Value: /var/lib/kubelet
        Name: CERT_SECRET_COUNT
        Value: 0
        Name: X_CSI_QUOTA_ENABLED
        Value: false
      Image: dellemc/csi-vxflexos:nightly
      Image Pull Policy: IfNotPresent
    Config Version: v2.9.1
    Controller:
      Envs:
        Name: X_CSI_HEALTH_MONITOR_ENABLED
        Value: false
    Csi Driver Spec:
      F S Group Policy: File
      Storage Capacity: true
    Csi Driver Type: powerflex
    Dns Policy: ClusterFirstWithHostNet
    Force Remove Driver: true
    Init Containers:
      Envs:
        Name: MDM
        Value: 10.XXX.XX.XXX,10.XXX.XX.XXX
      Image: dellemc/sdc:4.5
      Image Pull Policy: IfNotPresent
      Name: sdc
    Node:
      Envs:
        Name: X_CSI_HEALTH_MONITOR_ENABLED
        Value: false
        Name: X_CSI_APPROVE_SDC_ENABLED
        Value: false
        Name: X_CSI_RENAME_SDC_ENABLED
        Value: false
        Name: X_CSI_MAX_VOLUMES_PER_NODE
        Value: 0
        Name: X_CSI_RENAME_SDC_PREFIX
    Replicas: 1
    Side Cars:
      Enabled: false
      Envs:
        Name: HOST_PID
        Value: 1
        Name: MDM
        Value: 10.XXX.XX.XXX,10.XXX.XX.XXX
      Image: dellemc/sdc:4.5
      Name: sdc-monitor
      Args:
        --monitor-interval=60s
      Enabled: false
      Name: csi-external-health-monitor-controller
  Modules:
    Components:
      Envs:
        Name: PROXY_HOST
        Value: authorization-ingress-nginx-controller.authorization.svc.cluster.local
        Name: SKIP_CERTIFICATE_VALIDATION
        Value: true
      Image: dellemc/csm-authorization-sidecar:nightly
      Name: karavi-authorization-proxy
    Config Version: v1.9.1
    Enabled: false
    Name: authorization
    Components:
      Enabled: false
      Envs:
        Name: TOPOLOGY_LOG_LEVEL
        Value: INFO
      Image: dellemc/csm-topology:nightly
      Name: topology
      Enabled: false
      Envs:
        Name: NGINX_PROXY_IMAGE
        Value: nginxinc/nginx-unprivileged:1.20
      Image: otel/opentelemetry-collector:0.42.0
      Name: otel-collector
      Enabled: false
      Envs:
        Name: POWERFLEX_MAX_CONCURRENT_QUERIES
        Value: 10
        Name: POWERFLEX_SDC_METRICS_ENABLED
        Value: true
        Name: POWERFLEX_VOLUME_METRICS_ENABLED
        Value: true
        Name: POWERFLEX_STORAGE_POOL_METRICS_ENABLED
        Value: true
        Name: POWERFLEX_SDC_IO_POLL_FREQUENCY
        Value: 10
        Name: POWERFLEX_VOLUME_IO_POLL_FREQUENCY
        Value: 10
        Name: POWERFLEX_STORAGE_POOL_POLL_FREQUENCY
        Value: 10
        Name: POWERFLEX_LOG_LEVEL
        Value: INFO
        Name: POWERFLEX_LOG_FORMAT
        Value: TEXT
        Name: COLLECTOR_ADDRESS
        Value: otel-collector:55680
      Image: dellemc/csm-metrics-powerflex:nightly
      Name: metrics-powerflex
    Config Version: v1.7.0
    Enabled: false
    Name: observability
    Components:
      Envs:
        Name: X_CSI_REPLICATION_PREFIX
        Value: replication.storage.dell.com
        Name: X_CSI_REPLICATION_CONTEXT_PREFIX
        Value: powerflex
      Image: dellemc/dell-csi-replicator:v1.4.0
      Name: dell-csi-replicator
      Envs:
        Name: TARGET_CLUSTERS_IDS
        Value: self
        Name: REPLICATION_CTRL_LOG_LEVEL
        Value: debug
        Name: REPLICATION_CTRL_REPLICAS
        Value: 1
        Name: RETRY_INTERVAL_MIN
        Value: 1s
        Name: RETRY_INTERVAL_MAX
        Value: 5m
      Image: dellemc/dell-replication-controller:v1.4.0
      Name: dell-replication-controller-manager
      Image: dellemc/dell-replication-init:v1.0.0
      Name: dell-replication-controller-init
    Config Version: v1.4.0
    Enabled: false
    Name: replication
    Components:
      Args:
        --csisock=unix:/var/run/csi/csi.sock
        --labelvalue=csi-vxflexos
        --mode=controller
        --skipArrayConnectionValidation=false
        --driver-config-params=/vxflexos-config-params/driver-config-params.yaml
        --driverPodLabelValue=dell-storage
        --ignoreVolumelessPods=false
      Image: dellemc/podmon:nightly
      Image Pull Policy: IfNotPresent
      Name: podmon-controller
      Args:
        --csisock=unix:/var/lib/kubelet/plugins/vxflexos.emc.dell.com/csi_sock
        --labelvalue=csi-vxflexos
        --mode=node
        --leaderelection=false
        --driver-config-params=/vxflexos-config-params/driver-config-params.yaml
        --driverPodLabelValue=dell-storage
        --ignoreVolumelessPods=false
      Envs:
        Name: X_CSI_PODMON_API_PORT
        Value: 8083
      Image: dellemc/podmon:nightly
      Image Pull Policy: IfNotPresent
      Name: podmon-node
    Config Version: v1.8.1
    Enabled: false
    Name: resiliency
Status:
  Controller Status:
    Available: 0
    Desired: 1
    Failed: 1
  Node Status:
    Available: 2
    Desired: 2
    Failed: 0
  State: Failed
Events:
  Type  Reason  Age  From  Message
  Normal  Updated  3m12s  csm  Object finalizer is added
  Normal  Completed  3m11s (x2 over 3m11s)  csm  install/update storage component: test-vxflexos completed OK
  Normal  Completed  3m11s  csm  Driver deployment running OK
  Warning  Updated  3m11s  csm  at 1707491918292836118 Pod error details error message for default-source-cluster PodInitializing=
  Warning  Updated  3m11s  csm  at 1707491918387122708 Pod error details error message for default-source-cluster PodInitializing=
  Warning  Updated  3m11s (x3 over 3m11s)  csm  Failed install: Operation cannot be fulfilled on containerstoragemodules.storage.dell.com "test-vxflexos": the object has been modified; please apply your changes to the latest version and try again
  Warning  Updated  3m11s  csm  at 1707491918502243242 Pod error details error message for default-source-cluster PodInitializing=
  Warning  Updated  3m11s  csm  at 1707491918603289132 Pod error details error message for default-source-cluster PodInitializing=
  Warning  Updated  3m11s  csm  at 1707491918633307657 Pod error details error message for default-source-cluster PodInitializing=
  Warning  Updated  3m11s  csm  at 1707491918702738940 Pod error details error message for default-source-cluster PodInitializing=
  Warning  Updated  3m10s  csm  at 1707491919356615788 Pod error details error message for default-source-cluster PodInitializing=
  Warning  Updated  3m10s  csm  at 1707491919773768318 Pod error details error message for default-source-cluster PodInitializing=
  Warning  Updated  3m9s (x3 over 3m9s)  csm  (combined from similar events): at 1707491920766553739 Pod error details error message for default-source-cluster PodInitializing=
  Normal  Completed  3m8s  csm  at 1707491921372865165 Driver pods running OK
  Normal  Completed  3m8s  csm  Driver daemonset running OK
[root@master-1-Zaglt7mQUY8Wg e2e]# k get pods -A
NAMESPACE  NAME  READY  STATUS  RESTARTS  AGE
authorization  authorization-ingress-nginx-controller-58cdf8bb96-4gxz5  1/1  Running  0  4m4s
authorization  cert-manager-765754f9cd-vhpgj  1/1  Running  0  4m4s
authorization  cert-manager-cainjector-759bbd747b-797kp  1/1  Running  0  4m4s
authorization  cert-manager-webhook-6fd48c65c8-nhf5t  1/1  Running  0  4m4s
authorization  proxy-server-5755f8cbdd-hx4t7  3/3  Running  0  4m5s
authorization  redis-commander-5475c6469b-r9znn  1/1  Running  0  4m5s
authorization  redis-primary-76c94759c4-gsg24  1/1  Running  0  4m5s
authorization  role-service-5c945689cb-mh97b  1/1  Running  0  4m5s
authorization  storage-service-56db7c6fbf-bk9lf  1/1  Running  0  4m5s
authorization  tenant-service-58dd6ff68c-sdspk  1/1  Running  0  4m5s
dell-csm-operator  dell-csm-operator-controller-manager-884f7cb9b-kmqrm  1/1  Running  0  170m
kube-flannel  kube-flannel-ds-5vmqz  1/1  Running  0  111d
kube-flannel  kube-flannel-ds-nwfs9  1/1  Running  0  111d
kube-flannel  kube-flannel-ds-z2875  1/1  Running  1 (79d ago)  111d
kube-system  coredns-5d78c9869d-5p48t  1/1  Running  1 (79d ago)  111d
kube-system  coredns-5d78c9869d-kkntt  1/1  Running  1 (79d ago)  111d
kube-system  etcd-master-1-zaglt7mquy8wg.domain  1/1  Running  2 (59d ago)  111d
kube-system  kube-apiserver-master-1-zaglt7mquy8wg.domain  1/1  Running  2 (59d ago)  111d
kube-system  kube-controller-manager-master-1-zaglt7mquy8wg.domain  1/1  Running  1 (79d ago)  111d
kube-system  kube-proxy-gf2ws  1/1  Running  0  111d
kube-system  kube-proxy-k2sch  1/1  Running  1 (79d ago)  111d
kube-system  kube-proxy-tq6nr  1/1  Running  0  111d
kube-system  kube-scheduler-master-1-zaglt7mquy8wg.domain  1/1  Running  1 (79d ago)  111d
kube-system  snapshot-controller-55687d7977-2lhqg  1/1  Running  1 (79d ago)  83d
kube-system  snapshot-controller-55687d7977-x46c6  1/1  Running  0  83d
minio  minio-0  1/1  Running  0  98d
minio  minio-1  1/1  Running  0  98d
minio  minio-2  1/1  Running  0  98d
minio  minio-3  1/1  Running  0  98d
test-vxflexos  test-vxflexos-controller-797f95f7c7-xfs7r  5/5  Running  0  3m24s
test-vxflexos  test-vxflexos-node-js29j  2/2  Running  0  3m24s
test-vxflexos  test-vxflexos-node-pg75s  2/2  Running  0  3m24s
vxflexos  vxflexos-controller-99fc4778f-clnzw  5/5  Running  5 (47m ago)  20h
vxflexos  vxflexos-node-9f6vm  2/2  Running  0  20h
vxflexos  vxflexos-node-gvrm9  2/2  Running  0  20h
[root@master-1-Zaglt7mQUY8Wg e2e]# k get csm -A
NAMESPACE      NAME           CREATIONTIME  CSIDRIVERTYPE  CONFIGVERSION  STATE
authorization  authorization  5m                                          Failed
test-vxflexos  test-vxflexos  4m19s         powerflex      v2.9.1         Failed
[root@master-1-Zaglt7mQUY8Wg e2e]#`
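The logs above show the CSM object reporting `Controller Status: Available: 0` while the controller pod is `5/5 Running`. A minimal sketch of how that mismatch could be checked from the command line follows. It assumes the CSM status fields map to `.status.controllerStatus.available` (inferred from the `describe` output) and that the controller Deployment is named `<csm-name>-controller` (inferred from the pod name); the helper name `controller_status_mismatch` is hypothetical.

```shell
# Hypothetical helper, not part of csm-operator.
# Compares what the CSM object reports for the controller with what the
# controller Deployment itself reports. Returns 0 when the two disagree.
controller_status_mismatch() {
  local ns="$1" csm="$2"
  # Assumption: Deployment name is "<csm-name>-controller" unless given as $3.
  local deploy="${3:-$2-controller}"
  local reported actual
  # Assumption: field path inferred from the "Controller Status / Available" lines.
  reported=$(kubectl get csm "$csm" -n "$ns" -o jsonpath='{.status.controllerStatus.available}')
  actual=$(kubectl get deployment "$deploy" -n "$ns" -o jsonpath='{.status.availableReplicas}')
  echo "csm reports available=${reported:-0}, deployment reports availableReplicas=${actual:-0}"
  [ "${reported:-0}" != "${actual:-0}" ]
}
```

For the environment in this report, it could be called as `controller_status_mismatch test-vxflexos test-vxflexos`. A zero exit code would indicate that the CSM status does not match the Deployment's own status.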

Screenshots

No response

Additional Environment Information

No response

Steps to Reproduce

Install the PowerFlex driver repeatedly until the CSM object stays stuck in the Failed state.
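Because the failure is intermittent, the repeated install can be scripted. The sketch below is one possible loop, not the e2e test itself: the manifest path, the attempt count, and the settle time are assumptions, and the function names are hypothetical. It treats the bug as reproduced when `.status.state` is `Failed` while every pod in the namespace is Running with all containers ready.

```shell
# Sketch only; function names, manifest path, and timings are assumptions.

# Print the CSM state (the STATE column of `kubectl get csm`).
csm_state() {
  kubectl get csm "$2" -n "$1" -o jsonpath='{.status.state}'
}

# Reads `kubectl get pods -n <ns> --no-headers` output on stdin.
# Succeeds only if at least one pod is listed and every pod is Running
# with all of its containers ready (READY column x/y with x == y).
all_pods_ready() {
  awk '
    { split($2, r, "/"); if ($3 != "Running" || r[1] != r[2]) bad = 1; n++ }
    END { exit (n == 0 || bad) ? 1 : 0 }
  '
}

# Succeeds when the CSM is Failed although all pods in the namespace are ready.
csm_is_stuck() {
  local ns="$1" name="$2"
  [ "$(csm_state "$ns" "$name")" = "Failed" ] &&
    kubectl get pods -n "$ns" --no-headers | all_pods_ready
}

# Install, wait, check, uninstall; stop at the first stuck CSM.
# Arguments: namespace, CSM name, manifest file, attempts (default 20),
# settle time in seconds (default 180).
reproduce() {
  local ns="$1" name="$2" manifest="$3" attempts="${4:-20}" settle="${5:-180}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    kubectl apply -f "$manifest"
    sleep "$settle"
    if csm_is_stuck "$ns" "$name"; then
      echo "attempt $i: CSM $ns/$name is Failed while all pods are Running"
      return 0
    fi
    kubectl delete -f "$manifest" --wait=true
    i=$((i + 1))
  done
  echo "not reproduced in $attempts attempts"
  return 1
}
```

A run against the namespace from this report might look like `reproduce test-vxflexos test-vxflexos ./csm-powerflex.yaml`, where `./csm-powerflex.yaml` is a placeholder for whatever manifest creates the CSM object. On a hit, the loop leaves the failed install in place so it can be inspected.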

Expected Behavior

The CSM object should go into the Success state once the driver deployment succeeds and all driver pods are running.

CSM Driver(s)

CSI Driver for PowerFlex v2.9.1

Installation Type

CSM-Operator v1.4.1

Container Storage Modules Enabled

No response

Container Orchestrator

Kubernetes v1.27.2

Operating System

RHEL 8.9

jooseppi-luna commented 9 months ago

Fixed in v1.9.2