Closed jooseppi-luna closed 9 months ago
Fixed in v1.9.2
Bug Description
While running the csm-operator e2e tests for the CSM v1.9.2 release, we found that the CSM object associated with the PowerFlex driver intermittently stays stuck in the Failed state even after all of the pods reach the Running state.
Logs
`[root@master-1-Zaglt7mQUY8Wg e2e]# k describe csm -n test-vxflexos test-vxflexos
Name: test-vxflexos
Namespace: test-vxflexos
Labels:
Annotations: storage.dell.com/CSMOperatorConfigVersion: v2.9.1
storage.dell.com/CSMVersion: v1.9.2
storage.dell.com/PreviouslyAppliedConfiguration:
{"kind":"ContainerStorageModule","apiVersion":"storage.dell.com/v1","metadata":{"name":"test-vxflexos","namespace":"test-vxflexos","uid":"...
API Version: storage.dell.com/v1
Kind: ContainerStorageModule
Metadata:
Creation Timestamp: 2024-02-09T15:18:37Z
Finalizers:
finalizer.dell.emc.com
Generation: 2
Resource Version: 30157378
UID: ccfc9562-404b-45a3-b624-bb8edcd08438
Spec:
Driver:
Common:
Envs:
Name: X_CSI_VXFLEXOS_ENABLELISTVOLUMESNAPSHOT
Value: false
Name: X_CSI_VXFLEXOS_ENABLESNAPSHOTCGDELETE
Value: false
Name: X_CSI_DEBUG
Value: true
Name: X_CSI_ALLOW_RWO_MULTI_POD_ACCESS
Value: false
Name: KUBELET_CONFIG_DIR
Value: /var/lib/kubelet
Name: CERT_SECRET_COUNT
Value: 0
Name: X_CSI_QUOTA_ENABLED
Value: false
Image: dellemc/csi-vxflexos:nightly
Image Pull Policy: IfNotPresent
Config Version: v2.9.1
Controller:
Envs:
Name: X_CSI_HEALTH_MONITOR_ENABLED
Value: false
Csi Driver Spec:
F S Group Policy: File
Storage Capacity: true
Csi Driver Type: powerflex
Dns Policy: ClusterFirstWithHostNet
Force Remove Driver: true
Init Containers:
Envs:
Name: MDM
Value: 10.XXX.XX.XXX,10.XXX.XX.XXX
Image: dellemc/sdc:4.5
Image Pull Policy: IfNotPresent
Name: sdc
Node:
Envs:
Name: X_CSI_HEALTH_MONITOR_ENABLED
Value: false
Name: X_CSI_APPROVE_SDC_ENABLED
Value: false
Name: X_CSI_RENAME_SDC_ENABLED
Value: false
Name: X_CSI_MAX_VOLUMES_PER_NODE
Value: 0
Name: X_CSI_RENAME_SDC_PREFIX
Replicas: 1
Side Cars:
Enabled: false
Envs:
Name: HOST_PID
Value: 1
Name: MDM
Value: 10.XXX.XX.XXX,10.XXX.XX.XXX
Image: dellemc/sdc:4.5
Name: sdc-monitor
Args:
--monitor-interval=60s
Enabled: false
Name: csi-external-health-monitor-controller
Modules:
Components:
Envs:
Name: PROXY_HOST
Value: authorization-ingress-nginx-controller.authorization.svc.cluster.local
Name: SKIP_CERTIFICATE_VALIDATION
Value: true
Image: dellemc/csm-authorization-sidecar:nightly
Name: karavi-authorization-proxy
Config Version: v1.9.1
Enabled: false
Name: authorization
Components:
Enabled: false
Envs:
Name: TOPOLOGY_LOG_LEVEL
Value: INFO
Image: dellemc/csm-topology:nightly
Name: topology
Enabled: false
Envs:
Name: NGINX_PROXY_IMAGE
Value: nginxinc/nginx-unprivileged:1.20
Image: otel/opentelemetry-collector:0.42.0
Name: otel-collector
Enabled: false
Envs:
Name: POWERFLEX_MAX_CONCURRENT_QUERIES
Value: 10
Name: POWERFLEX_SDC_METRICS_ENABLED
Value: true
Name: POWERFLEX_VOLUME_METRICS_ENABLED
Value: true
Name: POWERFLEX_STORAGE_POOL_METRICS_ENABLED
Value: true
Name: POWERFLEX_SDC_IO_POLL_FREQUENCY
Value: 10
Name: POWERFLEX_VOLUME_IO_POLL_FREQUENCY
Value: 10
Name: POWERFLEX_STORAGE_POOL_POLL_FREQUENCY
Value: 10
Name: POWERFLEX_LOG_LEVEL
Value: INFO
Name: POWERFLEX_LOG_FORMAT
Value: TEXT
Name: COLLECTOR_ADDRESS
Value: otel-collector:55680
Image: dellemc/csm-metrics-powerflex:nightly
Name: metrics-powerflex
Config Version: v1.7.0
Enabled: false
Name: observability
Components:
Envs:
Name: X_CSI_REPLICATION_PREFIX
Value: replication.storage.dell.com
Name: X_CSI_REPLICATION_CONTEXT_PREFIX
Value: powerflex
Image: dellemc/dell-csi-replicator:v1.4.0
Name: dell-csi-replicator
Envs:
Name: TARGET_CLUSTERS_IDS
Value: self
Name: REPLICATION_CTRL_LOG_LEVEL
Value: debug
Name: REPLICATION_CTRL_REPLICAS
Value: 1
Name: RETRY_INTERVAL_MIN
Value: 1s
Name: RETRY_INTERVAL_MAX
Value: 5m
Image: dellemc/dell-replication-controller:v1.4.0
Name: dell-replication-controller-manager
Image: dellemc/dell-replication-init:v1.0.0
Name: dell-replication-controller-init
Config Version: v1.4.0
Enabled: false
Name: replication
Components:
Args:
--csisock=unix:/var/run/csi/csi.sock
--labelvalue=csi-vxflexos
--mode=controller
--skipArrayConnectionValidation=false
--driver-config-params=/vxflexos-config-params/driver-config-params.yaml
--driverPodLabelValue=dell-storage
--ignoreVolumelessPods=false
Image: dellemc/podmon:nightly
Image Pull Policy: IfNotPresent
Name: podmon-controller
Args:
--csisock=unix:/var/lib/kubelet/plugins/vxflexos.emc.dell.com/csi_sock
--labelvalue=csi-vxflexos
--mode=node
--leaderelection=false
--driver-config-params=/vxflexos-config-params/driver-config-params.yaml
--driverPodLabelValue=dell-storage
--ignoreVolumelessPods=false
Envs:
Name: X_CSI_PODMON_API_PORT
Value: 8083
Image: dellemc/podmon:nightly
Image Pull Policy: IfNotPresent
Name: podmon-node
Config Version: v1.8.1
Enabled: false
Name: resiliency
Status:
Controller Status:
Available: 0
Desired: 1
Failed: 1
Node Status:
Available: 2
Desired: 2
Failed: 0
State: Failed
Events:
Type     Reason     Age                   From  Message
Normal   Updated    3m12s                 csm   Object finalizer is added
Normal   Completed  3m11s (x2 over 3m11s) csm   install/update storage component: test-vxflexos completed OK
Normal   Completed  3m11s                 csm   Driver deployment running OK
Warning  Updated    3m11s                 csm   at 1707491918292836118 Pod error details error message for default-source-cluster PodInitializing=
Warning  Updated    3m11s                 csm   at 1707491918387122708 Pod error details error message for default-source-cluster PodInitializing=
Warning  Updated    3m11s (x3 over 3m11s) csm   Failed install: Operation cannot be fulfilled on containerstoragemodules.storage.dell.com "test-vxflexos": the object has been modified; please apply your changes to the latest version and try again
Warning  Updated    3m11s                 csm   at 1707491918502243242 Pod error details error message for default-source-cluster PodInitializing=
Warning  Updated    3m11s                 csm   at 1707491918603289132 Pod error details error message for default-source-cluster PodInitializing=
Warning  Updated    3m11s                 csm   at 1707491918633307657 Pod error details error message for default-source-cluster PodInitializing=
Warning  Updated    3m11s                 csm   at 1707491918702738940 Pod error details error message for default-source-cluster PodInitializing=
Warning  Updated    3m10s                 csm   at 1707491919356615788 Pod error details error message for default-source-cluster PodInitializing=
Warning  Updated    3m10s                 csm   at 1707491919773768318 Pod error details error message for default-source-cluster PodInitializing=
Warning  Updated    3m9s (x3 over 3m9s)   csm   (combined from similar events): at 1707491920766553739 Pod error details error message for default-source-cluster PodInitializing=
Normal   Completed  3m8s                  csm   at 1707491921372865165 Driver pods running OK
Normal   Completed  3m8s                  csm   Driver daemonset running OK
[root@master-1-Zaglt7mQUY8Wg e2e]# k get pods -A
NAMESPACE          NAME                                                     READY  STATUS   RESTARTS     AGE
authorization      authorization-ingress-nginx-controller-58cdf8bb96-4gxz5  1/1    Running  0            4m4s
authorization      cert-manager-765754f9cd-vhpgj                            1/1    Running  0            4m4s
authorization      cert-manager-cainjector-759bbd747b-797kp                 1/1    Running  0            4m4s
authorization      cert-manager-webhook-6fd48c65c8-nhf5t                    1/1    Running  0            4m4s
authorization      proxy-server-5755f8cbdd-hx4t7                            3/3    Running  0            4m5s
authorization      redis-commander-5475c6469b-r9znn                         1/1    Running  0            4m5s
authorization      redis-primary-76c94759c4-gsg24                           1/1    Running  0            4m5s
authorization      role-service-5c945689cb-mh97b                            1/1    Running  0            4m5s
authorization      storage-service-56db7c6fbf-bk9lf                         1/1    Running  0            4m5s
authorization      tenant-service-58dd6ff68c-sdspk                          1/1    Running  0            4m5s
dell-csm-operator  dell-csm-operator-controller-manager-884f7cb9b-kmqrm     1/1    Running  0            170m
kube-flannel       kube-flannel-ds-5vmqz                                    1/1    Running  0            111d
kube-flannel       kube-flannel-ds-nwfs9                                    1/1    Running  0            111d
kube-flannel       kube-flannel-ds-z2875                                    1/1    Running  1 (79d ago)  111d
kube-system        coredns-5d78c9869d-5p48t                                 1/1    Running  1 (79d ago)  111d
kube-system        coredns-5d78c9869d-kkntt                                 1/1    Running  1 (79d ago)  111d
kube-system        etcd-master-1-zaglt7mquy8wg.domain                       1/1    Running  2 (59d ago)  111d
kube-system        kube-apiserver-master-1-zaglt7mquy8wg.domain             1/1    Running  2 (59d ago)  111d
kube-system        kube-controller-manager-master-1-zaglt7mquy8wg.domain    1/1    Running  1 (79d ago)  111d
kube-system        kube-proxy-gf2ws                                         1/1    Running  0            111d
kube-system        kube-proxy-k2sch                                         1/1    Running  1 (79d ago)  111d
kube-system        kube-proxy-tq6nr                                         1/1    Running  0            111d
kube-system        kube-scheduler-master-1-zaglt7mquy8wg.domain             1/1    Running  1 (79d ago)  111d
kube-system        snapshot-controller-55687d7977-2lhqg                     1/1    Running  1 (79d ago)  83d
kube-system        snapshot-controller-55687d7977-x46c6                     1/1    Running  0            83d
minio              minio-0                                                  1/1    Running  0            98d
minio              minio-1                                                  1/1    Running  0            98d
minio              minio-2                                                  1/1    Running  0            98d
minio              minio-3                                                  1/1    Running  0            98d
test-vxflexos      test-vxflexos-controller-797f95f7c7-xfs7r                5/5    Running  0            3m24s
test-vxflexos      test-vxflexos-node-js29j                                 2/2    Running  0            3m24s
test-vxflexos      test-vxflexos-node-pg75s                                 2/2    Running  0            3m24s
vxflexos           vxflexos-controller-99fc4778f-clnzw                      5/5    Running  5 (47m ago)  20h
vxflexos           vxflexos-node-9f6vm                                      2/2    Running  0            20h
vxflexos           vxflexos-node-gvrm9                                      2/2    Running  0            20h
[root@master-1-Zaglt7mQUY8Wg e2e]# k get csm -A
NAMESPACE      NAME           CREATIONTIME  CSIDRIVERTYPE  CONFIGVERSION  STATE
authorization  authorization  5m                                          Failed
test-vxflexos  test-vxflexos  4m19s         powerflex      v2.9.1         Failed
[root@master-1-Zaglt7mQUY8Wg e2e]#`
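The events above include the API server's conflict message ("the object has been modified; please apply your changes to the latest version and try again"), which is what an update with a stale resource version returns. Whether that conflict is what leaves the CSM in the Failed state is not confirmed in this report. As a rough illustration only, the usual remedy for this class of error is to re-read the latest object and reapply the change. The sketch below uses a toy in-memory store; `InMemoryStore`, `ConflictError`, and `update_with_retry` are hypothetical names and not part of csm-operator.

```python
import copy


class ConflictError(Exception):
    """Raised when an update carries a stale resource version."""


class InMemoryStore:
    """Toy stand-in for an API server that enforces optimistic concurrency."""

    def __init__(self, obj):
        self._obj = copy.deepcopy(obj)
        self._obj["resourceVersion"] = 1

    def get(self):
        # Return a copy so callers cannot mutate the stored object directly.
        return copy.deepcopy(self._obj)

    def update(self, obj):
        # Reject writes that were based on an older version of the object.
        if obj["resourceVersion"] != self._obj["resourceVersion"]:
            raise ConflictError("the object has been modified")
        stored = copy.deepcopy(obj)
        stored["resourceVersion"] += 1
        self._obj = stored
        return copy.deepcopy(stored)


def update_with_retry(store, mutate, max_attempts=5):
    """Re-read the latest object and reapply `mutate` after each conflict."""
    for attempt in range(1, max_attempts + 1):
        latest = store.get()
        mutate(latest)
        try:
            return store.update(latest)
        except ConflictError:
            if attempt == max_attempts:
                raise
```

The key point is that the mutation is reapplied to a freshly fetched copy on every attempt, so a status write is never silently dropped because another writer got there first.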
Steps to Reproduce
Install PowerFlex repeatedly until the CSM stays stuck in a failed state.
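Because the failure is intermittent, it may help to script the check that follows each install. The following is a minimal sketch, not part of the e2e suite: `wait_for_state` polls a caller-supplied function until the wanted state appears or a timeout expires, and `kubectl_csm_state` is one possible getter. The `.status.state` JSONPath is an assumption inferred from the `Status: ... State:` field in the describe output above, and the timeout values are arbitrary.

```python
import subprocess
import time


def kubectl_csm_state(namespace, name):
    """Read the CSM state via kubectl (assumes the field is .status.state)."""
    out = subprocess.run(
        ["kubectl", "get", "csm", name, "-n", namespace,
         "-o", "jsonpath={.status.state}"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()


def wait_for_state(get_state, wanted, timeout_s=300.0, interval_s=5.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll get_state() until it returns `wanted` or the timeout expires.

    Returns (reached, last_state). `clock` and `sleep` are injectable so the
    loop can be exercised without real waiting.
    """
    deadline = clock() + timeout_s
    state = get_state()
    while state != wanted and clock() < deadline:
        sleep(interval_s)
        state = get_state()
    return state == wanted, state
```

For the object in this report, a call might look like `wait_for_state(lambda: kubectl_csm_state("test-vxflexos", "test-vxflexos"), "Success")`, where the state name follows the wording used in this issue. A `False` result after the pods are all Running would indicate a reproduction.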
Expected Behavior
The CSM object should go into the Success state once all of the driver pods are running.
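The mismatch described here (CSM reported as Failed while every driver pod is fully ready) can be flagged mechanically. The helper below is a small diagnostic sketch under that assumption; it does not reflect how the operator computes its state, and both function names are hypothetical. It takes the CSM state string and the `READY` column values from `kubectl get pods`.

```python
def all_ready(ready_columns):
    """True when every READY value such as '5/5' has ready == total."""
    if not ready_columns:
        return False
    for value in ready_columns:
        ready, total = value.split("/")
        if int(ready) != int(total):
            return False
    return True


def status_looks_stale(csm_state, ready_columns):
    """Flag a CSM that reports Failed although all of its pods are ready."""
    return csm_state == "Failed" and all_ready(ready_columns)
```

With the values from the logs in this report, `status_looks_stale("Failed", ["5/5", "2/2", "2/2"])` evaluates to `True`.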
CSM Driver(s)
CSI Driver for PowerFlex v2.9.1
Installation Type
CSM-Operator v1.4.1
Container Orchestrator
Kubernetes v1.27.2
Operating System
RHEL 8.9