ceph / ceph-csi

CSI driver for Ceph
Apache License 2.0

PVCs are always pending (failed) a few days after startup. #3287

Closed jpsn123 closed 2 years ago

jpsn123 commented 2 years ago

Describe the bug

PVCs always end up pending (failed) a few days after startup. If I restart the csi-rbd-provisioner pod, the blocked PVC becomes ready and everything works again, but several days later PVCs go pending again and I have to restart the csi-rbd-provisioner pod manually.

Environment details

Log from csi-provisioner

I0805 05:27:49.263038       1 leaderelection.go:278] successfully renewed lease ceph/rbd-csi-ceph-com
I0805 05:27:54.277638       1 leaderelection.go:278] successfully renewed lease ceph/rbd-csi-ceph-com
I0805 05:27:59.294075       1 leaderelection.go:278] successfully renewed lease ceph/rbd-csi-ceph-com
I0805 05:28:04.315262       1 leaderelection.go:278] successfully renewed lease ceph/rbd-csi-ceph-com
I0805 05:28:09.334686       1 leaderelection.go:278] successfully renewed lease ceph/rbd-csi-ceph-com
I0805 05:28:10.792347       1 controller.go:1337] provision "jmp/minio" class "jutze-block-ssd-rep-base-pub2": started
I0805 05:28:10.795361       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"jmp", Name:"minio", UID:"1af8d583-628b-4367-a2a6-fd8fe0f0f89d", APIVersion:"v1", ResourceVersion:"202885700", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "jmp/minio"
I0805 05:28:10.796182       1 controller.go:528] skip translation of storage class for plugin: rbd.csi.ceph.com
I0805 05:28:10.817050       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateVolume
I0805 05:28:10.817588       1 connection.go:184] GRPC request: {"capacity_range":{"required_bytes":8589934592},"name":"pvc-1af8d583-628b-4367-a2a6-fd8fe0f0f89d","parameters":{"clusterID":"ceph-pub2","csi.storage.k8s.io/pv/name":"pvc-1af8d583-628b-4367-a2a6-fd8fe0f0f89d","csi.storage.k8s.io/pvc/name":"minio","csi.storage.k8s.io/pvc/namespace":"jmp","imageFeatures":"layering,exclusive-lock,object-map,fast-diff,deep-flatten","pool":"jutze.base.rbd","snapshotNamePrefix":"csi-snap-ssd-rep-base-pub2-","volumeNamePrefix":"csi-vol-ssd-rep-base-pub2-"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
I0805 05:28:14.085972       1 reflector.go:536] sigs.k8s.io/sig-storage-lib-external-provisioner/v8/controller/controller.go:845: Watch close - *v1.PersistentVolume total 10 items received
I0805 05:28:14.349289       1 leaderelection.go:278] successfully renewed lease ceph/rbd-csi-ceph-com
I0805 05:28:19.364607       1 leaderelection.go:278] successfully renewed lease ceph/rbd-csi-ceph-com
I0805 05:28:24.380493       1 leaderelection.go:278] successfully renewed lease ceph/rbd-csi-ceph-com
I0805 05:28:29.397610       1 leaderelection.go:278] successfully renewed lease ceph/rbd-csi-ceph-com
.
.
.
I0805 05:32:50.292706       1 leaderelection.go:278] successfully renewed lease ceph/rbd-csi-ceph-com
I0805 05:32:55.306197       1 leaderelection.go:278] successfully renewed lease ceph/rbd-csi-ceph-com
I0805 05:32:58.920109       1 reflector.go:536] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.PersistentVolumeClaim total 9 items received
I0805 05:33:00.323129       1 leaderelection.go:278] successfully renewed lease ceph/rbd-csi-ceph-com
I0805 05:33:05.340001       1 leaderelection.go:278] successfully renewed lease ceph/rbd-csi-ceph-com
I0805 05:33:10.358546       1 leaderelection.go:278] successfully renewed lease ceph/rbd-csi-ceph-com
I0805 05:33:10.817725       1 connection.go:186] GRPC response: {}
I0805 05:33:10.817893       1 connection.go:187] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0805 05:33:10.817997       1 controller.go:764] CreateVolume failed, supports topology = false, node selected false => may reschedule = false => state = Background: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0805 05:33:10.818214       1 controller.go:1082] Temporary error received, adding PVC 1af8d583-628b-4367-a2a6-fd8fe0f0f89d to claims in progress
W0805 05:33:10.818318       1 controller.go:934] Retrying syncing claim "1af8d583-628b-4367-a2a6-fd8fe0f0f89d", failure 0
E0805 05:33:10.818448       1 controller.go:957] error syncing claim "1af8d583-628b-4367-a2a6-fd8fe0f0f89d": failed to provision volume with StorageClass "jutze-block-ssd-rep-base-pub2": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0805 05:33:10.818566       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"jmp", Name:"minio", UID:"1af8d583-628b-4367-a2a6-fd8fe0f0f89d", APIVersion:"v1", ResourceVersion:"202885700", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "jutze-block-ssd-rep-base-pub2": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0805 05:33:11.318627       1 controller.go:1337] provision "jmp/minio" class "jutze-block-ssd-rep-base-pub2": started
I0805 05:33:11.319088       1 controller.go:528] skip translation of storage class for plugin: rbd.csi.ceph.com
I0805 05:33:11.319191       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"jmp", Name:"minio", UID:"1af8d583-628b-4367-a2a6-fd8fe0f0f89d", APIVersion:"v1", ResourceVersion:"202885700", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "jmp/minio"
I0805 05:33:11.326700       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateVolume
I0805 05:33:11.326731       1 connection.go:184] GRPC request: {"capacity_range":{"required_bytes":8589934592},"name":"pvc-1af8d583-628b-4367-a2a6-fd8fe0f0f89d","parameters":{"clusterID":"ceph-pub2","csi.storage.k8s.io/pv/name":"pvc-1af8d583-628b-4367-a2a6-fd8fe0f0f89d","csi.storage.k8s.io/pvc/name":"minio","csi.storage.k8s.io/pvc/namespace":"jmp","imageFeatures":"layering,exclusive-lock,object-map,fast-diff,deep-flatten","pool":"jutze.base.rbd","snapshotNamePrefix":"csi-snap-ssd-rep-base-pub2-","volumeNamePrefix":"csi-vol-ssd-rep-base-pub2-"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
I0805 05:33:11.328916       1 connection.go:186] GRPC response: {}
I0805 05:33:11.329104       1 connection.go:187] GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-1af8d583-628b-4367-a2a6-fd8fe0f0f89d already exists
I0805 05:33:11.330289       1 controller.go:764] CreateVolume failed, supports topology = false, node selected false => may reschedule = false => state = Background: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-1af8d583-628b-4367-a2a6-fd8fe0f0f89d already exists
I0805 05:33:11.330479       1 controller.go:1082] Temporary error received, adding PVC 1af8d583-628b-4367-a2a6-fd8fe0f0f89d to claims in progress
W0805 05:33:11.330572       1 controller.go:934] Retrying syncing claim "1af8d583-628b-4367-a2a6-fd8fe0f0f89d", failure 1
E0805 05:33:11.330672       1 controller.go:957] error syncing claim "1af8d583-628b-4367-a2a6-fd8fe0f0f89d": failed to provision volume with StorageClass "jutze-block-ssd-rep-base-pub2": rpc error: code = Aborted desc = an operation with the given Volume ID pvc-1af8d583-628b-4367-a2a6-fd8fe0f0f89d already exists
I0805 05:33:11.330748       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"jmp", Name:"minio", UID:"1af8d583-628b-4367-a2a6-fd8fe0f0f89d", APIVersion:"v1", ResourceVersion:"202885700", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "jutze-block-ssd-rep-base-pub2": rpc error: code = Aborted desc = an operation with the given Volume ID pvc-1af8d583-628b-4367-a2a6-fd8fe0f0f89d already exists
I0805 05:33:12.331767       1 controller.go:1337] provision "jmp/minio" class "jutze-block-ssd-rep-base-pub2": started
I0805 05:33:12.331896       1 controller.go:528] skip translation of storage class for plugin: rbd.csi.ceph.com
I0805 05:33:12.332872       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"jmp", Name:"minio", UID:"1af8d583-628b-4367-a2a6-fd8fe0f0f89d", APIVersion:"v1", ResourceVersion:"202885700", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "jmp/minio"
I0805 05:33:12.340327       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateVolume
I0805 05:33:12.340521       1 connection.go:184] GRPC request: {"capacity_range":{"required_bytes":8589934592},"name":"pvc-1af8d583-628b-4367-a2a6-fd8fe0f0f89d","parameters":{"clusterID":"ceph-pub2","csi.storage.k8s.io/pv/name":"pvc-1af8d583-628b-4367-a2a6-fd8fe0f0f89d","csi.storage.k8s.io/pvc/name":"minio","csi.storage.k8s.io/pvc/namespace":"jmp","imageFeatures":"layering,exclusive-lock,object-map,fast-diff,deep-flatten","pool":"jutze.base.rbd","snapshotNamePrefix":"csi-snap-ssd-rep-base-pub2-","volumeNamePrefix":"csi-vol-ssd-rep-base-pub2-"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
I0805 05:33:12.343730       1 connection.go:186] GRPC response: {}
I0805 05:33:12.344138       1 connection.go:187] GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-1af8d583-628b-4367-a2a6-fd8fe0f0f89d already exists

nixpanic commented 2 years ago

The CreateVolume request for pvc-1af8d583-628b-4367-a2a6-fd8fe0f0f89d never finished. The RBD image should still have been in the process of being created. This can get stuck when the Ceph pool that should hold the image is not healthy. Please check the status of the Ceph cluster and the availability of the OSDs.
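The log pattern above (a first CreateVolume call that ends in DeadlineExceeded after five minutes, then retries rejected within milliseconds with "an operation with the given Volume ID ... already exists") is consistent with the driver keeping a per-volume record of in-flight operations. The following is a minimal Python sketch of that idea, not the driver's actual code: `VolumeLocks`, `create_volume`, and the `pvc-example` name are hypothetical and exist only for illustration.

```python
import threading


class VolumeLocks:
    """Tracks volume names that currently have an operation in flight (hypothetical)."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._in_flight = set()

    def try_acquire(self, volume_name):
        """Return True if no other operation currently holds this volume name."""
        with self._mutex:
            if volume_name in self._in_flight:
                return False
            self._in_flight.add(volume_name)
            return True

    def release(self, volume_name):
        with self._mutex:
            self._in_flight.discard(volume_name)


def create_volume(locks, volume_name, backend_call):
    """Hypothetical handler: reject a retry while an earlier call is still running."""
    if not locks.try_acquire(volume_name):
        return ("Aborted: an operation with the given Volume ID %s already exists"
                % volume_name)
    try:
        backend_call()  # stands in for the storage operation that may hang
        return "OK"
    finally:
        locks.release(volume_name)


# Demonstration: the first call hangs in the backend, so the retry is rejected.
locks = VolumeLocks()
backend_entered = threading.Event()
backend_unblocked = threading.Event()


def hanging_backend():
    backend_entered.set()
    backend_unblocked.wait()  # simulates a storage call that does not return


first_call = threading.Thread(
    target=create_volume, args=(locks, "pvc-example", hanging_backend))
first_call.start()
backend_entered.wait()  # the lock is held from this point on

retry_result = create_volume(locks, "pvc-example", lambda: None)

backend_unblocked.set()  # the stuck call finally returns and releases the lock
first_call.join()
result_after_release = create_volume(locks, "pvc-example", lambda: None)
```

In this sketch the client-side timeout does not release the lock; only the return of the original call does. Under that assumption, retries keep failing with Aborted for as long as the original operation stays stuck, and restarting the process clears the in-memory record.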

jpsn123 commented 2 years ago

@nixpanic The Ceph cluster is always healthy, I am sure of that. It seems that rbdplugin lost its connection to the Ceph cluster and cannot reconnect to it; after a restart, everything works again.

jpsn123 commented 2 years ago

In addition, when a PVC is created successfully, both csi-rbdplugin and csi-rbdplugin-controller show some activity logs. When PVC creation fails, only csi-rbdplugin shows some logs; the csi-rbdplugin-controller container shows nothing except "successfully renewed lease ceph/rbd.csi.ceph.com-ceph".
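When comparing container logs like this, it can help to drop the lease-renewal lines and measure how long each gRPC call took before it returned an error. The sketch below is one possible way to do that in Python; the function names are hypothetical, the regular expression assumes the klog line layout seen in the log above, and the sample lines are copied from that log.

```python
import re
from datetime import timedelta

# Assumed klog layout: "I0805 05:28:10.817050       1 connection.go:183] message"
KLOG_LINE = re.compile(
    r"^[IWEF]\d{4} (\d{2}):(\d{2}):(\d{2})\.(\d{6})\s+\d+ ([\w.]+):\d+\] (.*)$"
)


def parse(line):
    """Return (time of day, source file, message), or None if the line does not match."""
    m = KLOG_LINE.match(line)
    if m is None:
        return None
    hours, minutes, seconds, micros, source, message = m.groups()
    stamp = timedelta(hours=int(hours), minutes=int(minutes),
                      seconds=int(seconds), microseconds=int(micros))
    return stamp, source, message


def without_lease_renewals(lines):
    """Drop lines emitted by leaderelection.go and keep everything else."""
    kept = []
    for line in lines:
        parsed = parse(line)
        if parsed is not None and parsed[1] == "leaderelection.go":
            continue
        kept.append(line)
    return kept


def grpc_call_durations(lines):
    """Pair each 'GRPC call' with the next 'GRPC error'; return (code, seconds) tuples.

    Timestamps are treated as time of day, so a call spanning midnight is not handled.
    """
    durations = []
    call_started = None
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        stamp, _source, message = parsed
        if message.startswith("GRPC call:"):
            call_started = stamp
        elif message.startswith("GRPC error:") and call_started is not None:
            code = re.search(r"code = (\w+)", message)
            durations.append((code.group(1) if code else "unknown",
                              (stamp - call_started).total_seconds()))
            call_started = None
    return durations


SAMPLE = [
    "I0805 05:28:09.334686       1 leaderelection.go:278] successfully renewed lease ceph/rbd-csi-ceph-com",
    "I0805 05:28:10.817050       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateVolume",
    "I0805 05:33:10.817893       1 connection.go:187] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded",
    "I0805 05:33:11.326700       1 connection.go:183] GRPC call: /csi.v1.Controller/CreateVolume",
    "I0805 05:33:11.329104       1 connection.go:187] GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-1af8d583-628b-4367-a2a6-fd8fe0f0f89d already exists",
]

filtered = without_lease_renewals(SAMPLE)
durations = grpc_call_durations(filtered)
```

On the sample lines, the first call takes about 300 seconds before DeadlineExceeded, while the retry is rejected with Aborted after roughly 2 milliseconds, which suggests the retry never reached the storage backend.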

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 2 years ago

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.