ceph / ceph-csi

CSI driver for Ceph
Apache License 2.0

rbd: no option to cancel the stuck rbd force promote operation #2736

Open Madhu-1 opened 2 years ago

Madhu-1 commented 2 years ago

Problem:- During a failover operation, volume replication first tries a regular promote to make the RBD image primary. If that promote fails, it calls promote again with the force option. In some cases the force promote hangs indefinitely and never returns. Because we are using the go-ceph API, there is no way to cancel the ongoing operation, and the only way out is to restart the rbd provisioner pod. The indefinite hang might be due to a bug in RBD (investigation is still ongoing).

Workaround:-

The force promote operation should be executed with a timeout so that the command never hangs and follow-up API calls can force promote the volume again.
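A rough sketch of what "execute with a timeout" could look like in Go. This is not the ceph-csi implementation; `callWithTimeout` and `ErrTimeout` are hypothetical names, and `op` stands in for whatever blocking promote call is being wrapped.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrTimeout is a hypothetical sentinel error so callers can tell a timeout
// apart from a failure reported by the operation itself.
var ErrTimeout = errors.New("operation timed out")

// callWithTimeout runs op in its own goroutine and waits at most timeout
// for it to finish. It bounds how long the caller waits; it does NOT stop op.
func callWithTimeout(timeout time.Duration, op func() error) error {
	// Buffered so the goroutine can still deliver its result and exit
	// if op returns after we have stopped waiting.
	done := make(chan error, 1)
	go func() { done <- op() }()

	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case err := <-done:
		// op finished in time; pass its result through unchanged.
		return err
	case <-timer.C:
		// op is still running; give up waiting and report a timeout.
		return fmt.Errorf("gave up after %s: %w", timeout, ErrTimeout)
	}
}
```

A caller would use it along the lines of `err := callWithTimeout(2*time.Minute, func() error { return forcePromote(image) })`, where `forcePromote` is a placeholder and the two-minute value is arbitrary. Note the limitation: if the wrapped call never returns, its goroutine stays blocked until the process exits, so this only frees the caller to return an error and accept a retry. Actually terminating the stuck operation would need something that can be killed, for example running it as a child process via `exec.CommandContext`.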

similar issues:- https://github.com/ceph/ceph-csi/issues/553

upstream ceph tracker: https://tracker.ceph.com/issues/52913
Red Hat Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2030752

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

Madhu-1 commented 2 years ago

Moving it out of 3.6, as a fix for this is not available in Ceph yet.

humblec commented 2 years ago

removed from the milestone tracker.

humblec commented 2 years ago

@Madhu-1 shall we move this out of 3.7 too?

humblec commented 2 years ago

Moving out of 3.7.0 release.