Open Madhu-1 opened 2 years ago
Moving it out of 3.6, as a fix for this is not available in Ceph yet.
Removed from the milestone tracker.
@Madhu-1 shall we move this out of 3.7 too?
Moving out of 3.7.0 release.
Problem:- During a failover operation, volume replication first tries a normal promote to make the RBD image primary; if that promote fails, it calls promote again with the force option. In some cases the force promote hangs indefinitely and never returns. Because the call goes through the go-ceph API, there is no way to cancel the ongoing operation, and the only way out is to restart the RBD provisioner pod. The indefinite hang might be due to a bug in RBD (the investigation is still ongoing).
Workaround:-
The force promote operation should be executed with a timeout so that the call can never hang indefinitely and follow-up API calls can retry the force promote of the volume.
Similar issues:- https://github.com/ceph/ceph-csi/issues/553
Upstream Ceph tracker:- https://tracker.ceph.com/issues/52913
Red Hat Bugzilla:- https://bugzilla.redhat.com/show_bug.cgi?id=2030752