ceph / ceph-csi

CSI driver for Ceph
Apache License 2.0

Race condition in RBD node plugin may lead to total data loss while unmounting #4966

Open martin-helmich opened 3 hours ago

martin-helmich commented 3 hours ago

Describe the bug

Recently, we encountered a total data loss in a few of our RBD volumes.

From the logs of the RBD node plugin, we inferred the sequence of events described in the sections below; all of these steps can be correlated with the rbd-nodeplugin logs attached to this issue.

Environment details

Steps to reproduce

Steps to reproduce the behavior:

  1. Unknown. Assistance in further debugging might be required.

Presumably:

  1. Schedule a Pod mounting an RBD volume
  2. Have kubelet crash+restart shortly after mounting the volume, for example due to OOM
  3. Terminate the Pod, causing kubelet to unmount (and involuntarily purge) the volume

Actual results

Total data loss in the RBD volume, due to the RBD node plugin running a recursive os.RemoveAll on a still-mounted volume.

Expected behavior

Mounting and unmounting (even under load and occasional error conditions) should work without data corruption.

Logs

Attached to this issue are our logs from the ceph-csi-rbd-nodeplugin from the time of the incident (newest entries at the top).

Explore-logs-2024-11-18 19_34_03.txt

Additional context

Quick&dirty workaround

We implemented a quick workaround by replacing the os.RemoveAll call with a regular os.Remove, which we can also provide as a PR if required. A rough sketch of the change follows; our reasoning for it is laid out below.
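
A minimal sketch of the workaround, with the surrounding function simplified (the function name and error wrapping here are illustrative, not the exact upstream code):

```go
// Sketch of the workaround; not the exact upstream code.
package rbdworkaround

import (
	"errors"
	"fmt"
	"os"
)

// cleanupTargetPath removes only the target directory itself instead of
// deleting it recursively. If the path is unexpectedly still a live mount,
// os.Remove fails with EBUSY/ENOTEMPTY instead of wiping the mounted data.
func cleanupTargetPath(targetPath string) error {
	if err := os.Remove(targetPath); err != nil && !errors.Is(err, os.ErrNotExist) {
		return fmt.Errorf("failed to remove target path %q: %w", targetPath, err)
	}
	return nil
}
```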

"The kubelet should make sure that this does not happen"

From reading the code comments around the os.RemoveAll call in question, it seems to be presumed that the kubelet will guard against concurrent volume operations:

https://github.com/ceph/ceph-csi/blob/3e9b438e7c3a756dd4c6160a7c793d97ab92053f/internal/rbd/nodeserver.go#L917-L918

This does not seem to be the case. In fact, the CSI spec explicitly recognizes that (emphasis mine) "in some circumstances, the CO MAY lose state (for example when the CO crashes and restarts), and MAY issue multiple calls simultaneously for the same volume. The Plugin, SHOULD handle this as gracefully as possible, and MAY return this error code to reject secondary calls."
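
On the plugin side, rejecting such secondary calls could look roughly like the following sketch (an illustrative in-flight tracker; the type and method names are our own and not an existing ceph-csi API):

```go
// Illustrative in-flight tracker; ceph-csi has its own locking utilities.
package inflight

import (
	"sync"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// VolumeOperations remembers which volumes currently have an RPC in flight,
// so that a second concurrent call for the same volume can be rejected with
// ABORTED, as the CSI spec allows.
type VolumeOperations struct {
	mu     sync.Mutex
	active map[string]struct{}
}

func NewVolumeOperations() *VolumeOperations {
	return &VolumeOperations{active: make(map[string]struct{})}
}

// TryAcquire marks volumeID as busy, or returns ABORTED if an operation for
// it is already running.
func (v *VolumeOperations) TryAcquire(volumeID string) error {
	v.mu.Lock()
	defer v.mu.Unlock()
	if _, busy := v.active[volumeID]; busy {
		return status.Errorf(codes.Aborted, "an operation for volume %s is already in progress", volumeID)
	}
	v.active[volumeID] = struct{}{}
	return nil
}

// Release marks volumeID as idle again; call it when the RPC returns.
func (v *VolumeOperations) Release(volumeID string) {
	v.mu.Lock()
	defer v.mu.Unlock()
	delete(v.active, volumeID)
}
```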

Furthermore, the CephFS node plugin already seems to follow this recommendation: it both checks whether the target path is actually a mount point and uses a simple os.Remove call instead of os.RemoveAll (although even that might be susceptible to race conditions):

https://github.com/ceph/ceph-csi/blob/3e9b438e7c3a756dd4c6160a7c793d97ab92053f/internal/cephfs/nodeserver.go#L642-L645

It should also be noted that the CephFS NodePublishVolume implementation checks whether the target directory is already a mountpoint and simply returns early in that case (which would also prevent issues like this one):

https://github.com/ceph/ceph-csi/blob/3e9b438e7c3a756dd4c6160a7c793d97ab92053f/internal/cephfs/nodeserver.go#L551-L562
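
For illustration, an idempotent publish guard along those lines might look roughly like this (a sketch with assumed helper callbacks, not the CephFS code itself):

```go
// Illustrative idempotency guard; the helper callbacks are assumptions.
package publishguard

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// publishIfNeeded makes NodePublishVolume a no-op when targetPath is already
// a mountpoint, so a retried call (e.g. after a kubelet restart) neither
// stacks a second mount nor races a concurrent unpublish.
func publishIfNeeded(
	targetPath string,
	isMountPoint func(string) (bool, error), // assumed bind-mount-aware check
	bindMount func(string) error, // performs the actual bind mount
) error {
	mounted, err := isMountPoint(targetPath)
	if err != nil {
		return status.Error(codes.Internal, err.Error())
	}
	if mounted {
		// Already published; the CSI spec requires NodePublishVolume to be
		// idempotent, so simply report success.
		return nil
	}
	return bindMount(targetPath)
}
```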

Suggested mitigations

  1. In NodeUnpublishVolume: Use os.RemoveAll only (if at all) after asserting that the target path is not still a mountpoint (see the sketch after this list). Note that this alone might still be potentially racy.
  2. In NodePublishVolume: Test if the target path is already bindmounted to the staging path, and return early without mounting it again.
  3. To be entirely sure, it might also be advisable to add a (per-volume) locking mechanism protecting all volume operations, to ensure that no operations run in parallel.
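
A rough sketch of the first mitigation, assuming a reliable (bind-mount-aware) isMountPoint helper; function names and error handling are illustrative, not the actual ceph-csi code:

```go
// Illustrative guarded cleanup; not the actual ceph-csi code.
package unpublishguard

import (
	"errors"
	"os"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// cleanupAfterUnmount refuses to touch targetPath while it is still a
// mountpoint, and even once it is not, removes only the directory itself
// rather than its contents. isMountPoint is assumed to detect bind mounts.
func cleanupAfterUnmount(targetPath string, isMountPoint func(string) (bool, error)) error {
	mounted, err := isMountPoint(targetPath)
	if err != nil {
		return status.Error(codes.Internal, err.Error())
	}
	if mounted {
		// A concurrent NodePublishVolume or a failed unmount left the path
		// mounted: bail out instead of deleting live data.
		return status.Errorf(codes.Internal, "%s is still mounted, refusing to remove it", targetPath)
	}
	if err := os.Remove(targetPath); err != nil && !errors.Is(err, os.ErrNotExist) {
		return status.Error(codes.Internal, err.Error())
	}
	return nil
}
```
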
Credits for discovering this issue and the associated debugging go to @hensur and @Lucaber.

Madhu-1 commented 3 hours ago

> In NodeUnpublishVolume: Use os.RemoveAll only (if at all) after asserting that the target path is not still a mountpoint. Note that this alone might still be potentially racy.
> In NodePublishVolume: Test if the target path is already bindmounted to the staging path, and return early without mounting it again.

@martin-helmich Thanks for checking this one. I expect that https://github.com/ceph/ceph-csi/blob/3e9b438e7c3a756dd4c6160a7c793d97ab92053f/internal/rbd/nodeserver.go#L875-L878 should try to detect already existing mounts and return success instead of mounting again. As mentioned in https://github.com/kubernetes/utils/blob/6fe5fd82f078d1e207dddbba58d9427258f7956b/mount/mount.go#L59, that check might not detect bind mounts, so we need to look into this. And yes, we should not be using os.RemoveAll; it should be replaced by os.Remove.
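
For reference, a bind-mount-aware check could be built on /proc/self/mountinfo instead of device-number heuristics; a minimal sketch (not the kubernetes mount-utils implementation):

```go
// Minimal sketch of a bind-mount-aware mountpoint check; production code
// should prefer a maintained helper.
package mountcheck

import (
	"bufio"
	"os"
	"path/filepath"
	"strings"
)

// isMountPoint reports whether path is a mountpoint by scanning
// /proc/self/mountinfo. Unlike comparing device numbers with the parent
// directory, this also detects bind mounts from the same filesystem.
// (Paths containing spaces are escaped in mountinfo and not handled here.)
func isMountPoint(path string) (bool, error) {
	abs, err := filepath.Abs(path)
	if err != nil {
		return false, err
	}
	// Resolve symlinks so the comparison matches the kernel's mount table.
	resolved, err := filepath.EvalSymlinks(abs)
	if err != nil {
		return false, err
	}

	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		return false, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// The fifth field of each mountinfo line is the mount point.
		fields := strings.Fields(scanner.Text())
		if len(fields) >= 5 && fields[4] == resolved {
			return true, nil
		}
	}
	return false, scanner.Err()
}
```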

> To be entirely sure, it might also be advisable to add a (per-volume) locking mechanism protecting all volume operations, to ensure that no operations run in parallel.

We had this earlier, but it was removed later on the assumption that this would not happen, and also to speed up the mounts.

Madhu-1 commented 3 hours ago

For reproducing the issue, let's call NodePublish one more time internally right after it succeeds, to trigger two NodePublish calls (to mimic the behaviour).
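
Alternatively, a small gRPC harness outside the plugin can mimic this by calling NodePublishVolume twice for the same volume (a sketch; the socket path, volume ID and target paths are placeholders):

```go
// Test harness sketch; the socket path, volume ID and paths are placeholders.
package main

import (
	"context"
	"log"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Connect to the node plugin's CSI socket (placeholder path).
	conn, err := grpc.Dial("unix:///csi/csi.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("failed to connect to node plugin: %v", err)
	}
	defer conn.Close()

	client := csi.NewNodeClient(conn)
	req := &csi.NodePublishVolumeRequest{
		VolumeId:          "replace-with-real-volume-id",
		StagingTargetPath: "/var/lib/kubelet/plugins/kubernetes.io/csi/example/globalmount",
		TargetPath:        "/var/lib/kubelet/pods/example/volumes/kubernetes.io~csi/pvc-example/mount",
		VolumeCapability: &csi.VolumeCapability{
			AccessType: &csi.VolumeCapability_Mount{Mount: &csi.VolumeCapability_MountVolume{}},
			AccessMode: &csi.VolumeCapability_AccessMode{Mode: csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER},
		},
	}

	// Publish the same volume twice in a row, mimicking a CO that lost state
	// and retried. The second call should be a no-op or be rejected, never
	// stack another mount or race the cleanup path.
	for i := 1; i <= 2; i++ {
		if _, err := client.NodePublishVolume(context.Background(), req); err != nil {
			log.Printf("NodePublishVolume call %d returned error: %v", i, err)
			continue
		}
		log.Printf("NodePublishVolume call %d succeeded", i)
	}
}
```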