ceph / ceph-csi

CSI driver for Ceph
Apache License 2.0

Concurrency issue when provisioning multiple encrypted CephFS instances #4592

Closed NymanRobin closed 1 month ago

NymanRobin commented 2 months ago

When creating multiple pods with separate PVCs, ceph-csi might fail to provision some of the encrypted CephFS instances.

The error from the pod logs is the following:

MountVolume.SetUp failed for volume "pvc-443ef844-03ce-4b87-affe-a802009aa915" : rpc error: code = Internal desc = no fscrypt policy set on directory "/var/lib/kubelet/plugins/kubernetes.io/csi/rook-ceph.cephfs.csi.ceph.com/32418680d66053d8dac8f56697397c8f91f36ca386dbb8d0a0d642bb76609721/globalmount/ceph-csi-encrypted": file or directory "/var/lib/kubelet/plugins/kubernetes.io/csi/rook-ceph.cephfs.csi.ceph.com/32418680d66053d8dac8f56697397c8f91f36ca386dbb8d0a0d642bb76609721/globalmount/ceph-csi-encrypted" is not encrypted   

From this it seems that setting up the encryption fails inside ceph-csi. The following can be seen in the cephfs-csi plugin logs, which makes me suspect a concurrency issue:

E0426 12:53:04.134079       1 fscrypt.go:262] ID: 9782 Req-ID: 0001-0009-rook-ceph-0000000000000002-a3186144-9960-42dd-9fce-4cc125117e1a fscrypt: Failed to apply protector (see also kernel log): %!w(*actions.ErrDifferentFilesystem=&{0xc00157b130 0xc00208a140}) 

Environment

This problem is reproducible in the rook development environment from master, and kernel support can be set up, for example, by following the instructions provided in PR #3460.

The following bash script can be executed and the problem should appear; at least for me it happens every time without exception, the only variance being how many pods fail to deploy:

#!/bin/bash

for i in {1..10}
do
    # Generate unique names for the pod and PVC using the counter variable
    pod_name="pod-$i"
    pvc_name="pvc-$i"

    echo "${pvc_name}"
    echo "${pod_name}"

    # Your command to create a pod and PVC with unique names
    cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: $pod_name
spec:
  containers:
  - name: my-container
    image: nginx
    volumeMounts:
    - mountPath: /var/myVolume
      name: my-volume
  volumes:
  - name: my-volume
    persistentVolumeClaim:
      claimName: $pvc_name
EOF

    cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $pvc_name
spec:
  storageClassName: rook-cephfs
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF
    # If sleep is uncommented this code runs without problems!
    # sleep 10 
done

I will try to debug this further, but if anyone has any ideas or pointers regarding it, or has seen this before by chance, I am all ears. Thanks!

NymanRobin commented 2 months ago

Will keep digging into the issue, can also upload the full logs here if anyone is interested!

NymanRobin commented 2 months ago

It appears to me that this would ideally be addressed within the fscrypt module. I'll open an issue there and see what their response is. Explanation below; the other option I can think of currently would be to limit the concurrency.

The problem seems to arise from the fact that when a new mount is created, the UpdateMountInfo function in the fscrypt module is called. However, this function recreates the mountsByDevice map each time it's called. Consequently, the memory references of the mount objects get updated. This results in a mismatch in policy.apply because we store the old reference in the fscryptContext (%!w(*actions.ErrDifferentFilesystem=&{0xc00157b130 0xc00208a140})).

So, when the lookup from the map is performed and compared to the context in the policy.apply, the memory addresses don't match, even though the device number and path are the same in the new mount object.
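
To illustrate, here is a minimal standalone Go sketch of that failure mode (hypothetical names, not the actual fscrypt code): a map of *Mount pointers is rebuilt on every update, so a pointer captured earlier no longer compares equal to the entry in the rebuilt map, even though the underlying data is identical.

// Hypothetical sketch of the suspected failure mode, not the actual fscrypt code.
package main

import "fmt"

type Mount struct {
	Path   string
	Device uint64
}

var mountsByDevice = map[uint64]*Mount{}

// rebuild mimics an update that discards the old map and allocates fresh
// Mount objects, invalidating pointers handed out earlier.
func rebuild(mounts []Mount) {
	mountsByDevice = map[uint64]*Mount{}
	for _, m := range mounts {
		m := m
		mountsByDevice[m.Device] = &m
	}
}

func main() {
	rebuild([]Mount{{Path: "/mnt/a", Device: 1}})
	ctxMount := mountsByDevice[1] // reference kept in a long-lived context

	// A concurrent provision triggers another rebuild with the same data.
	rebuild([]Mount{{Path: "/mnt/a", Device: 1}, {Path: "/mnt/b", Device: 2}})

	// The pointer comparison fails even though device and path are unchanged,
	// analogous to policy apply returning ErrDifferentFilesystem.
	fmt.Println("same pointer:", ctxMount == mountsByDevice[1])   // false
	fmt.Println("same data:   ", *ctxMount == *mountsByDevice[1]) // true
}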

I tested that if we keep the old references in the map and only add new ones, this issue is resolved, so let's hope that is acceptable for fscrypt.
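
Continuing the sketch above, the change I tested amounts to something like the following (again hypothetical, not the actual fscrypt patch): keep the existing entries and only add devices that were not known before, so previously handed-out pointers stay valid.

// Hypothetical variant of rebuild from the sketch above (not the real patch):
// existing *Mount entries are kept, so older references remain valid, and
// only previously unknown devices get new entries.
func update(mounts []Mount) {
	for _, m := range mounts {
		if _, ok := mountsByDevice[m.Device]; ok {
			continue // keep the old reference for this device
		}
		m := m
		mountsByDevice[m.Device] = &m
	}
}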

nixpanic commented 2 months ago

Great find, thanks for identifying the issue! Let us know if you need assistance with a fix for fscrypt.

NymanRobin commented 2 months ago

Thanks for the offer, but the change was small so I got it! Here is the PR: https://github.com/google/fscrypt/pull/413

@nixpanic should we keep this issue open to upgrade the fscrypt version once it is available?

nixpanic commented 2 months ago

@nixpanic should we keep this issue open to upgrade the fscrypt version once it is available?

If you want to keep this open until Ceph-CSI has rebased the package, that is fine. I leave it up to you what you want to do.

NymanRobin commented 1 month ago

The fscrypt PR is merged, but when discussing with the maintainers they mentioned it might be months until the next release.

Madhu-1 commented 1 month ago

The fscrypt PR is merged, but when discussing with the maintainers they mentioned it might be months until the next release.

@NymanRobin we can update the go.mod to point to the exact commit that we require until we get the next release.
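
For reference, such a pin ends up as a pseudo-version in go.mod, roughly as sketched below; the commit hash and timestamp are placeholders, not the real values.

// Hypothetical go.mod excerpt: running
//   go get github.com/google/fscrypt@<commit-sha>
// resolves that commit to a pseudo-version along these lines.
require github.com/google/fscrypt v0.0.0-20240101000000-0123456789ab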

NymanRobin commented 1 month ago

@Madhu-1 Thanks for the information! I opened a PR for this, let me know if it looks okay to you. Also, is it possible to get the cherry-pick to 3.11 once this PR gets merged?

Madhu-1 commented 1 month ago

@Madhu-1 Thanks for the information! I opened a PR for this, let me know if it looks okay to you. Also, is it possible to get the cherry-pick to 3.11 once this PR gets merged?

Yes, we can do that if someone is blocked and needs this fix.

NymanRobin commented 1 month ago

Awesome, yes that would help us get further with this feature!

I see the PR cannot be merged because of this error:

Unable to update: user NymanRobin is unknown.
Please make sure NymanRobin has logged in Mergify dashboard.

NymanRobin commented 1 month ago

Thanks for the help getting the changes in, all! Are there any plans for a patch release of 3.11? This is not super urgent, I just thought I would ask in case there is a plan already.

Madhu-1 commented 1 month ago

@NymanRobin not yet, we will plan it in the next 1 or 2 weeks.