ceph / ceph-csi

CSI driver for Ceph
Apache License 2.0

Concurrency issue when pvc is mounted to multiple pods #4654

Open z2000l opened 3 weeks ago

z2000l commented 3 weeks ago

Describe the bug

This is an extension/reopen of https://github.com/ceph/ceph-csi/issues/4592. Our testing showed that when creating a deployment with 2 replicas, there is a high chance that one pod will fail to mount the PVC.

Environment details

Same as https://github.com/ceph/ceph-csi/issues/4592

Steps to reproduce

Steps to reproduce the behavior:

  1. Change the pod yaml in issue 4592 to a deployment
  2. Set replicas to 2 (or high enough that pods will run on different nodes)

Actual results

Only one pod will be running

Expected behavior

All pods will be running

Madhu-1 commented 3 weeks ago

@z2000l Instead of reopening the issue, can you be more specific about the problem and provide details?

z2000l commented 3 weeks ago

Yes, we're trying to test CephFS encryption. We used Rook to orchestrate our Ceph cluster and applied the patch to ceph-csi 3.10.2. The patch improved fscrypt a lot. Still, we experienced problems when the replica count is high, especially when pods were running on multiple nodes: a subset of the replicas ran without a problem, while some were stuck in ContainerCreating with error messages like:

MountVolume.MountDevice failed for volume "pvc-ad63a27d-4651-42ae-a96c-ca21de963a38" : rpc error: code = Internal desc = fscrypt: unsupported state metadata=true kernel_policy=false

We suspect that when multiple nodes try to set up the fscrypt policy at the same time, the nodes that fail don't refresh or sync to pick up the policy and get stuck.

And which canary cephcsi image would you like us to try? Thanks.

Madhu-1 commented 3 weeks ago

@z2000l Thanks for the response. Can you please try the quay.io/cephcsi/cephcsi:canary image, and also provide the logs and steps to reproduce so that others can try to reproduce it? It could be a problem with the fscrypt library we are using; if so, we need to open an issue with fscrypt.

@NymanRobin is this still happening, or is it fixed?

NymanRobin commented 3 weeks ago

Thanks for notifying me, @Madhu-1. I thought this was fixed, but in my testing I did not use multiple nodes, so this might still be an issue, as pointed out by @z2000l. I will try with multiple nodes and see if I can reproduce it.

Madhu-1 commented 3 weeks ago

Thanks for notifying me, @Madhu-1. I thought this was fixed, but in my testing I did not use multiple nodes, so this might still be an issue, as pointed out by @z2000l. I will try with multiple nodes and see if I can reproduce it.

Thank you 👍🏻

NymanRobin commented 3 weeks ago

I can confirm that this is consistently reproducible when using a deployment with multiple replicas spread over nodes and RWX accessMode for the PVC. I will start looking into a solution; I'm not sure yet whether the problem is in fscrypt or in the ceph-csi implementation.

NymanRobin commented 3 weeks ago

I now understand the error clearly. It seems that the fscrypt client handles metadata and policy differently, leading to this error. When checking for metadata, if it doesn't exist, we create it. However, this isn't the case for policy checks. This results in scenarios where, for instance, if three pods are created simultaneously, the first creates metadata, and the subsequent two find it. However, none of these pods find the policy, leading to a mismatch because the subsequent two have metadata but no policy.
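
To make the asymmetry concrete, here is a minimal, self-contained Go sketch of the flow described above; getMetadata, createMetadata and getKernelPolicy are placeholder stubs, not the actual ceph-csi or fscrypt functions:

```go
package main

import (
	"errors"
	"fmt"
)

// Placeholder stubs standing in for the real lookups; they only illustrate
// the check-then-create asymmetry, not actual ceph-csi/fscrypt code.
var errNotFound = errors.New("not found")

func getMetadata(vol string) error     { return errNotFound } // pretend metadata is missing
func createMetadata(vol string) error  { return nil }         // metadata creation succeeds
func getKernelPolicy(vol string) error { return errNotFound } // pretend no kernel policy exists

func maybeUnlock(vol string) error {
	haveMeta := true
	if err := getMetadata(vol); errors.Is(err, errNotFound) {
		// Missing metadata is created on the fly...
		if err := createMetadata(vol); err != nil {
			return err
		}
	} else if err != nil {
		return err
	}

	havePolicy := true
	if err := getKernelPolicy(vol); errors.Is(err, errNotFound) {
		// ...but a missing kernel policy is not, so a pod that raced in after
		// another node created the metadata ends up with metadata but no policy.
		havePolicy = false
	} else if err != nil {
		return err
	}

	if haveMeta && !havePolicy {
		return fmt.Errorf("fscrypt: unsupported state metadata=%t kernel_policy=%t", haveMeta, havePolicy)
	}
	return nil
}

func main() {
	fmt.Println(maybeUnlock("pvc-example"))
}
```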

The solution is to properly set up the policy and metadata initially. Once this is done, the race condition won't occur. As a temporary workaround, one can start the deployment with one replica, wait for it to be in the running state, and then scale up without issue.

I believe a permanent solution shouldn't be too difficult. The best idea I currently have is to check whether the initial setup is done; this check needs to be behind a lock, and the lock can be released once the check passes or when the setup is done. I will still do some more thinking about the best course of action, but how does this sound to you initially, @Madhu-1?

EDIT: Okay, actually the pods are spread over nodes, so a mutex won't solve this; it will need a lease or some other mechanism to ensure the setup is run only once.

/assign

github-actions[bot] commented 3 weeks ago

Thanks for taking this issue! Let us know if you have any questions!

github-actions[bot] commented 3 weeks ago

The issue you are trying to assign to yourself is already assigned.

Madhu-1 commented 3 weeks ago

I believe a permanent solution shouldn't be too difficult. The best idea I currently have is to check whether the initial setup is done; this check needs to be behind a lock, and the lock can be released once the check passes or when the setup is done. I will still do some more thinking about the best course of action, but how does this sound to you initially, @Madhu-1?

EDIT: Okay, actually the pods are spread over nodes, so a mutex won't solve this; it will need a lease or some other mechanism to ensure the setup is run only once.

Thanks for exploring it. Yes, as you mentioned, internal locking might not be good; we need to see if we can have something else for this one.

nixpanic commented 2 weeks ago

Hi @NymanRobin,

You might be able to use VolumeLocks from internal/util/idlocker.go for locking too once you add a new fscrypt operation.

Thanks for looking into this!

NymanRobin commented 2 weeks ago

Hey @nixpanic!

Thanks for the suggestion but based on our last discussion I still think the Rados omap is the way to go.

Let me try to explain. As I understand it, the lock is already acquired in NodeStageVolume on line 170, meaning that we are already in a locked state when entering the function maybeUnlockFileEncryption. The reason this does not prevent the race condition is that the VolumeLocks map is local to each node, and multiple nodes can stage and encrypt a volume simultaneously in the RWX case.

Do you agree with this understanding?
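
For context, here is a minimal sketch of the in-memory try-acquire pattern that VolumeLocks implements as I understand it (the real code in internal/util/idlocker.go may differ in detail); the key point is that this state lives in a single nodeplugin process, so it cannot serialize work across nodes:

```go
package main

import "sync"

// volumeLocks is a simplified, node-local lock map in the spirit of
// ceph-csi's VolumeLocks; it is an illustration, not the actual type.
type volumeLocks struct {
	mu    sync.Mutex
	locks map[string]struct{}
}

func newVolumeLocks() *volumeLocks {
	return &volumeLocks{locks: make(map[string]struct{})}
}

// TryAcquire returns false if the volume is already locked in this process.
func (vl *volumeLocks) TryAcquire(volumeID string) bool {
	vl.mu.Lock()
	defer vl.mu.Unlock()
	if _, held := vl.locks[volumeID]; held {
		return false
	}
	vl.locks[volumeID] = struct{}{}
	return true
}

// Release drops the lock so another caller in this process can acquire it.
func (vl *volumeLocks) Release(volumeID string) {
	vl.mu.Lock()
	defer vl.mu.Unlock()
	delete(vl.locks, volumeID)
}

func main() {
	// Two csi nodeplugin pods on two different nodes each have their own map,
	// so both can "successfully" acquire the same volumeID at the same time.
	vl := newVolumeLocks()
	_ = vl.TryAcquire("pvc-example")
	vl.Release("pvc-example")
}
```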

I think my best idea currently would be to implement an interface similar to the reftracker that would track the encryption of the CephFS filesystem, to ensure the encryption is only set up once.
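
A hypothetical sketch of what such an encryption-tracking interface could look like (the names are illustrative only, not an existing ceph-csi API):

```go
package tracker

import "context"

// EncryptionSetupTracker sketches an interface for making sure the fscrypt
// metadata/policy setup runs exactly once per volume across all nodes.
// It is an illustration of the idea, not an existing ceph-csi type.
type EncryptionSetupTracker interface {
	// TryBeginSetup returns true if the caller should perform the initial
	// fscrypt setup for volumeID, or false if another node already did it
	// (or is doing it right now).
	TryBeginSetup(ctx context.Context, volumeID string) (bool, error)

	// MarkSetupDone records that the volume's encryption is fully configured,
	// so later callers can skip both the setup and the lock.
	MarkSetupDone(ctx context.Context, volumeID string) error

	// AbortSetup drops the claim without marking it done, e.g. when the
	// setup failed and another node should retry.
	AbortSetup(ctx context.Context, volumeID string) error
}
```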

NymanRobin commented 2 weeks ago

I tested the Rados omap locking with a simple proof of concept, and it appears to be working well!

During testing, I created a reference object based on the volume in the refTracker. If this succeeded, I proceeded with the encryption setup and, upon exiting the function, deleted the reference. This straightforward lock mechanism resolved all issues, and all pods started up correctly.
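
In rough pseudocode, the proof of concept did something like the following; createVolumeRef, deleteVolumeRef and runFscryptSetup are placeholders for the omap-backed reference operations and the encryption setup, not the actual reftracker API:

```go
package main

import (
	"context"
	"fmt"
)

// Placeholder stubs for the omap-backed reference operations and the fscrypt
// setup; these are not the real ceph-csi reftracker or fscrypt functions.
func createVolumeRef(ctx context.Context, volumeID string) (bool, error) { return true, nil }
func deleteVolumeRef(ctx context.Context, volumeID string) error         { return nil }
func runFscryptSetup(ctx context.Context, volumeID string) error         { return nil }

func setupEncryptionOnce(ctx context.Context, volumeID string) error {
	created, err := createVolumeRef(ctx, volumeID) // try to record our reference in the omap
	if err != nil {
		return err
	}
	if !created {
		// Another node already holds the reference and handles the setup.
		return nil
	}
	// Release the reference when we are done, success or failure.
	defer func() { _ = deleteVolumeRef(ctx, volumeID) }()

	// Perform the fscrypt metadata + policy setup exactly once.
	return runFscryptSetup(ctx, volumeID)
}

func main() {
	fmt.Println(setupEncryptionOnce(context.Background(), "pvc-example"))
}
```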

However, I believe this can be further improved by creating a dedicated interface for encryption tracking.

nixpanic commented 2 weeks ago

Let me try to explain. As I understand it, the lock is already acquired in NodeStageVolume on line 170, meaning that we are already in a locked state when entering the function maybeUnlockFileEncryption. The reason this does not prevent the race condition is that the VolumeLocks map is local to each node, and multiple nodes can stage and encrypt a volume simultaneously in the RWX case.

Do you agree with this understanding?

Yes, if VolumeLocks do not store their state in Rados, they are definitely not usable.

I tested the Rados omap locking with a simple proof of concept, and it appears to be working well!

During testing, I created a reference object based on the volume in the refTracker. If this succeeded, I proceeded with the encryption setup and, upon exiting the function, deleted the reference. This straightforward lock mechanism resolved all issues, and all pods started up correctly.

Great work! Sounds really good 👏

However, I believe this can be further improved by creating a dedicated interface for encryption tracking.

If you have a generic "Rados lock/mutex" function, we might be able to use it for other things too. I can't immediately think of something else than encryption, though. Making it encryption specific is fine, but ideally it is easily adaptable later on to be more generic (if we find a usecase for it).
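
A generic shape for such a lock could be as small as this (hypothetical names, shown only to illustrate how an encryption-specific helper could wrap a reusable primitive):

```go
package lock

import (
	"context"
	"time"
)

// RadosLocker sketches a hypothetical generic, RADOS-backed lock that an
// encryption-specific helper could wrap; other features could reuse it later
// if a use case appears. It is not an existing ceph-csi interface.
type RadosLocker interface {
	// TryLock attempts to take the named lock, with an expiry so that a
	// crashed holder cannot block other nodes forever.
	TryLock(ctx context.Context, name string, expiry time.Duration) (acquired bool, err error)

	// Unlock releases the named lock held by this caller.
	Unlock(ctx context.Context, name string) error
}
```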

Thanks again!

Madhu-1 commented 2 weeks ago

Let me try to explain. As I understand it, the lock is already acquired in NodeStageVolume on line 170, meaning that we are already in a locked state when entering the function maybeUnlockFileEncryption. The reason this does not prevent the race condition is that the VolumeLocks map is local to each node, and multiple nodes can stage and encrypt a volume simultaneously in the RWX case. Do you agree with this understanding?

Yes, if VolumeLocks do not store their state in Rados, they are definitely not usable.

I tested the Rados omap locking with a simple proof of concept, and it appears to be working well! During testing, I created a reference object based on the volume in the refTracker. If this succeeded, I proceeded with the encryption setup and, upon exiting the function, deleted the reference. This straightforward lock mechanism resolved all issues, and all pods started up correctly.

Great work! Sounds really good 👏

@NymanRobin what about the case where we took the lock, the CSI driver restarted, and the app pod was scaled down to 1 replica so that it needs to run on a specific node instead of two nodes; will this also help in that case?

However, I believe this can be further improved by creating a dedicated interface for encryption tracking.

If you have a generic "Rados lock/mutex" function, we might be able to use it for other things too. I can't immediately think of something else than encryption, though. Making it encryption specific is fine, but ideally it is easily adaptable later on to be more generic (if we find a usecase for it).

Thanks again!

NymanRobin commented 2 weeks ago

@Madhu-1, I currently do not see any problems with the scale-down scenario, but maybe I am missing something?

I also do not foresee any issues with scaling down, as ideally, once a node has set up the encryption, there is no need to acquire the lock again. This is why I considered a specific encryption interface that first does a lookup to check whether the volume has already been encrypted, avoiding the need to lock again if it has. This approach could be an extension of the lock mechanism. A normal mutex could also work, but it would be slightly slower during large-scale operations; the key point is that the lock will only be active for a very short period of time. On another note, if the update/delete call for the lock fails, my thought is that we need to maintain a list of orphaned locks and attempt to clean them up periodically.

That said, I understand the concern about: acquire lock -> pod restarts -> orphaned lock.

That is a valid point I hadn't considered earlier. Handling a pod restart in a locked state is more challenging, and I do not have a perfect solution in mind for this scenario. Maybe implementing an expiry time for the lock could solve this? What do you think, @Madhu-1 and @nixpanic?

nixpanic commented 2 weeks ago

A lock that can time out would be ideal, I guess. Depending on how complex you want to make it, this could be an approach:

NymanRobin commented 1 week ago

Hey everyone,

I've just opened a draft PR (#4688) that addresses the concurrency issue we've been facing. While it's still a bit rough around the edges and missing some functionality, it implements what is needed to resolve the concurrency issue. I made a checklist in the PR of what could still be done or looked into.

I opened the PR in this state now because I'm going on vacation for the next four weeks and don't want to lose momentum on this issue. I'll be working to finalize as much as possible today, but any remaining tasks are open for anyone who feels inspired to take over.

From our side, @Sunnatillo will be available to assist with this.

Thanks!

Madhu-1 commented 1 week ago

Hey everyone,

I've just opened a draft PR (#4688) that addresses the concurrency issue we've been facing. While it's still a bit rough around the edges and missing some functionality, it implements what is needed to resolve the concurrency issue. I made a checklist in the PR of what could still be done or looked into.

I opened the PR in this state now because I'm going on vacation for the next four weeks and don't want to lose momentum on this issue. I'll be working to finalize as much as possible today, but any remaining tasks are open for anyone who feels inspired to take over.

From our side, @Sunnatillo will be available to assist with this.

Thanks!

@NymanRobin Thank you for taking care of it. Enjoy your vacation!

NymanRobin commented 1 week ago

Thanks a lot @Madhu-1! However, I now noticed that the go-ceph library already supports LockExclusive, so it's probably not the best idea to reinvent the wheel as I have done here; we should use that instead. On an initial look, the locking supports the functionality that we need. Oh well, if this remains I will fix my mess after the vacation 😄

Sunnatillo commented 4 days ago

Hi @Madhu-1 @nixpanic. Should we use LockExclusive from go-ceph, or is @NymanRobin's current implementation fine?

Madhu-1 commented 3 days ago

@Sunnatillo Yes, I think LockExclusive from rados (https://github.com/ceph/go-ceph/blob/ee25db94c6595ad3dca6cd66d7f290df3baa7c9a/rados/ioctx.go#L463-L508) does what was described above by @nixpanic; I am fine with using it instead of reinventing the same thing. @nixpanic can you please check this as well?
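
For illustration, here is a minimal sketch of how go-ceph's LockExclusive/Unlock could guard the one-time fscrypt setup; the pool, object name, lock name, cookie and timeout below are made up for the example, not what ceph-csi would actually use:

```go
package fscryptlock

import (
	"fmt"
	"time"

	"github.com/ceph/go-ceph/rados"
)

// setupEncryptionOnce takes an exclusive, auto-expiring lock on a per-volume
// RADOS object before running the one-time fscrypt setup.
func setupEncryptionOnce(conn *rados.Conn, pool, volumeID string) error {
	ioctx, err := conn.OpenIOContext(pool)
	if err != nil {
		return err
	}
	defer ioctx.Destroy()

	oid := "csi.fscrypt.lock." + volumeID // hypothetical lock object name
	// The duration gives the lock a built-in expiry, so a node that crashes
	// while holding it cannot block the other nodes forever.
	ret, err := ioctx.LockExclusive(oid, "fscrypt-setup", "node-cookie", "fscrypt initial setup", 30*time.Second, nil)
	if err != nil {
		return err
	}
	if ret != 0 {
		// Non-zero means the lock is held elsewhere (e.g. -EBUSY); the caller
		// can retry, or skip the setup if it was already completed.
		return fmt.Errorf("fscrypt setup lock for %s is busy (ret=%d)", volumeID, ret)
	}
	defer func() { _, _ = ioctx.Unlock(oid, "fscrypt-setup", "node-cookie") }()

	// ... perform the fscrypt metadata + policy setup exactly once here ...
	return nil
}
```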

nixpanic commented 3 days ago

Hi @Madhu-1 @nixpanic. Should we use LockExclusive from go-ceph, or is @NymanRobin's current implementation fine?

The simpler it can be done, the better!