eth-cscs / sarus

OCI-compatible engine to deploy Linux containers on HPC environments.
https://sarus.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Sarus file-lock acquisition times out with NFS4 shares #36

Open matteoguglielmi opened 1 week ago

matteoguglielmi commented 1 week ago

Sarus file-lock acquisition times out for files in ~/.sarus when user homes are shared via NFS4.X (but works with NFS3):

[1229.860589957] [node01-6321] [Flock] [WARN] Still attempting to acquire lock on file "/cluster/raid/home/software/.sarus/metadata.json-ujjdgvwdlmqabqug" after 800 ms (will timeout after 1000 milliseconds)...

Thank you for any help.

Madeeks commented 1 week ago

Hi @matteoguglielmi, thanks for reporting this. Does NFS4.X support the flock(2) system call? Sarus uses that call to implement atomic access to the local repository metadata file. Some shared or networked file systems support flock(2) only partially, or not at all, which would explain the inability to acquire a lock.
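One way to check whether flock(2) works on a given share is a small probe that polls a non-blocking exclusive lock until a deadline, similar in spirit to the 1000 ms timeout in the warning above. This is only a sketch in Python, not Sarus's implementation; the function name `try_flock` and its parameters are made up for illustration.

```python
import errno
import fcntl
import os
import time


def try_flock(path, timeout_s=1.0, poll_s=0.05):
    """Return True if an exclusive flock on `path` is acquired within timeout_s.

    Hypothetical helper for probing a filesystem; errors other than
    "lock is held by someone else" (e.g. ENOLCK) are re-raised.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        deadline = time.monotonic() + timeout_s
        while True:
            try:
                # LOCK_NB makes the call fail immediately instead of blocking.
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError as exc:
                if exc.errno not in (errno.EAGAIN, errno.EACCES):
                    raise
                if time.monotonic() >= deadline:
                    return False
                time.sleep(poll_s)
            else:
                fcntl.flock(fd, fcntl.LOCK_UN)
                return True
    finally:
        os.close(fd)
```

Running `try_flock` on a scratch file under `~/.sarus` on the NFS share and on a local directory should show whether the timeout is specific to the share.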

matteoguglielmi commented 1 week ago

Hi @Madeeks, I found this thread, which seems to explain why flock(2) is not working with NFS4 and suggests using alternate functions. I've also compiled and tested the C code posted there, and I get the same error message when running it on an NFS4 share.
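The linked thread is not reproduced here, so which alternate functions it recommends is unknown. One commonly used alternative to flock(2) is POSIX record locking through fcntl(2), which Python exposes as `fcntl.lockf`. A minimal sketch, assuming that is the kind of replacement meant (the helper name `posix_record_lock` is hypothetical):

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def posix_record_lock(path, shared=False):
    """Hold a whole-file fcntl(2) record lock on `path` inside the with-block.

    Hypothetical helper. The call blocks until the lock is granted.
    A shared lock needs a readable descriptor, an exclusive lock a writable one.
    """
    flags = os.O_RDONLY if shared else os.O_RDWR
    fd = os.open(path, flags | os.O_CREAT, 0o644)
    try:
        fcntl.lockf(fd, fcntl.LOCK_SH if shared else fcntl.LOCK_EX)
        try:
            yield fd
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

Note that POSIX record locks belong to the process, not to the file descriptor: they are dropped as soon as the process closes any descriptor for that file, so they need more care than flock(2) inside a larger program.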

Madeeks commented 5 days ago

Thanks for confirming the missing flock(2) support on NFS4. In the short term, we can make the lock implementation selectable through a configuration parameter and re-introduce the old implementation, based on an explicit lockfile created by Sarus, as an alternative to flock-based locking. The old code supports only exclusive locking, which causes noticeable delays when starting O(1000) containers, and its cleanup is not super-robust, but it should work on any kind of filesystem.
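For readers unfamiliar with the lockfile approach, the general technique looks roughly like the sketch below: the lock is a separate file created with `O_CREAT | O_EXCL`, so creation succeeds for exactly one process at a time. This is an illustration only, not the old Sarus code; the helper name `exclusive_lockfile`, the `.lock` suffix and the timeout values are assumptions.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def exclusive_lockfile(target, timeout_s=1.0, poll_s=0.05):
    """Guard `target` with an explicit `<target>.lock` file (exclusive only).

    Hypothetical helper. Raises TimeoutError if the lock file still
    exists after timeout_s seconds.
    """
    lock_path = target + ".lock"
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # O_EXCL makes creation fail if the lock file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
            os.close(fd)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not create {lock_path} within {timeout_s} s")
            time.sleep(poll_s)
    try:
        yield lock_path
    finally:
        # Releasing the lock means deleting the file.
        os.unlink(lock_path)
```

If a process is killed between creating and unlinking the file, the stale lock file stays behind and blocks later runs until someone removes it, which is one way a cleanup weakness of this kind can show up. There is also no shared (read) mode, so readers serialize behind each other.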