Open clil16 opened 2 weeks ago
This is very interesting. It is possible that a file system change on a host does not immediately become visible on other hosts but I think it is guaranteed by the POSIX specification that they are ordered within a process.
What is the backend of your NFS storage? Can't you fix this by changing server or client options (e.g. caching parameters)? Because nobody has reported this error before, I suspect this is very specific to your system, and we are reluctant to incorporate changes only to address a sub-optimally configured system.
Thank you for taking the time to look at this issue.
> This is very interesting. It is possible that a file system change on a host does not immediately become visible on other hosts but I think it is guaranteed by the POSIX specification that they are ordered within a process.
In this case, I was just opening the RELION GUI so only a single host was involved. I agree it is very interesting and strange but as I understand it NFS is not POSIX compliant.
> What is the backend of your NFS storage? Can't you fix this by changing server or client options (e.g. caching parameters)? Because nobody has reported this error before, I suspect this is very specific to your system, and we are reluctant to incorporate changes only to address a sub-optimally configured system.
Our NFS storage system is a Dell PowerScale cluster (formerly known as an Isilon) that has been working extremely well for several years and no changes were made recently to any of the systems involved. I'm working on ways to test this more thoroughly but with the issue being intermittent it is difficult to recreate. I don't think our systems are sub-optimally configured but there may be something related to the cache causing weirdness.
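Since the failure is intermittent, one way to try to recreate it outside RELION is to repeat the same sequence of operations many times from a single process. The following is only a rough sketch under that assumption, written in Python rather than RELION's C++; the directory and file names are hypothetical and chosen so they cannot clash with a real `.relion_lock`.

```python
import os
import sys
import tempfile


def stress_lock_cycle(base_dir, iterations=1000):
    """Repeat mkdir -> create -> unlink -> rmdir and collect any failures."""
    # Hypothetical names, not the ones RELION uses.
    lock_dir = os.path.join(base_dir, ".lock_ordering_test")
    lock_file = os.path.join(lock_dir, "lock_test.star")
    failures = []
    for i in range(iterations):
        try:
            os.mkdir(lock_dir)
            fd = os.open(lock_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
            os.close(fd)
            os.unlink(lock_file)
            os.rmdir(lock_dir)  # the step reported to fail in this thread
        except OSError as exc:
            failures.append((i, exc.errno, exc.strerror))
            # Best-effort cleanup so the next iteration starts from scratch.
            for remove, path in ((os.unlink, lock_file), (os.rmdir, lock_dir)):
                try:
                    remove(path)
                except OSError:
                    pass
    return failures


if __name__ == "__main__":
    # Pass a directory on the NFS share; the local temp dir serves as a control.
    target = sys.argv[1] if len(sys.argv) > 1 else tempfile.gettempdir()
    failed = stress_lock_cycle(target)
    print(f"{len(failed)} failed iteration(s) in {target}")
    for iteration, err, message in failed[:10]:
        print(f"  iteration {iteration}: errno {err} ({message})")
```

Running it once against a local disk and once against the NFS mount would show whether the problem can be reproduced without RELION, and the recorded errno values could be included when reporting to the storage vendor.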
@clil16 Can you contact Dell support on the issue?
I just wanted to chip in and mention that we have seen very similar behavior to this starting in the last few months.
We have several instances where the lock removal error is triggered by users running jobs. This is generally with the v5 beta, but several different commits. We have seen this intermittently with users running jobs as usual via the GUI, and frequently when triggering jobs to run via the relion schemer.
We see this on multiple client machines, and from several different storage servers running both NFS v3 and v4.
@biochem-fan I can reach out to them regarding this issue but it seems like this may be an issue for others as well given the note by @sdrawson .
@clil16 Please ask Dell. I believe it is the file system's responsibility to keep the ordering of operations within a process. If they say it is not, please ask them what is guaranteed and what is not.
Just to make sure, aren't you using the `async` option when mounting NFS? The default is `sync`.
I will reach out to Dell and inquire.
We are using the `async` option as that is default but we're working on testing the `sync` option.
For reference, here are the mount options we specify on the client (The hostnames, paths, and IP addresses have been sanitized):
nfsserver.fqdn:/path/to/share on /nfs/share type nfs4 (rw,nosuid,nodev,noatime,nodiratime,vers=4.2,rsize=1047672,wsize=1047532,namlen=255,hard,proto=tcp,nconnect=8,timeo=600,retrans=5,sec=sys,clientaddr=xxx.xxx.xxx.xxx,local_lock=none,addr=xxx.xxx.xxx.xxx)
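Because the discussion turns on which options are actually in effect, a small helper along these lines could list them for every NFS mount on a client. This is a sketch that assumes a Linux client exposing `/proc/mounts`; the function names are made up for illustration. Options are matched as whole tokens, so `async` is never mistaken for `sync`.

```python
import os


def parse_mount_options(options):
    """Turn 'rw,vers=4.2,hard' (optionally in parentheses) into a dict."""
    parsed = {}
    for token in options.strip().strip("()").split(","):
        token = token.strip()
        if not token:
            continue
        key, sep, value = token.partition("=")
        parsed[key] = value if sep else True  # flags without a value map to True
    return parsed


def nfs_mounts(mounts_file="/proc/mounts"):
    """Yield (mount_point, fs_type, options_dict) for each NFS mount."""
    with open(mounts_file) as handle:
        for line in handle:
            fields = line.split()
            # Expected layout: device mount_point fs_type options dump pass
            if len(fields) >= 4 and fields[2].startswith("nfs"):
                yield fields[1], fields[2], parse_mount_options(fields[3])


if __name__ == "__main__":
    if os.path.exists("/proc/mounts"):
        for mount_point, fs_type, opts in nfs_mounts():
            explicit_sync = "sync" in opts
            print(f"{mount_point} ({fs_type}): sync listed = {explicit_sync}, "
                  f"lookupcache = {opts.get('lookupcache', 'not listed')}")
```

Applied to the option string quoted above, `parse_mount_options` reports `vers` as `4.2` and finds neither `sync` nor `async` listed explicitly.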
Just for the record, we just tried 'sync' option and it is a lot worse than 'async'. Relion fails with the "error in removing directory .relion_lock" almost instantly.
P.S. I am a coworker of @sdrawson.
We use `auto,nofail,noatime,nolock,intr,tcp,lookupcache=pos`.
Recently we have been seeing users with established RELION project directories lose the ability to open the RELION GUI in those directories, using the following versions of RELION: 4.0.1-commit-e5c483 and 5.0-beta-4-commit-33b2b0. The behavior was the same each time it occurred, but it was intermittent.
We are using NFS v4.2 for our storage.
Environment:
Dataset:
Job options:
`note.txt` in the job directory):
Error message:
After tracing relion with strace, I saw the following at the end of the strace output:
Once the program errors and exits, the file `.relion_lock/lock_default_pipeline.star` is in fact removed (as the return code of 0 from `unlink` indicates), but it seems the `rmdir` on the directory is occurring too fast, which is causing the error.
I was able to fix this issue with the following patch for pipeliner.cpp on the 4.0.2 git tag:
I could create a pull request as well if you'd like.
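For readers who want to experiment with the idea, the general workaround of retrying the directory removal after a short pause can be sketched as follows. This is not the patch mentioned above and not RELION code (RELION is C++). It is a Python illustration that assumes the `rmdir` failure is a transient `ENOTEMPTY` (or `EEXIST`, which POSIX also permits); the function name, retry count, and delay are arbitrary.

```python
import errno
import os
import time


def remove_lock(lock_dir, lock_file, retries=5, delay=0.1):
    """Unlink the lock file, then rmdir the lock directory with retries.

    Returns the number of rmdir attempts that were needed.
    """
    os.unlink(lock_file)
    attempts = max(1, retries)
    for attempt in range(1, attempts + 1):
        try:
            os.rmdir(lock_dir)
            return attempt
        except OSError as exc:
            transient = exc.errno in (errno.ENOTEMPTY, errno.EEXIST)
            if not transient or attempt == attempts:
                raise  # unrelated error, or the directory never became empty
            time.sleep(delay)  # give the client time to see the unlink
```

A call such as `remove_lock(".relion_lock", ".relion_lock/lock_default_pipeline.star")` returns 1 when the first `rmdir` succeeds. A larger value means the directory still looked non-empty right after the `unlink`, which would be worth mentioning when reporting the problem.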