3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

RELION race condition when opening GUI #1198

Open clil16 opened 2 weeks ago

clil16 commented 2 weeks ago

Recently we have been seeing users with established RELION project directories lose the ability to open the RELION GUI in those directories, using the following versions of RELION: 4.0.1-commit-e5c483 and 5.0-beta-4-commit-33b2b0. The behavior was the same each time, although occasionally the error did not occur.

We are using NFS v4.2 for our storage.

Error message:

in: /tmp/build/relion/src/pipeliner.cpp, line 2341
ERROR:
ERROR: PipeLine::write reported error in removing directory .relion_lock
=== Backtrace ===
relion(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_1+0x69) [0x47a639]
relion() [0x45a085]
relion(_ZN8PipeLine22checkProcessCompletionEv+0xbd9) [0x58a199]
relion(_ZN13GuiMainWindow14updateJobListsEv+0x10) [0x48a50]
relion(_ZN13GuiMainWindow23cb_select_browsegroup_iEb+0x91) [0x496021]
relion(_ZN13GuiMainWindowC2EiiPKc8FileNameiibbbb+0x30fa) [0x480aa]
relion (main+0x7fb) [0x466ccb]
/lib64/libc.so.6(__libc_start_main+0xe5) [0x7f17b4cd17e5]
relion(_start+0x2e) [0x47289e]
==================
ERROR:
ERROR: PipeLine::write reported error in removing directory .relion_lock

After tracing relion with strace, I saw the following at the end of the output:

stat(".relion_lock/lock_default_pipeline.star", {st_mode=S_IFREG|0640, st_size=23, ...}) = 0
unlink(".relion_lock/lock_default_pipeline.star") = 0
rmdir(".relion_lock")                   = -1 ENOTEMPTY (Directory not empty)
futex(0x7f0a518230a8, FUTEX_WAKE_PRIVATE, 2147483647) = 0

Once the program errors out and exits, the file .relion_lock/lock_default_pipeline.star has in fact been removed (as the return code of 0 from unlink indicates), but it seems the rmdir on the directory occurs too soon after the unlink, which causes the error.
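
For anyone who wants to try reproducing this outside RELION, a small standalone loop can replay the same create/unlink/rmdir sequence on the NFS mount. This is only a sketch, not RELION code: the names mirror the strace output above, and the iteration count is arbitrary.

// lock_stress.cpp -- replay the lock teardown seen in the strace output.
// Build with: g++ -o lock_stress lock_stress.cpp, then run inside the NFS mount.
#include <cstdio>
#include <fstream>
#include <string>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    const std::string dir  = ".relion_lock";
    const std::string lock = dir + "/lock_default_pipeline.star";

    for (int i = 0; i < 10000; i++)
    {
        // Same sequence RELION performs around the pipeline STAR file:
        if (mkdir(dir.c_str(), 0750) != 0) { perror("mkdir"); return 1; }
        std::ofstream(lock) << "lock\n";      // create the lock file (closed at end of statement)

        if (unlink(lock.c_str()) != 0) { perror("unlink"); return 1; }
        if (rmdir(dir.c_str()) != 0)          // the call that fails in RELION
        {
            perror("rmdir");                  // expect ENOTEMPTY on a bad run
            std::fprintf(stderr, "failed on iteration %d\n", i);
            return 1;
        }
    }
    std::puts("no rmdir failure in 10000 iterations");
    return 0;
}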

I was able to fix this issue with the following patch for pipeliner.cpp on the 4.0.2 git tag:

--- src/pipeliner.cpp.org   2024-11-07 13:53:10
+++ src/pipeliner.cpp   2024-11-07 13:58:01
@@ -2300,8 +2300,20 @@
            REPORT_ERROR("ERROR: PipeLine::write was expecting a file called "+fn_lock+ " but it is no longer there.");
        if (std::remove(fn_lock.c_str()))
            REPORT_ERROR("ERROR: PipeLine::write reported error in removing file "+fn_lock);
-       if (rmdir(dir_lock.c_str()))
-           REPORT_ERROR("ERROR: PipeLine::write reported error in removing directory "+dir_lock);
+       // Retry the rmdir for up to 5 seconds: on NFS the unlink above can
+       // succeed while the directory momentarily still appears non-empty.
+       int iwait = 0;
+       int status = rmdir(dir_lock.c_str());
+       while (status != 0)
+       {
+           usleep(500000); // 0.5 s
+           status = rmdir(dir_lock.c_str());
+           iwait++;
+           if (iwait > 9)
+           {
+               REPORT_ERROR("ERROR: PipeLine::write reported error in removing directory " + dir_lock + ". Waited 5 seconds for the directory to be removed successfully.");
+           }
+       }
    }

    // Touch a file to indicate to the GUI that the pipeline has just changed
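
One note on the retry loop: it uses usleep() rather than sleep(), because sleep() takes an unsigned integer number of seconds, so a fractional argument like sleep(0.5) would truncate to sleep(0) and spin without waiting. usleep() is declared in unistd.h, which pipeliner.cpp should already pull in for rmdir().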

I could create a pull request as well if you'd like.

biochem-fan commented 1 week ago

This is very interesting. It is possible that a file system change on one host does not immediately become visible on other hosts, but I think the POSIX specification guarantees that operations are ordered within a single process.

What is the backend of your NFS storage? Can't you fix this by changing server or client options (e.g. a caching parameter)? Because nobody has reported this error before, I suspect it is very specific to your system, and we are reluctant to incorporate changes only to work around a sub-optimally configured system.

clil16 commented 1 week ago

Thank you for taking the time to look at this issue.

> This is very interesting. It is possible that a file system change on one host does not immediately become visible on other hosts, but I think the POSIX specification guarantees that operations are ordered within a single process.

In this case I was just opening the RELION GUI, so only a single host was involved. I agree it is very interesting and strange, but as I understand it NFS is not fully POSIX compliant (it guarantees only close-to-open cache consistency).

> What is the backend of your NFS storage? Can't you fix this by changing server or client options (e.g. a caching parameter)? Because nobody has reported this error before, I suspect it is very specific to your system, and we are reluctant to incorporate changes only to work around a sub-optimally configured system.

Our NFS storage system is a Dell PowerScale cluster (formerly known as Isilon) that has been working extremely well for several years, and no changes were made recently to any of the systems involved. I'm working on ways to test this more thoroughly, but since the issue is intermittent it is difficult to recreate. I don't think our systems are sub-optimally configured, but there may be something cache-related causing the weirdness.

biochem-fan commented 1 week ago

@clil16 Can you contact Dell support on the issue?

sdrawson commented 1 week ago

I just wanted to chip in and mention that we have been seeing very similar behavior over the last few months.

We have seen several instances where the lock-removal error is triggered by users running jobs. This is generally with the v5 beta, across several different commits. We see it intermittently when users run jobs as usual via the GUI, and frequently when jobs are triggered via the RELION schemer.

We see this on multiple client machines, and from several different storage servers running both NFS v3 and v4.

clil16 commented 1 week ago

@biochem-fan I can reach out to them regarding this issue, but given the note by @sdrawson it seems this may be an issue for others as well.

biochem-fan commented 1 week ago

@clil16 Please ask Dell. I believe it is the file system's responsibility to keep the ordering of operations within a process. If they say it is not, please ask them what is and is not guaranteed.

biochem-fan commented 1 week ago

Just to make sure, aren't you using the async option when mounting NFS? The default is sync.

clil16 commented 1 week ago

I will reach out to Dell and inquire.

We are using the async option, as that is the default, but we are working on testing the sync option.

For reference, here are the mount options we specify on the client (hostnames, paths, and IP addresses have been sanitized):

nfsserver.fqdn:/path/to/share on /nfs/share type nfs4 (rw,nosuid,nodev,noatime,nodiratime,vers=4.2,rsize=1047672,wsize=1047532,namlen=255,hard,proto=tcp,nconnect=8,timeo=600,retrans=5,sec=sys,clientaddr=xxx.xxx.xxx.xxx,local_lock=none,addr=xxx.xxx.xxx.xxx)
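
If the client-side cache turns out to be involved, the lookupcache= and actimeo= mount options control dentry and attribute caching on the client; we have not experimented with those yet, so this is only speculation at this point.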

filonovd commented 3 days ago

Just for the record, we tried the 'sync' option and it is a lot worse than 'async': RELION fails with the "error in removing directory .relion_lock" almost instantly.

P.S. I am a coworker of @sdrawson.

biochem-fan commented 3 days ago

We use auto,nofail,noatime,nolock,intr,tcp,lookupcache=pos.