What file system are you using? This does seem very related to #1289. As @adamnovak suggested, the fix is supposed to be specifying --coordinationDir. When you used it, did you set it to a path on the local, physical storage on your worker nodes?
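If it helps, a quick probe like the one below (just a rough sketch, not anything Cactus ships; the file name is made up) tries to take the same kind of exclusive fcntl lock Toil needs, so you can run it from a worker node against any candidate directory:

import fcntl
import os
import sys

# Hypothetical probe: pass a candidate --coordinationDir as the first argument.
candidate_dir = sys.argv[1]
lock_path = os.path.join(candidate_dir, "toil_lock_probe")

fd = os.open(lock_path, os.O_CREAT | os.O_WRONLY)
try:
    fcntl.lockf(fd, fcntl.LOCK_EX)   # same style of exclusive lock Toil takes
    print("exclusive lock acquired in", candidate_dir)
    fcntl.lockf(fd, fcntl.LOCK_UN)
except OSError as e:
    print("locking failed:", e)      # e.g. "No locks available" on some NFS mounts
finally:
    os.close(fd)
    os.unlink(lock_path)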
Thanks for the fast reply @glennhickey
Hardware is a bit outside my specialty, but I'll do my best to answer.
Tier 1 storage: Isilon Gen6 platform, H500 & A2000 storage nodes
Cluster access to Tier 1 storage: NFSv3
End-user access to Tier 1 storage: SMB
Archive storage: iRODS
So NFSv3 is the file system. As recommended, I am not running cactus-pangenome as an sbatch job; instead, I run it directly in the terminal after I ssh into the cluster.
I have my fasta.gz files and cactus in the active storage space. I make the directories ./scratch and ./locks as below:
In the ./example directory, I was testing with the primate dataset and was able to replicate the issue.
But basically I have no reason to believe the Slurm jobs can't access either the ./scratch or ./locks directory.
I think I discovered the cause while digging through the Toil code. It looks like the primary issue is indeed the file system.
At first, I tried running cactus-pangenome in an sbatch script. It was giving me issues, probably because the Slurm jobs that get triggered are trying to read from and write to the main node. The way around that issue was to specify a common location all nodes can read and write to:
--workDir ./scratch --coordinationDir ./locks
THEN I found out you should run cactus-pangenome from the head node and not as an sbatch job (this seems to be a mistake multiple people have made, so maybe it is worth emphasizing or explaining a bit more in the documentation?). So now I run cactus-pangenome from the head node, but I kept --workDir.
The problem seems to be rooted in the Toil code below:
while True:
    # Try to create the file, ignoring if it exists or not.
    fd = os.open(lock_filename, os.O_CREAT | os.O_WRONLY)
    #time.sleep(1)
    # Wait until we can exclusively lock it.
    fcntl.lockf(fd, fcntl.LOCK_EX)
I added the sleep line. Without the sleep, I get a "No locks available" error. With the sleep, I get a "Stale file handle" error. It looks like it happens during a cleanup step, and there is some competition to remove a certain file which doesn't play well with NFS. At least, that's the best guess I have.
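For what it's worth, those two messages look like the classic NFS errno values; assuming Linux error strings, they correspond to ENOLCK and ESTALE:

import errno
import os

print(os.strerror(errno.ENOLCK))  # "No locks available"
print(os.strerror(errno.ESTALE))  # "Stale file handle" (Linux wording)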
Either way, dropping --workDir seems to have solved this lock problem, so that's the solution I'm going with. Thanks for the software, by the way! Very useful for pangenomes.
I would recommend not using directories under . for --workDir and --coordinationDir. Instead, pick paths that exist on all the nodes but are not shared across nodes, something like --workDir /tmp --coordinationDir /tmp, or the defaults when the options are not specified, which are more or less that. You're right that all the workers are meant to compete over these files, and that that doesn't work well on some NFS setups (specifically the ones that are optimized for speed over global consistency of metadata, I think).
So it sounds like you got it working that way.
We might want some special Toil code to handle "no locks available" and stale file handle errors from NFS in some smarter way here? We might be able to retry on some of them, as long as the NFS setup actually does support locking.
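Purely as a sketch of the idea (not actual Toil code, and the helper name here is made up), the retry could look something like this, backing off when NFS reports ENOLCK or ESTALE:

import errno
import fcntl
import os
import time

def acquire_lock_with_retries(lock_filename, attempts=10, delay=1.0):
    # Hypothetical helper: retry the exclusive lock when NFS returns
    # transient "No locks available" / "Stale file handle" errors.
    for attempt in range(attempts):
        fd = os.open(lock_filename, os.O_CREAT | os.O_WRONLY)
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX)
            return fd  # caller unlocks and closes when done
        except OSError as e:
            os.close(fd)
            if e.errno in (errno.ENOLCK, errno.ESTALE) and attempt < attempts - 1:
                time.sleep(delay)  # back off, then try again
                continue
            raise

That would only help if the NFS export actually supports fcntl locking at all, of course.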
Thanks for the reply and suggestion! "Exist" but not "shared" makes a lot more sense. A different implementation could work, or maybe easier (and potentially more useful) would just be a more helpful error message than the default "No locks available".
@wjk1214 fcntl.fcntl and fcntl.lockf are different locking systems, and Toil doesn't use the lockf one that that issue talks about. But it does need somewhere where the fcntl.fcntl one works.
We should be able to handle this particular error, if it eventually goes away and we can eventually get a lock. I've opened https://github.com/DataBiosphere/toil/issues/4846 to track that.
I am attempting to run cactus-pangenome on Slurm. I am running cactus v2.7.2 (legacy binary).
The error:
By removing --batchSystem slurm, the sanitize_fasta_header job will run to completion. But I want to be able to actually use Slurm. It seems there's some complication between Slurm/Toil/Cactus that doesn't allow me to lock a file. I tried using --coordinationDir but it seems to have no effect. Additionally, I tried cactus v2.7.2 with Python 3.7.10 and Toil 5.12.0 (instead of Python 3.11 and Toil 6.0.0) and I get the same error. I'm wondering if it's an issue with Slurm...
And another clue as I continue to troubleshoot this: setting --maxJobs 1 seems to fix it, but again, this is not an ideal setting to be using. Any advice or guidance is welcome.