ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

OSError: [Errno 37] No locks available #1299

Closed: ScottMastro closed this issue 7 months ago

ScottMastro commented 7 months ago

I am attempting to run cactus-pangenome on Slurm. I am running cactus v2.7.2 (legacy binary).

cactus-pangenome ./js samples.seqfile \
     --workDir ./scratch --coordinationDir ./locks \
     --outName hifi-v1.0-mc-chm13 --outDir hifi-v1.0-mc-chm13 \
     --reference CHM13 GRCh38 --filter 9 \
     --gfa --vcf  --viz \
     --odgi full filter  \
     --giraffe clip filter \
     --chrom-vg clip filter \
     --chrom-og --gbz clip filter full \
     --vcf --logFile hifi-v1.0-mc-chm13.log \
     --batchSystem slurm \
     --mgCores 1 --consCores 1 --indexCores 1

The error:

[2024-02-28T22:43:26-0500] [MainThread] [W] [toil.leader] Job failed with exit value 1: 'sanitize_fasta_header' kind-sanitize_fasta_header/instance-gqsp8y1w v1
Exit reason: None
[2024-02-28T22:43:26-0500] [MainThread] [W] [toil.leader] No log file is present, despite job failing: 'sanitize_fasta_header' kind-sanitize_fasta_header/instance-gqsp8y1w v1
[2024-02-28T22:43:26-0500] [MainThread] [W] [toil.leader] The batch system left an empty file /scratch/toil_73f8ebad-f857-46ad-8558-9546a5d86960.62.9113807.out.log
[2024-02-28T22:43:26-0500] [MainThread] [W] [toil.leader] The batch system left a non-empty file scratch/toil_73f8ebad-f857-46ad-8558-9546a5d86960.62.9113807.err.log:
[2024-02-28T22:43:26-0500] [MainThread] [W] [toil.leader] Log from job "kind-sanitize_fasta_header/instance-gqsp8y1w" follows:
=========>
    Traceback (most recent call last):
      File "tools/legacy/cactus-bin-v2.7.2/cactus_env/bin/_toil_worker", line 10, in <module>
        sys.exit(main())
                 ^^^^^^
      File "tools/legacy/cactus-bin-v2.7.2/cactus_env/lib/python3.11/site-packages/toil/worker.py", line 729, in main
        with in_contexts(options.context):
      File "anaconda3/lib/python3.11/contextlib.py", line 137, in __enter__
        return next(self.gen)
               ^^^^^^^^^^^^^^
      File "tools/legacy/cactus-bin-v2.7.2/cactus_env/lib/python3.11/site-packages/toil/worker.py", line 703, in in_contexts
        with manager:
      File "tools/legacy/cactus-bin-v2.7.2/cactus_env/lib/python3.11/site-packages/toil/batchSystems/cleanup_support.py", line 75, in __enter__
        self.arena.enter()
      File "tools/legacy/cactus-bin-v2.7.2/cactus_env/lib/python3.11/site-packages/toil/lib/threading.py", line 481, in enter
        with global_mutex(self.base_dir, self.mutex):
      File "anaconda3/lib/python3.11/contextlib.py", line 137, in __enter__
        return next(self.gen)
               ^^^^^^^^^^^^^^
      File "tools/legacy/cactus-bin-v2.7.2/cactus_env/lib/python3.11/site-packages/toil/lib/threading.py", line 378, in global_mutex
        fcntl.lockf(fd, fcntl.LOCK_EX)
    OSError: [Errno 37] No locks available
<=========

By removing --batchSystem slurm, the sanitize_fasta_header job runs to completion, but I want to be able to actually use Slurm. It seems there is some complication between Slurm/Toil/Cactus that prevents the file from being locked. I tried using --coordinationDir, but it seems to have no effect.

Additionally, I tried cactus v2.7.2 with Python 3.7.10 and Toil 5.12.0 (instead of Python 3.11 and Toil 6.0.0) and got the same error. I'm wondering if it's an issue with Slurm...

Another clue as I continue to troubleshoot: setting --maxJobs 1 seems to fix it, but again, this is not an ideal setting to be using.

Any advice or guidance is welcome.

glennhickey commented 7 months ago

What file system are you using? This does seem very related to #1289. As @adamnovak suggested, the fix should be to specify --coordinationDir. When you used it, did you set it to a path on the local, physical storage on your worker nodes?

ScottMastro commented 7 months ago

Thanks for the fast reply @glennhickey

The hardware knowledge is outside my specialty but I'll do my best to answer.

    Tier 1 storage - Isilon Gen6 platform, H500 & A2000 storage nodes
        cluster access to Tier 1 storage: NFSv3
        end user access to Tier 1 storage: SMB
    Archive storage - iRODS

NFSv3 is the filesystem. As recommended, I am not running cactus-pangenome as an sbatch job; instead, I run it directly in the terminal after I ssh into the cluster.

I have my fasta.gz files and cactus in the active storage space. I make the directories ./scratch and ./locks as below:

[screenshot of the ./scratch and ./locks directory setup]

In the ./example directory, I was testing with the primate dataset and was able to replicate the issue.

But basically, I have no reason to believe the Slurm jobs can't access either the ./scratch or ./locks directory.

ScottMastro commented 7 months ago

I think I discovered the issue while digging into the Toil code. It looks like the primary issue is indeed the file system.

At first, I tried running cactus-pangenome in an sbatch script. It was giving me issues, probably because the Slurm jobs that get triggered were trying to read and write to the main node. The way around that issue was to specify a common location all nodes can read and write to:

--workDir ./scratch --coordinationDir ./locks

Then I found out you should run cactus-pangenome from the head node and not as an sbatch job (this seems to be a mistake multiple people have made, so maybe it is worth emphasizing or explaining a bit more on the documentation page?). So now I run cactus-pangenome from the head node, but I kept --workDir.

The problem seems to be rooted in the Toil code below:

    while True:
        # Try to create the file, ignoring if it exists or not.
        fd = os.open(lock_filename, os.O_CREAT | os.O_WRONLY)

        #time.sleep(1)        

        # Wait until we can exclusively lock it.
        fcntl.lockf(fd, fcntl.LOCK_EX)

I added the sleep line. Without the sleep, I get "No locks available"; with the sleep, I get a stale file handle error. It looks like this happens during a cleanup step, and that there is some competition to remove a certain file that doesn't play well with NFS. At least, that's the best guess I have.
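
For anyone who wants to check whether a given --workDir or --coordinationDir actually supports this kind of lock, here is a minimal probe along the lines of what Toil does (the file name is just for illustration):

    import errno
    import fcntl
    import os
    import sys

    # Take the same exclusive POSIX lock that Toil's global_mutex takes,
    # on a throwaway file inside the directory given on the command line.
    probe_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    probe_path = os.path.join(probe_dir, "lock_probe.tmp")

    fd = os.open(probe_path, os.O_CREAT | os.O_WRONLY)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)
        print(f"lockf succeeded in {probe_dir}")
        fcntl.lockf(fd, fcntl.LOCK_UN)
    except OSError as e:
        if e.errno == errno.ENOLCK:
            print(f"No locks available in {probe_dir} (errno 37)")
        else:
            raise
    finally:
        os.close(fd)
        os.unlink(probe_path)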

Either way, dropping --workDir seems to have solved this lock problem, so that's the solution I'm going with. Thanks for the software, by the way! Very useful for pangenomes.

adamnovak commented 7 months ago

I would recommend not using directories under . for --workDir and --coordinationDir. Instead, pick paths that exist on all the nodes but are not shared across nodes: something like --workDir /tmp --coordinationDir /tmp, or the defaults when the options are not specified, which are more or less that. You're right that all the workers are meant to compete over these files, and that this doesn't work well on some NFS setups (specifically the ones that are optimized for speed over global consistency of metadata, I think).
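
For example, the command at the top of this issue would become something like the following (most of the output and index options are omitted for brevity, and /tmp is just a stand-in for whatever node-local scratch space your cluster provides):

cactus-pangenome ./js samples.seqfile \
     --workDir /tmp --coordinationDir /tmp \
     --outName hifi-v1.0-mc-chm13 --outDir hifi-v1.0-mc-chm13 \
     --reference CHM13 GRCh38 \
     --batchSystem slurm \
     --mgCores 1 --consCores 1 --indexCores 1

The job store (./js) and the outputs still need to live somewhere all the nodes can see; it is only the work and coordination directories that want to be node-local.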

So it sounds like you got it working that way.

We might want some special Toil code to handle "no locks available" and stale file handle errors from NFS in some smarter way here? We might be able to retry on some of them, as long as the NFS setup actually does support locking.
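
For what it's worth, a rough sketch of the kind of retry that might work (hypothetical, not actual Toil code), assuming the NFS errors are transient:

    import errno
    import fcntl
    import time

    def lockf_with_retry(fd, attempts=10, delay=1.0):
        # Hypothetical helper: if the filesystem transiently reports ENOLCK (37)
        # or ESTALE, wait and retry a bounded number of times before giving up.
        for attempt in range(attempts):
            try:
                fcntl.lockf(fd, fcntl.LOCK_EX)
                return
            except OSError as e:
                if e.errno not in (errno.ENOLCK, errno.ESTALE) or attempt == attempts - 1:
                    raise
                time.sleep(delay)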

ScottMastro commented 7 months ago

Thanks for the reply and suggestion! "Exist" but not "shared" makes a lot more sense. A different implementation could work, or maybe easier (and potentially more useful) would just be a more helpful error message than the default "No locks available".

wjk1214 commented 6 months ago

https://github.com/conda/conda/issues/13534 FYI

adamnovak commented 6 months ago

@wjk1214 fcntl.fcntl and fcntl.lockf are different locking systems, and Toil doesn't use the lockf one that that issue talks about. But it does need somewhere where the fcntl.fcntl one works.
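
For anyone unfamiliar with the distinction, both interfaces live in Python's fcntl module; a rough illustration (the struct flock layout below follows the example in the Python fcntl documentation and is platform-dependent):

    import fcntl
    import os
    import struct

    fd = os.open("example.lock", os.O_CREAT | os.O_WRONLY)

    # Taking and releasing a lock via fcntl.lockf:
    fcntl.lockf(fd, fcntl.LOCK_EX)
    fcntl.lockf(fd, fcntl.LOCK_UN)

    # Taking a lock via fcntl.fcntl with F_SETLK and a packed struct flock:
    lockdata = struct.pack('hhllhh', fcntl.F_WRLCK, 0, 0, 0, 0, 0)
    fcntl.fcntl(fd, fcntl.F_SETLK, lockdata)

    os.close(fd)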

We should be able to handle this particular error, if it eventually goes away and we can eventually get a lock. I've opened https://github.com/DataBiosphere/toil/issues/4846 to track that.