ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

FileNotFoundError in minigraph_map_one #1289

Open Han-Cao opened 5 months ago

Han-Cao commented 5 months ago

Hi,

I am running the mc pangenome pipeline on SLURM. Among the minigraph_map_one jobs, a few fail with errors like FileNotFoundError: [Errno 2] No such file or directory: tmp/......./deferred and FileNotFoundError: [Errno 2] No such file or directory: tmp/......cleanup-arena-members. Below is the log of a failed job.

Those tmp files and directories do exist on the server and should be used by all minigraph_map_one jobs. It seems weird that only some jobs fail while the others run without any error. Jobs on all compute nodes can have this issue, so it is not caused by a specific node.

Besides, this error can be resolved by re-running the job multiple times; I finally completed this step after re-running it 6-7 times.

Do you have any ideas about this issue?

The job seems to have left a log file, indicating failure: 'minigraph_map_one' kind-minigraph_map_one/instance-01nbxd0o v8
Log from job "kind-minigraph_map_one/instance-01nbxd0o" follows:
=========>
    [2024-02-22T09:46:17+0800] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
    [2024-02-22T09:46:17+0800] [MainThread] [I] [toil] Running Toil version 6.0.0-0e2a07a20818e593bfdfde3cc51ca4ad809fde96 on host cpu07.hpc.cluster.
    [2024-02-22T09:46:17+0800] [MainThread] [I] [toil.worker] Working on job 'minigraph_map_one' kind-minigraph_map_one/instance-01nbxd0o v7
    [2024-02-22T09:46:18+0800] [MainThread] [I] [toil.worker] Loaded body Job('minigraph_map_one' kind-minigraph_map_one/instance-01nbxd0o v7) from description 'minigraph_map_one' kind-minigraph_map_one/instance-01nbxd0o v7
    [2024-02-22T09:48:44+0800] [MainThread] [I] [cactus.shared.common] Running the command ['bash', '-c', 'set -eo pipefail && minigraph /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/ef91/db4a/tmp2_c42imh/mg.gfa /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/ef91/db4a/tmp2_c42imh/TAD583.2.fa -o /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/ef91/db4a/tmp2_c42imh/TAD583.2.gaf -c -xasm -t 8']
    [2024-02-22T09:48:44+0800] [MainThread] [I] [toil-rt] 2024-02-22 09:48:44.095508: Running the command: "bash -c set -eo pipefail && minigraph /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/ef91/db4a/tmp2_c42imh/mg.gfa /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/ef91/db4a/tmp2_c42imh/TAD583.2.fa -o /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/ef91/db4a/tmp2_c42imh/TAD583.2.gaf -c -xasm -t 8"
    [2024-02-22T10:08:38+0800] [MainThread] [W] [toil.lib.humanize] Deprecated toil method.  Please use "toil.lib.conversions.bytes2human()" instead."
    [2024-02-22T10:08:38+0800] [MainThread] [I] [toil-rt] 2024-02-22 10:08:38.644270: Successfully ran: "bash -c set -eo pipefail && minigraph /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/ef91/db4a/tmp2_c42imh/mg.gfa /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/ef91/db4a/tmp2_c42imh/TAD583.2.fa -o /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/ef91/db4a/tmp2_c42imh/TAD583.2.gaf -c -xasm -t 8" in 1194.524 seconds with job-memory 211.5 Gi
    [2024-02-22T10:08:38+0800] [MainThread] [I] [toil-rt] 2024-02-22 10:08:38.648067: Running the command: "bash -c set -eo pipefail && gaf2unstable /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/ef91/db4a/tmp2_c42imh/TAD583.2.gaf -g /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/ef91/db4a/tmp2_c42imh/mg.gfa -o /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/ef91/db4a/tmp2_c42imh/mg.gfa.node_lengths.tsv | gaffilter - -r 5.0 -m 0.25 -q 5 -b 250000 -o 0 -i 0.5"
    [2024-02-22T10:09:04+0800] [MainThread] [W] [toil.lib.humanize] Deprecated toil method.  Please use "toil.lib.conversions.bytes2human()" instead."
    [2024-02-22T10:09:04+0800] [MainThread] [I] [toil-rt] 2024-02-22 10:09:04.650072: Successfully ran: "bash -c set -eo pipefail && gaf2unstable /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/ef91/db4a/tmp2_c42imh/TAD583.2.gaf -g /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/ef91/db4a/tmp2_c42imh/mg.gfa -o /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/ef91/db4a/tmp2_c42imh/mg.gfa.node_lengths.tsv | gaffilter - -r 5.0 -m 0.25 -q 5 -b 250000 -o 0 -i 0.5" in 25.9914 seconds with job-memory 211.5 Gi
    [2024-02-22T10:09:04+0800] [MainThread] [I] [toil-rt] 2024-02-22 10:09:04.685198: Running the command: "bash -c set -eo pipefail && gaf2paf /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/ef91/db4a/tmp2_c42imh/TAD583.2.gaf.unstable -l /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/ef91/db4a/tmp2_c42imh/mg.gfa.node_lengths.tsv | awk 'BEGIN{OFS=" "} {$6="id=_MINIGRAPH_|"$6; print}'"
    [2024-02-22T10:09:07+0800] [MainThread] [W] [toil.lib.humanize] Deprecated toil method.  Please use "toil.lib.conversions.bytes2human()" instead."
    [2024-02-22T10:09:07+0800] [MainThread] [I] [toil-rt] 2024-02-22 10:09:07.059492: Successfully ran: "bash -c set -eo pipefail && gaf2paf /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/ef91/db4a/tmp2_c42imh/TAD583.2.gaf.unstable -l /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/ef91/db4a/tmp2_c42imh/mg.gfa.node_lengths.tsv | awk 'BEGIN{OFS="    "} {$6="id=_MINIGRAPH_|"$6; print}'" in 2.3636 seconds with job-memory 211.5 Gi
    Traceback (most recent call last):
      File "/beegfs/userhome/hcaoad/Software/cactus-bin-v2.7.1/venv-cactus-v2.7.1/lib/python3.10/site-packages/toil/worker.py", line 393, in workerScript
        with deferredFunctionManager.open() as defer:
      File "/beegfs/userhome/hcaoad/.conda/envs/pangenome/lib/python3.10/contextlib.py", line 142, in __exit__
        next(self.gen)
      File "/beegfs/userhome/hcaoad/Software/cactus-bin-v2.7.1/venv-cactus-v2.7.1/lib/python3.10/site-packages/toil/deferred.py", line 193, in open
        self._runOrphanedDeferredFunctions()
      File "/beegfs/userhome/hcaoad/Software/cactus-bin-v2.7.1/venv-cactus-v2.7.1/lib/python3.10/site-packages/toil/deferred.py", line 285, in _runOrphanedDeferredFunctions
        for filename in os.listdir(self.stateDir):
    FileNotFoundError: [Errno 2] No such file or directory: '/beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/deferred'
    [2024-02-22T10:09:08+0800] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host cpu07.hpc.cluster
<=========
The batch system left a non-empty file log/05.cactus-pangenome-batch/toil_3ef730dc-1ebe-473b-8882-c8a89b9caf85.7.76489.err.log:
Log from job "kind-minigraph_map_one/instance-01nbxd0o" follows:
=========>
    Traceback (most recent call last):
      File "/beegfs/userhome/hcaoad/Software/cactus-bin-v2.7.1/venv-cactus-v2.7.1/bin/_toil_worker", line 8, in <module>
        sys.exit(main())
      File "/beegfs/userhome/hcaoad/Software/cactus-bin-v2.7.1/venv-cactus-v2.7.1/lib/python3.10/site-packages/toil/worker.py", line 723, in main
        with in_contexts(options.context):
      File "/beegfs/userhome/hcaoad/.conda/envs/pangenome/lib/python3.10/contextlib.py", line 142, in __exit__
        next(self.gen)
      File "/beegfs/userhome/hcaoad/Software/cactus-bin-v2.7.1/venv-cactus-v2.7.1/lib/python3.10/site-packages/toil/worker.py", line 697, in in_contexts
        with manager:
      File "/beegfs/userhome/hcaoad/Software/cactus-bin-v2.7.1/venv-cactus-v2.7.1/lib/python3.10/site-packages/toil/batchSystems/cleanup_support.py", line 85, in __exit__
        for _ in self.arena.leave():
      File "/beegfs/userhome/hcaoad/Software/cactus-bin-v2.7.1/venv-cactus-v2.7.1/lib/python3.10/site-packages/toil/lib/threading.py", line 575, in leave
        for item in os.listdir(self.lockfileDir):
    FileNotFoundError: [Errno 2] No such file or directory: '/beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/3ef730dc-1ebe-473b-8882-c8a89b9caf85-cleanup-arena-members'
glennhickey commented 5 months ago

I'm not sure. I know that Toil has trouble with some types of filesystems due to latency (@adamnovak may be able to shed light). One thing that may help is to use --workDir or TMPDIR to set the working directory to local storage on your node. The jobstore needs to be on a shared drive accessible by all nodes, but ideally --workDir will point to a local, non-network drive.
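
For example, an invocation might look something like this (just a sketch: the jobstore path, seqfile, reference name, and /scratch/local are placeholders for your cluster's actual shared and node-local paths):

    # Jobstore on the shared drive (visible to all nodes), but per-job
    # scratch redirected to node-local disk via --workDir.
    cactus-pangenome /beegfs/.../jobstore ./seqfile.txt \
        --outDir ./out --outName pangenome --reference CHM13 \
        --batchSystem slurm \
        --workDir /scratch/local/$USER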

Han-Cao commented 5 months ago

Hi @glennhickey,

Thanks for your quick reply. To set the working directory, can I just use the default setting of --workDir? If I understand correctly, it will use the default tmp directory on each node.

I found that some jobs request a large amount of disk space (see the log below), which may exceed the local disk quota of a compute node. If some jobs fail due to insufficient disk space or memory, can I copy the jobstore to another cluster to continue the analysis? I currently run the pipeline on a cluster with many CPU nodes, while another cluster has some nodes with larger disks and more memory.

Issued job 'cactus_cons' kind-cactus_cons/instance-i9dudq3x v1 with job batch system ID: 5970 and disk: 203.8 Gi, memory: 356.7 Gi, cores: 64, accelerators: [], preemptible: False

Update: I found that after I changed --workDir and restarted the pipeline from the jobstore, the new jobs still used the old --workDir. Is it possible to change the parameters of a job when restarting it?

glennhickey commented 5 months ago

Yeah, --workDir will default to $TMPDIR if it's not specified. If $TMPDIR (or $TEMPDIR) isn't set, it'll probably try /tmp.

For copying the jobstore, it usually works so long as all paths remain valid across both systems.
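
For instance, something like this before resuming from the same jobstore (a sketch; it assumes /scratch/local exists on every node and that you resume with Toil's --restart flag; given your observation that restarted jobs kept the old --workDir, some options may be persisted in the jobstore, so this may only take effect on a fresh start):

    # Re-point temp space at node-local scratch, then resume from the jobstore.
    export TMPDIR=/scratch/local/$USER
    cactus-pangenome /beegfs/.../jobstore ./seqfile.txt \
        --outDir ./out --outName pangenome --reference CHM13 \
        --batchSystem slurm --restart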

Han-Cao commented 4 months ago

Hi @glennhickey ,

After more attempts, I found that this issue is likely caused by some jobs removing the temporary directory:

I am wondering whether this is a bug in Toil or whether it only happens on specific file systems. The file system I am using is BeeGFS.

2024-02-26 16:39:31.908599: Running the command: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/63be/c4c3/tmpf9gde10k/chrM.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13"
2024-02-26 16:39:34.082925: Successfully ran: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/63be/c4c3/tmpf9gde10k/chrM.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13" in 2.1653 seconds with job-memory 2.0 Gi
2024-02-26 16:39:34.084705: Running the command: "vg validate /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/63be/c4c3/tmpf9gde10k/chrM.vg.clip"
2024-02-26 16:39:34.249867: Successfully ran: "vg validate /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/63be/c4c3/tmpf9gde10k/chrM.vg.clip" in 0.1615 seconds
2024-02-26 16:41:02.707416: Running the command: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/672c/9b45/tmp_oc3v9cs/chr15.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13"
2024-02-26 16:41:08.414905: Running the command: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/10bd/3b9e/tmphowz5n5e/chr21.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13"
2024-02-26 16:41:20.233867: Running the command: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/6d41/f2bd/tmpag1u4gfr/chr20.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13"
2024-02-26 16:41:21.075434: Running the command: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/5b3c/b95c/tmpgkvjc26y/chrX.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13"
2024-02-26 16:41:33.509710: Running the command: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/679b/862c/tmph82pgj1a/chr17.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13"
2024-02-26 16:41:33.708608: Running the command: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/b2c6/1274/tmp2iqerfgv/chr18.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13"
2024-02-26 16:41:50.590927: Running the command: "vg convert -f -Q CHM13 chr20.vg -B"
2024-02-26 16:41:55.255537: Running the command: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/4d1e/9f9f/tmp475w5h92/chr16.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13"
2024-02-26 16:42:11.614775: Running the command: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/4284/7efc/tmpqm99j5wf/chr14.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13"
2024-02-26 16:42:19.741044: Running the command: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/8152/93ad/tmphi55arsv/chr4.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13"
2024-02-26 16:42:21.170092: Running the command: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/bdf4/aa0c/tmpsp6vldcn/chr19.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13"
2024-02-26 16:42:34.218217: Running the command: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/dc65/452b/tmp4d6s_gjw/chr2.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13"
2024-02-26 16:42:37.665396: Running the command: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/287b/5072/tmpfb9rtp2r/chr22.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13"
2024-02-26 16:42:45.373441: Running the command: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/63ef/eab1/tmpz8p0retl/chr12.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13"
2024-02-26 16:42:54.287884: Running the command: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/3cdb/4f96/tmp6k34u2pi/chr11.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13"
2024-02-26 16:43:12.043061: Running the command: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/0b90/870e/tmpfpxhxd48/chr13.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13"
2024-02-26 16:43:12.850625: Running the command: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/4983/f681/tmpd0rxrgas/chr10.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13"
2024-02-26 16:43:17.523083: Running the command: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/a04c/be20/tmptodx_5ky/chr9.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13"
2024-02-26 16:43:25.948724: Running the command: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/8425/b46f/tmpa4p1nmhn/chr3.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13"
2024-02-26 16:43:26.224188: Running the command: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/6a63/37dd/tmpucmnqco7/chr8.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13"
2024-02-26 16:43:39.669467: Running the command: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/57a5/79c2/tmpag6g1k7a/chr5.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13"
2024-02-26 16:43:42.157875: Running the command: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/48e9/43e7/tmp80zxf8yl/chr6.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13"
2024-02-26 16:43:47.830211: Running the command: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/8448/fea1/tmp21sbu1mp/chr1.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13"
2024-02-26 16:43:52.858938: Running the command: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/4873/dfb3/tmp4gjp8yds/chr7.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13"
2024-02-26 16:43:57.460661: Successfully ran: "vg convert -f -Q CHM13 chr20.vg -B" in 126.8656 seconds with job-memory 35.7 Gi
2024-02-26 16:45:10.074683: Running the command: "vg convert -f -Q CHM13 chr4.vg -B"
2024-02-26 16:47:50.176583: Successfully ran: "vg convert -f -Q CHM13 chr4.vg -B" in 160.0942 seconds with job-memory 83.1 Gi
2024-02-26 16:48:57.704582: Running the command: "vg convert -f -Q CHM13 chr22.vg -B"
2024-02-26 16:50:02.769211: Successfully ran: "vg convert -f -Q CHM13 chr22.vg -B" in 65.0582 seconds with job-memory 39.6 Gi
2024-02-26 16:50:26.854185: Running the command: "vg convert -f -Q CHM13 chrM.vg -B"
2024-02-26 16:50:27.158834: Successfully ran: "vg convert -f -Q CHM13 chrM.vg -B" in 0.2933 seconds with job-memory 2.0 Gi
2024-02-26 16:51:09.365278: Running the command: "vg convert -f -Q CHM13 chr10.vg -B"
2024-02-26 16:52:57.430023: Successfully ran: "vg convert -f -Q CHM13 chr10.vg -B" in 108.0583 seconds with job-memory 63.1 Gi
2024-02-26 16:54:10.043125: Running the command: "vg convert -f -Q CHM13 chr7.vg -B"
2024-02-26 16:56:28.361318: Successfully ran: "vg convert -f -Q CHM13 chr7.vg -B" in 138.3119 seconds with job-memory 74.4 Gi
2024-02-26 16:57:52.052650: Running the command: "vg convert -f -Q CHM13 chr9.vg -B"
2024-02-26 16:59:51.345910: Successfully ran: "vg convert -f -Q CHM13 chr9.vg -B" in 119.2871 seconds with job-memory 70.0 Gi
2024-02-26 17:00:36.733496: Successfully ran: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/10bd/3b9e/tmphowz5n5e/chr21.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13" in 1168.253 seconds with job-memory 34.1 Gi
2024-02-26 17:00:36.756499: Running the command: "vg validate /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/10bd/3b9e/tmphowz5n5e/chr21.vg.clip"
2024-02-26 17:00:59.150064: Running the command: "vg convert -f -Q CHM13 chr16.vg -B"
2024-02-26 17:01:06.209574: Successfully ran: "vg validate /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/10bd/3b9e/tmphowz5n5e/chr21.vg.clip" in 29.4461 seconds
2024-02-26 17:01:54.333900: Running the command: "vg convert -f -Q CHM13 chr18.vg -B"
2024-02-26 17:02:17.431845: Successfully ran: "vg convert -f -Q CHM13 chr16.vg -B" in 78.2751 seconds with job-memory 47.9 Gi
2024-02-26 17:02:50.851151: Successfully ran: "vg convert -f -Q CHM13 chr18.vg -B" in 56.5111 seconds with job-memory 35.6 Gi
2024-02-26 17:03:30.421374: Running the command: "vg convert -f -Q CHM13 chr12.vg -B"
2024-02-26 17:03:44.673152: Running the command: "vg convert -f -Q CHM13 chrX.vg -B"
2024-02-26 17:05:03.479028: Successfully ran: "vg convert -f -Q CHM13 chrX.vg -B" in 78.7992 seconds with job-memory 42.0 Gi
2024-02-26 17:05:12.876115: Successfully ran: "vg convert -f -Q CHM13 chr12.vg -B" in 102.4483 seconds with job-memory 58.0 Gi
2024-02-26 17:06:36.579530: Running the command: "vg convert -f -Q CHM13 chr1.vg -B"
2024-02-26 17:06:37.345167: Running the command: "vg convert -f -Q CHM13 chr8.vg -B"
2024-02-26 17:08:35.924742: Successfully ran: "vg convert -f -Q CHM13 chr8.vg -B" in 118.5726 seconds with job-memory 64.8 Gi
2024-02-26 17:09:51.412122: Running the command: "vg convert -f -Q CHM13 chr11.vg -B"
2024-02-26 17:09:58.451573: Successfully ran: "vg convert -f -Q CHM13 chr1.vg -B" in 201.8654 seconds with job-memory 108.4 Gi
2024-02-26 17:10:52.383899: Successfully ran: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/b2c6/1274/tmp2iqerfgv/chr18.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13" in 1758.5945 seconds with job-memory 48.9 Gi
2024-02-26 17:10:52.396398: Running the command: "vg validate /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/b2c6/1274/tmp2iqerfgv/chr18.vg.clip"
2024-02-26 17:11:05.877312: Successfully ran: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/6d41/f2bd/tmpag1u4gfr/chr20.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13" in 1785.5845 seconds with job-memory 49.1 Gi
2024-02-26 17:11:05.895307: Running the command: "vg validate /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/6d41/f2bd/tmpag1u4gfr/chr20.vg.clip"
2024-02-26 17:11:30.220293: Running the command: "vg convert -f -Q CHM13 chr14.vg -B"
2024-02-26 17:11:32.003119: Successfully ran: "vg convert -f -Q CHM13 chr11.vg -B" in 100.5843 seconds with job-memory 59.2 Gi
2024-02-26 17:11:35.888857: Successfully ran: "vg validate /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/b2c6/1274/tmp2iqerfgv/chr18.vg.clip" in 43.4864 seconds
2024-02-26 17:11:46.507269: Successfully ran: "vg validate /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/6d41/f2bd/tmpag1u4gfr/chr20.vg.clip" in 40.5893 seconds
2024-02-26 17:12:07.950171: Successfully ran: "bash -c set -eo pipefail && vg clip /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/bdf4/aa0c/tmpsp6vldcn/chr19.vg -d 21 -P CHM13 -P GRCh38 -m 1000 | vg clip -d 1 - -P CHM13 -P GRCh38 | vg clip -sS - -P CHM13" in 1786.7416 seconds with job-memory 51.7 Gi
2024-02-26 17:12:07.952026: Running the command: "vg validate /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/bdf4/aa0c/tmpsp6vldcn/chr19.vg.clip"
Job failed with exit value 1: 'vg_to_gfa' kind-vg_to_gfa/instance-fl8nm46z v1
Exit reason: None
Despite the batch system claiming failure the job 'vg_to_gfa' kind-vg_to_gfa/instance-fl8nm46z v1 seems to have finished and been removed
Job failed with exit value 1: 'vg_clip_vg' kind-vg_clip_vg/instance-tbhxirs_ v1
Exit reason: None
Despite the batch system claiming failure the job 'vg_clip_vg' kind-vg_clip_vg/instance-tbhxirs_ v1 seems to have finished and been removed
2024-02-26 17:12:52.646236: Successfully ran: "vg validate /beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/bdf4/aa0c/tmpsp6vldcn/chr19.vg.clip" in 44.6871 seconds
2024-02-26 17:12:55.666479: Running the command: "vg convert -f -Q CHM13 chr15.vg -B"
2024-02-26 17:13:10.564111: Running the command: "vg convert -f -Q CHM13 chr3.vg -B"
2024-02-26 17:13:12.573754: Successfully ran: "vg convert -f -Q CHM13 chr14.vg -B" in 102.3458 seconds with job-memory 54.7 Gi
2024-02-26 17:13:16.454668: Running the command: "vg convert -f -Q CHM13 chr6.vg -B"
Job failed with exit value 1: 'vg_to_gfa' kind-vg_to_gfa/instance-6q03f7fr v1
Exit reason: None
The job seems to have left a log file, indicating failure: 'vg_to_gfa' kind-vg_to_gfa/instance-6q03f7fr v2
Log from job "kind-vg_to_gfa/instance-6q03f7fr" follows:
=========>
    [2024-02-26T17:10:57+0800] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
    [2024-02-26T17:10:58+0800] [MainThread] [I] [toil] Running Toil version 6.0.0-0e2a07a20818e593bfdfde3cc51ca4ad809fde96 on host cpu06.hpc.cluster.
    [2024-02-26T17:10:58+0800] [MainThread] [I] [toil.worker] Working on job 'vg_to_gfa' kind-vg_to_gfa/instance-6q03f7fr v1
    [2024-02-26T17:10:59+0800] [MainThread] [I] [toil.worker] Loaded body Job('vg_to_gfa' kind-vg_to_gfa/instance-6q03f7fr v1) from description 'vg_to_gfa' kind-vg_to_gfa/instance-6q03f7fr v1
    [2024-02-26T17:11:30+0800] [MainThread] [I] [cactus.shared.common] Running the command ['vg', 'convert', '-f', '-Q', 'CHM13', 'chr14.vg', '-B']
    [2024-02-26T17:11:30+0800] [MainThread] [I] [toil-rt] 2024-02-26 17:11:30.220293: Running the command: "vg convert -f -Q CHM13 chr14.vg -B"
    [2024-02-26T17:13:12+0800] [MainThread] [W] [toil.lib.humanize] Deprecated toil method.  Please use "toil.lib.conversions.bytes2human()" instead."
    [2024-02-26T17:13:12+0800] [MainThread] [I] [toil-rt] 2024-02-26 17:13:12.573754: Successfully ran: "vg convert -f -Q CHM13 chr14.vg -B" in 102.3458 seconds with job-memory 54.7 Gi
    Traceback (most recent call last):
      File "/beegfs/userhome/hcaoad/Software/cactus-bin-v2.7.1/venv-cactus-v2.7.1/lib/python3.10/site-packages/toil/worker.py", line 393, in workerScript
        with deferredFunctionManager.open() as defer:
      File "/beegfs/userhome/hcaoad/.conda/envs/pangenome/lib/python3.10/contextlib.py", line 142, in __exit__
        next(self.gen)
      File "/beegfs/userhome/hcaoad/Software/cactus-bin-v2.7.1/venv-cactus-v2.7.1/lib/python3.10/site-packages/toil/deferred.py", line 193, in open
        self._runOrphanedDeferredFunctions()
      File "/beegfs/userhome/hcaoad/Software/cactus-bin-v2.7.1/venv-cactus-v2.7.1/lib/python3.10/site-packages/toil/deferred.py", line 285, in _runOrphanedDeferredFunctions
        for filename in os.listdir(self.stateDir):
    FileNotFoundError: [Errno 2] No such file or directory: '/beegfs/userhome/hcaoad/project/AD_LRS/pangenome/tmp/2e59a124d469578aa574ce7e253850fa/deferred'
    [2024-02-26T17:13:37+0800] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host cpu06.hpc.cluster
<=========
The batch system left a non-empty file log/06.cactus-pangenome-batch/toil_3ef730dc-1ebe-473b-8882-c8a89b9caf85.39.76550.err.log:
Log from job "kind-vg_to_gfa/instance-6q03f7fr" follows:
=========>
    XDG_RUNTIME_DIR is set to nonexistent directory /run/user/41003; your environment may be out of spec!
    [2024-02-22T13:22:46+0800] [MainThread] [W] [toil.common] XDG_RUNTIME_DIR is set to nonexistent directory /run/user/41003; your environment may be out of spec!
<=========
The batch system left an empty file log/06.cactus-pangenome-batch/toil_3ef730dc-1ebe-473b-8882-c8a89b9caf85.39.83181.out.log
The batch system left an empty file log/06.cactus-pangenome-batch/toil_3ef730dc-1ebe-473b-8882-c8a89b9caf85.39.83038.out.log
The batch system left an empty file log/06.cactus-pangenome-batch/toil_3ef730dc-1ebe-473b-8882-c8a89b9caf85.39.75924.out.log
The batch system left a non-empty file log/06.cactus-pangenome-batch/toil_3ef730dc-1ebe-473b-8882-c8a89b9caf85.39.83038.err.log:
Log from job "kind-vg_to_gfa/instance-6q03f7fr" follows:
=========>
    [2024-02-25T00:37:08+0800] [MainThread] [W] [toil.common] XDG_RUNTIME_DIR is set to nonexistent directory /run/user/41003; your environment may be out of spec!
    [2024-02-25T00:37:08+0800] [MainThread] [W] [toil.common] XDG_RUNTIME_DIR is set to nonexistent directory /run/user/41003; your environment may be out of spec!
<=========
The batch system left an empty file log/06.cactus-pangenome-batch/toil_3ef730dc-1ebe-473b-8882-c8a89b9caf85.39.82614.err.log
The batch system left an empty file log/06.cactus-pangenome-batch/toil_3ef730dc-1ebe-473b-8882-c8a89b9caf85.39.82614.out.log
The batch system left an empty file log/06.cactus-pangenome-batch/toil_3ef730dc-1ebe-473b-8882-c8a89b9caf85.39.83181.err.log
The batch system left an empty file log/06.cactus-pangenome-batch/toil_3ef730dc-1ebe-473b-8882-c8a89b9caf85.39.75924.err.log
The batch system left an empty file log/06.cactus-pangenome-batch/toil_3ef730dc-1ebe-473b-8882-c8a89b9caf85.39.76550.out.log
Due to failure we are reducing the remaining try count of job 'vg_to_gfa' kind-vg_to_gfa/instance-6q03f7fr v2 with ID kind-vg_to_gfa/instance-6q03f7fr to 1
adamnovak commented 4 months ago

This smells like BeeGFS might not actually implement the fcntl-based file locking that Toil uses. Each job thinks it is responsible for cleaning up that directory because it can't tell that the other jobs are still alive and holding locks on the files in it.
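
One way to sanity-check this (a rough sketch using Python's fcntl module, which as far as I know is the same primitive Toil relies on; the test path is a placeholder) is to run the snippet below on two different nodes against the same BeeGFS file. If the second copy also reports the lock acquired while the first is still holding it, the filesystem isn't enforcing fcntl locks:

    # Take a non-blocking exclusive fcntl lock on a shared file and hold it.
    python3 - /beegfs/userhome/$USER/locktest <<'EOF'
    import fcntl, sys, time
    with open(sys.argv[1], 'w') as f:
        # Raises BlockingIOError if another live process holds the lock
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print('lock acquired; holding for 60s')
        time.sleep(60)
    EOF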

You should set --coordinationDir to a filesystem that implements file locking. You might also need to set --workDir similarly, but we are now meant to be able to get away with that being on weirder storage as long as the --coordinationDir is well-behaved.
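
For example, appended to your existing command line (a sketch; /scratch/local is a placeholder for whatever node-local, lock-capable path your nodes have, and the coordination directory only needs to exist on each node, not be shared):

    # Keep Toil's locking/coordination state off BeeGFS.
    cactus-pangenome ... \
        --coordinationDir /scratch/local/$USER \
        --workDir /scratch/local/$USER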