ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Cactus-pangenome fails at odgi squeeze when trying to open "nonexistent" file #1378

Closed: briannadon closed this issue 1 month ago

briannadon commented 1 month ago

I am running an alignment of about 90 human pangenome samples with cactus-pangenome, and the pipeline fails nearly every time at the odgi squeeze stage. My region is fairly small, about 50 kbp, so I don't think it's a memory or resource problem. Oddly, about a week ago, on the same machine with the same data and the same install of cactus, it ran all the way through, so I suspect a problem with Toil's job partitioning.

My command is:

cactus-pangenome ./js ./dr_pg_cactus.txt --outDir output --outName DR_pangenome \
    --reference grch38 --gfa --odgi --chrom-og --viz clip --clip 10000 \
    --clean onSuccess --cleanWorkDir onSuccess --scale 0.25 \
    --retryCount 3 --workDir ./workdir

And the relevant error in the output log is:

[2024-05-07T16:11:12-0700] [MainThread] [I] [toil.leader] Failed jobs at end of the run: 'join_vg' kind-Job/instance-k4h56grx v11 'sanitize_fasta_headers' kind-pangenome_end_to_end_workflow/instance-_ac25_c0 v6 'export_split_wrapper' kind-export_split_wrapper/instance-07hu1r2q v3 'Job' kind-export_graphmap_wrapper/instance-ub0c60ji v11 'odgi_squeeze' kind-odgi_squeeze/instance-2pwnsjkg v12 'mash_sketch' kind-minigraph_construct_workflow/instance-fdk26zzx v9 'Job' kind-minigraph_workflow/instance-v7tgs1ui v5 'Job' kind-Job/instance-p1tpjizi v2 'Job' kind-make_batch_align_jobs_wrapper/instance-0ineki5e v9 'graphmap_join_workflow' kind-export_align_wrapper/instance-o2m3e220 v6
[2024-05-07T16:11:12-0700] [MainThread] [I] [toil.realtimeLogger] Stopping real-time logging server.
[2024-05-07T16:11:12-0700] [MainThread] [I] [toil.realtimeLogger] Joining real-time logging server thread.
Traceback (most recent call last):
  File "/mnt/results/bnadon/cactus/cactus/cactus_venv/bin/cactus-pangenome", line 8, in <module>
    sys.exit(main())
  File "/mnt/results/bnadon/cactus/cactus/cactus_venv/lib/python3.9/site-packages/cactus/refmap/cactus_pangenome.py", line 220, in main
    toil.start(Job.wrapJobFn(pangenome_end_to_end_workflow, options, config_wrapper, input_seq_id_map, input_path_map, input_seq_order))
  File "/mnt/results/bnadon/cactus/cactus/cactus_venv/lib/python3.9/site-packages/toil/common.py", line 915, in start
    return self._runMainLoop(rootJobDescription)
  File "/mnt/results/bnadon/cactus/cactus/cactus_venv/lib/python3.9/site-packages/toil/common.py", line 1391, in _runMainLoop
    return Leader(config=self.config,
  File "/mnt/results/bnadon/cactus/cactus/cactus_venv/lib/python3.9/site-packages/toil/leader.py", line 295, in run
    raise FailedJobsException(self.jobStore, failed_jobs, exit_code=self.recommended_fail_exit_code)
toil.exceptions.FailedJobsException: The job store '/mnt/results/bnadon/cactus/drb_pan/js' contains 10 failed jobs: 'join_vg' kind-Job/instance-k4h56grx v11, 'sanitize_fasta_headers' kind-pangenome_end_to_end_workflow/instance-_ac25_c0 v6, 'export_split_wrapper' kind-export_split_wrapper/instance-07hu1r2q v3, 'Job' kind-export_graphmap_wrapper/instance-ub0c60ji v11, 'odgi_squeeze' kind-odgi_squeeze/instance-2pwnsjkg v12, 'mash_sketch' kind-minigraph_construct_workflow/instance-fdk26zzx v9, 'Job' kind-minigraph_workflow/instance-v7tgs1ui v5, 'Job' kind-Job/instance-p1tpjizi v2, 'Job' kind-make_batch_align_jobs_wrapper/instance-0ineki5e v9, 'graphmap_join_workflow' kind-export_align_wrapper/instance-o2m3e220 v6
Log from job "'odgi_squeeze' kind-odgi_squeeze/instance-2pwnsjkg v12" follows:
=========>
    [2024-05-07T16:09:45-0700] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
    [2024-05-07T16:09:45-0700] [MainThread] [I] [toil] Running Toil version 6.1.0-3f9cba3766e52866ea80d0934498f8c8f3129c3f on host tdx-davinci.
    [2024-05-07T16:09:45-0700] [MainThread] [I] [toil.worker] Working on job 'odgi_squeeze' kind-odgi_squeeze/instance-2pwnsjkg v10
    [2024-05-07T16:09:46-0700] [MainThread] [I] [toil.worker] Loaded body Job('odgi_squeeze' kind-odgi_squeeze/instance-2pwnsjkg v10) from description 'odgi_squeeze' kind-odgi_squeeze/instance-2pwnsjkg v10
    [2024-05-07T16:09:46-0700] [MainThread] [I] [cactus.shared.common] Work dirs: {'/mnt/results/bnadon/cactus/drb_pan/workdir/toilwf-55733a5be1cf5033b76e0a03c516bb60/7417/b0dd/tmpu364gnhz'}
    [2024-05-07T16:09:46-0700] [MainThread] [I] [cactus.shared.common] Docker work dir: /mnt/results/bnadon/cactus/drb_pan/workdir/toilwf-55733a5be1cf5033b76e0a03c516bb60/7417/b0dd/tmpu364gnhz
    [2024-05-07T16:09:46-0700] [MainThread] [I] [cactus.shared.common] Running the command ['docker', 'run', '--interactive', '--net=host', '--log-driver=none', '-u', '1010:1010', '-v', '/mnt/results/bnadon/cactus/drb_pan/workdir/toilwf-55733a5be1cf5033b76e0a03c516bb60/7417/b0dd/tmpu364gnhz:/data', '--entrypoint', '/opt/cactus/wrapper.sh', '--name', '0e52db0b-5374-4c0d-9289-2289b58fe719', '--rm', 'quay.io/comparative-genomics-toolkit/cactus:v2.8.1', 'odgi', 'squeeze', '-f', 'full.squeeze.input', '-o', 'full.og', '-t', '47']
    [2024-05-07T16:09:46-0700] [MainThread] [I] [toil-rt] 2024-05-07 16:09:46.303085: Running the command: "docker run --interactive --net=host --log-driver=none -u 1010:1010 -v /mnt/results/bnadon/cactus/drb_pan/workdir/toilwf-55733a5be1cf5033b76e0a03c516bb60/7417/b0dd/tmpu364gnhz:/data --entrypoint /opt/cactus/wrapper.sh --name 0e52db0b-5374-4c0d-9289-2289b58fe719 --rm quay.io/comparative-genomics-toolkit/cactus:v2.8.1 odgi squeeze -f full.squeeze.input -o full.og -t 47"
    [2024-05-07T16:09:46-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
    [2024-05-07T16:09:46-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-Job/instance-k4h56grx/file-b1f59e82e9fc4b4dad8f22b05b91366a/dr_region.vg.og' to path '/mnt/results/bnadon/cactus/drb_pan/workdir/toilwf-55733a5be1cf5033b76e0a03c516bb60/7417/b0dd/tmpu364gnhz/dr_region.full.og'
    Traceback (most recent call last):
      File "/mnt/results/bnadon/cactus/cactus/cactus_venv/lib/python3.9/site-packages/toil/worker.py", line 409, in workerScript
        job._runner(jobGraph=None, jobStore=jobStore, fileStore=fileStore, defer=defer)
      File "/mnt/results/bnadon/cactus/cactus/cactus_venv/lib/python3.9/site-packages/toil/job.py", line 2845, in _runner
        returnValues = self._run(jobGraph=None, fileStore=fileStore)
      File "/mnt/results/bnadon/cactus/cactus/cactus_venv/lib/python3.9/site-packages/toil/job.py", line 2761, in _run
        return self.run(fileStore)
      File "/mnt/results/bnadon/cactus/cactus/cactus_venv/lib/python3.9/site-packages/toil/job.py", line 2990, in run
        rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
      File "/mnt/results/bnadon/cactus/cactus/cactus_venv/lib/python3.9/site-packages/cactus/refmap/cactus_graphmap_join.py", line 1067, in odgi_squeeze
        cactus_call(parameters=['odgi', 'squeeze', '-f', list_path, '-o', merged_path, '-t', str(job.cores)], job_memory=job.memory)
      File "/mnt/results/bnadon/cactus/cactus/cactus_venv/lib/python3.9/site-packages/cactus/shared/common.py", line 910, in cactus_call
        raise RuntimeError("{}Command {} exited {}: {}".format(sigill_msg, call, process.returncode, out))
    RuntimeError: Command ['docker', 'run', '--interactive', '--net=host', '--log-driver=none', '-u', '1010:1010', '-v', '/mnt/results/bnadon/cactus/drb_pan/workdir/toilwf-55733a5be1cf5033b76e0a03c516bb60/7417/b0dd/tmpu364gnhz:/data', '--entrypoint', '/opt/cactus/wrapper.sh', '--name', '0e52db0b-5374-4c0d-9289-2289b58fe719', '--rm', 'quay.io/comparative-genomics-toolkit/cactus:v2.8.1', 'odgi', 'squeeze', '-f', 'full.squeeze.input', '-o', 'full.og', '-t', '47'] exited 1: stderr=Running command catchsegv 'odgi' 'squeeze' '-f' 'full.squeeze.input' '-o' 'full.og' '-t' '47'
    [odgi::squeeze] error: the given file "/mnt/results/bnadon/cactus/drb_pan/workdir/toilwf-55733a5be1cf5033b76e0a03c516bb60/7417/b0dd/tmpu364gnhz/dr_region.full.og" does not exist. Please specify an existing input file in ODGI format via -i=[FILE], --idx=[FILE].

    [2024-05-07T16:09:46-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host tdx-davinci
<=========

The error claims that /mnt/results/bnadon/cactus/drb_pan/workdir/toilwf-55733a5be1cf5033b76e0a03c516bb60/7417/b0dd/tmpu364gnhz/dr_region.full.og does not exist. However, when I check that directory on the host, the file is there.

Restarting the job with --restart does not work and results in the same error.

Any advice?

glennhickey commented 1 month ago

This is a bug -- thanks for raising it. You can work around it by not using --binariesMode docker. So either

briannadon commented 1 month ago

Thanks for the prompt reply. I actually did start a run entirely within docker before posting this issue and you are correct - it worked.
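
A note on why the file can exist on the host and still be reported missing: in the log above, the job's working directory is bind-mounted into the container at /data, while odgi complains about the host-side /mnt/... path. One plausible reading, which is an assumption and is not confirmed anywhere in this thread, is that the list file passed with -f contains absolute host paths, which a process inside the container cannot open. The Python sketch below only illustrates that path mismatch. The function names and the /host/workdir/tmpjob directory are hypothetical and are not part of cactus, Toil, or odgi.

```python
from pathlib import PurePosixPath


def to_container_path(host_path, host_dir, container_dir="/data"):
    """Map a host path under host_dir to where it appears inside the container.

    Returns None when host_path is outside host_dir, meaning the file is not
    visible through this bind mount at all.
    """
    try:
        relative = PurePosixPath(host_path).relative_to(host_dir)
    except ValueError:
        return None
    return str(PurePosixPath(container_dir) / relative)


def host_only_entries(entries, container_dir="/data"):
    """Return list-file entries that are absolute paths outside container_dir.

    Such entries name locations that exist only on the host, so a tool
    running inside the container would report them as missing.
    """
    host_only = []
    for entry in entries:
        path = PurePosixPath(entry)
        if not path.is_absolute():
            # Relative names are resolved inside the container, so skip them.
            continue
        try:
            path.relative_to(container_dir)
        except ValueError:
            host_only.append(entry)
    return host_only


# Hypothetical work dir standing in for the long toilwf-... path in the log.
HOST_DIR = "/host/workdir/tmpjob"
entries = [HOST_DIR + "/dr_region.full.og", "other.og", "/data/third.og"]

print(host_only_entries(entries))
print(to_container_path(entries[0], HOST_DIR))
```

Under this reading, the first entry is flagged because it is only valid on the host, and its container-side equivalent is /data/dr_region.full.og. That would also be consistent with the run inside docker succeeding, since there the tools and the files share one filesystem view.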