ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Job used more disk than requested. #1494

Open jeramiahsmith opened 1 month ago

jeramiahsmith commented 1 month ago

I had an alignment of 5 large chromosomes that terminated, apparently due to writing a file that was too large (1.1 TiB). We are able to write files larger than this on the disks we are using, so I wanted to raise this as a potential issue with the code while I explore it further. I am including what I think is enough of the run log below.

[2024-09-28T15:46:04-0400] [MainThread] [I] [toil-rt] 2024-09-28 15:46:04.873937: Successfully ran: "paffy filter -i /scratch/jjsmit3/compar/cactus2/tmp/toilwf-6e7bff2e77ae58f6badf2109cb081918/d419/job/tmp155e9ymy/primary_chain_L.paf --minChainScore 10000" in 32.2058 seconds with job-memory 3.9 Ti
[2024-09-28T15:46:22-0400] [MainThread] [I] [toil.leader] Finished toil run with 8 failed jobs.
[2024-09-28T15:46:22-0400] [MainThread] [I] [toil.leader] Failed jobs at end of the run: 'chain_one_alignment' kind-chain_one_alignment/instance-9c2cwbjx v6 'Job' kind-chain_alignments/instance-8z2s7iwm v4 'Job' kind-make_paf_alignments/instance-ev2qi46u v4 'EncapsulatedJob' kind-EncapsulatedJob/instance-j2jsdcps v2 'progressive_step' kind-progressive_schedule/instance-q1fcutwd v7 'chain_alignments_splitting_ingroups_and_outgroups' kind-chain_alignments_splitting_ingroups_and_outgroups/instance-be7s_wqo v3 'sanitize_fasta_headers' kind-progressive_workflow/instance-_sc6ytkj v4 'Job' kind-preprocess_all/instance-fjx98l78 v5
[2024-09-28T15:46:22-0400] [MainThread] [I] [toil.realtimeLogger] Stopping real-time logging server.
[2024-09-28T15:46:22-0400] [MainThread] [I] [toil.realtimeLogger] Joining real-time logging server thread.
Traceback (most recent call last):
  File "/scratch/jjsmit3/bin/cactus-bin-v2.9.0/venv-cactus-v2.9.0/bin/cactus", line 8, in
    sys.exit(main())
  File "/scratch/jjsmit3/bin/cactus-bin-v2.9.0/venv-cactus-v2.9.0/lib/python3.10/site-packages/cactus/progressive/cactus_progressive.py", line 455, in main
    hal_id = toil.start(Job.wrapJobFn(progressive_workflow, options, config_node, mc_tree, og_map, input_seq_id_map))
  File "/scratch/jjsmit3/bin/cactus-bin-v2.9.0/venv-cactus-v2.9.0/lib/python3.10/site-packages/toil/common.py", line 930, in start
    return self._runMainLoop(rootJobDescription)
  File "/scratch/jjsmit3/bin/cactus-bin-v2.9.0/venv-cactus-v2.9.0/lib/python3.10/site-packages/toil/common.py", line 1417, in _runMainLoop
    jobCache=self._jobCache).run()
  File "/scratch/jjsmit3/bin/cactus-bin-v2.9.0/venv-cactus-v2.9.0/lib/python3.10/site-packages/toil/leader.py", line 304, in run
    raise FailedJobsException(self.jobStore, failed_jobs, exit_code=self.recommended_fail_exit_code)
toil.exceptions.FailedJobsException: The job store '/scratch/jjsmit3/compar/cactus2/js' contains 8 failed jobs: 'chain_one_alignment' kind-chain_one_alignment/instance-9c2cwbjx v6, 'Job' kind-chain_alignments/instance-8z2s7iwm v4, 'Job' kind-make_paf_alignments/instance-ev2qi46u v4, 'EncapsulatedJob' kind-EncapsulatedJob/instance-j2jsdcps v2, 'progressive_step' kind-progressive_schedule/instance-q1fcutwd v7, 'chain_alignments_splitting_ingroups_and_outgroups' kind-chain_alignments_splitting_ingroups_and_outgroups/instance-be7s_wqo v3, 'sanitize_fasta_headers' kind-progressive_workflow/instance-_sc6ytkj v4, 'Job' kind-preprocess_all/instance-fjx98l78 v5
Log from job "'chain_one_alignment' kind-chain_one_alignment/instance-9c2cwbjx v6" follows:
=========> [2024-09-27T15:41:35-0400] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
[2024-09-27T15:41:35-0400] [MainThread] [I] [toil] Running Toil version 7.0.0-d569ea5711eb310ffd5703803f7250ebf7c19576 on host frome001.
[2024-09-27T15:41:35-0400] [MainThread] [I] [toil.worker] Working on job 'chain_one_alignment' kind-chain_one_alignment/instance-9c2cwbjx v4
[2024-09-27T15:41:35-0400] [MainThread] [I] [toil.worker] Loaded body Job('chain_one_alignment' kind-chain_one_alignment/instance-9c2cwbjx v4) from description 'chain_one_alignment' kind-chain_one_alignment/instance-9c2cwbjx v4
[2024-09-27T15:49:05-0400] [MainThread] [I] [cactus.shared.common] Running the command ['paffy', 'invert', '-i', '/scratch/jjsmit3/compar/cactus2/tmp/toilwf-6e7bff2e77ae58f6badf2109cb081918/3903/job/tmpzyr_2xvq/L-LH_vs_LV.inv.paf']
[2024-09-27T15:49:05-0400] [MainThread] [I] [toil-rt] 2024-09-27 15:49:05.111666: Running the command: "paffy invert -i /scratch/jjsmit3/compar/cactus2/tmp/toilwf-6e7bff2e77ae58f6badf2109cb081918/3903/job/tmpzyr_2xvq/L-LH_vs_LV.inv.paf"
[2024-09-27T21:07:42-0400] [MainThread] [W] [toil.lib.humanize] Deprecated toil method. Please use "toil.lib.conversions.bytes2human()" instead."
[2024-09-27T21:07:42-0400] [MainThread] [I] [toil-rt] 2024-09-27 21:07:42.648092: Successfully ran: "paffy invert -i /scratch/jjsmit3/compar/cactus2/tmp/toilwf-6e7bff2e77ae58f6badf2109cb081918/3903/job/tmpzyr_2xvq/L-LH_vs_LV.inv.paf" in 19117.5344 seconds with job-memory 3.9 Ti
[2024-09-27T21:07:42-0400] [MainThread] [W] [root] Deprecated toil method. Please call "logging.getLevelName" directly.
[2024-09-27T21:07:42-0400] [MainThread] [I] [toil-rt] 2024-09-27 21:07:42.658909: Running the command: "paffy chain -i /scratch/jjsmit3/compar/cactus2/tmp/toilwf-6e7bff2e77ae58f6badf2109cb081918/3903/job/tmpzyr_2xvq/L-LH_vs_LV.paf --maxGapLength 1000000 --chainGapOpen 5000 --chainGapExtend 1 --trimFraction 1.0 --logLevel INFO"
[2024-09-27T23:54:40-0400] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
[2024-09-27T23:54:40-0400] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-combine_chunks/instance-kv2sq_kg/file-77d0e71687c041d78a9c0a3792f32448/tmp0mjascex.tmp' to path '/scratch/jjsmit3/compar/cactus2/tmp/toilwf-6e7bff2e77ae58f6badf2109cb081918/3903/job/tmpzyr_2xvq/L-LH_vs_LV.paf'
[2024-09-27T23:54:41-0400] [MainThread] [W] [toil.fileStores.abstractFileStore] LOG-TO-MASTER: Job used more disk than requested. For CWL, consider increasing the outdirMin requirement, otherwise, consider increasing the disk requirement. Job 'chain_one_alignment' kind-chain_one_alignment/instance-9c2cwbjx v4 used 100.00% disk (1.1 TiB [1160905360384B] used, 1.1 TiB [1160905270440B] requested).
[2024-09-27T23:54:41-0400] [MainThread] [C] [toil.worker] Worker crashed with traceback:
Traceback (most recent call last):
  File "/scratch/jjsmit3/bin/cactus-bin-v2.9.0/venv-cactus-v2.9.0/lib/python3.10/site-packages/toil/worker.py", line 438, in workerScript
    job._runner(jobGraph=None, jobStore=job_store, fileStore=fileStore, defer=defer)
  File "/scratch/jjsmit3/bin/cactus-bin-v2.9.0/venv-cactus-v2.9.0/lib/python3.10/site-packages/toil/job.py", line 2984, in _runner
    returnValues = self._run(jobGraph=None, fileStore=fileStore)
  File "/scratch/jjsmit3/bin/cactus-bin-v2.9.0/venv-cactus-v2.9.0/lib/python3.10/site-packages/toil/job.py", line 2895, in _run
    return self.run(fileStore)
  File "/scratch/jjsmit3/bin/cactus-bin-v2.9.0/venv-cactus-v2.9.0/lib/python3.10/site-packages/toil/job.py", line 3158, in run
    rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
  File "/scratch/jjsmit3/bin/cactus-bin-v2.9.0/venv-cactus-v2.9.0/lib/python3.10/site-packages/cactus/paf/local_alignment.py", line 383, in chain_one_alignment
    cactus_call(parameters=['paffy', 'chain', "-i", alignment_path,
  File "/scratch/jjsmit3/bin/cactus-bin-v2.9.0/venv-cactus-v2.9.0/lib/python3.10/site-packages/cactus/shared/common.py", line 910, in cactus_call
    raise RuntimeError("{}Command {} exited {}: {}".format(sigill_msg, call, process.returncode, out))
RuntimeError: Command ['paffy', 'chain', '-i', '/scratch/jjsmit3/compar/cactus2/tmp/toilwf-6e7bff2e77ae58f6badf2109cb081918/3903/job/tmpzyr_2xvq/L-LH_vs_LV.paf', '--maxGapLength', '1000000', '--chainGapOpen', '5000', '--chainGapExtend', '1', '--trimFraction', '1.0', '--logLevel', 'INFO'] exited 1: stderr=Input file string : /scratch/jjsmit3/compar/cactus2/tmp/toilwf-6e7bff2e77ae58f6badf2109cb081918/3903/job/tmpzyr_2xvq/L-LH_vs_LV.paf
Output file string : (null)
Trim chained alignment ends by : 1.000000 %
Maximum gap length : 1000000
Chain gap open : 5000
Chain gap extend : 1
Expand hash failed

glennhickey commented 1 month ago

The disk usage warning is rarely fatal. It seems the cause of your crash is

Expand hash failed

which will happen when you run out of memory. These monster PAF files that cause issues are usually due to insufficient repeat masking, a subject that is an ongoing thorn in cactus's side.
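For a sense of scale on the disk warning, here is a minimal sketch that pulls the two byte counts out of the Toil message quoted in the log and computes the overage. The helper name `disk_overage` and the regular expression are illustrative assumptions, not part of Toil or cactus; only the message text itself comes from the log.

```python
import re

# Warning text copied from the worker log in this issue.
WARNING = (
    "Job 'chain_one_alignment' kind-chain_one_alignment/instance-9c2cwbjx v4 "
    "used 100.00% disk (1.1 TiB [1160905360384B] used, "
    "1.1 TiB [1160905270440B] requested)."
)


def disk_overage(message: str) -> tuple[int, int, int]:
    """Return (used, requested, overage) in bytes from a Toil disk warning.

    Hypothetical helper: the pattern is written against the message format
    seen in this log and may not match other Toil versions.
    """
    match = re.search(r"\[(\d+)B\] used, .*?\[(\d+)B\] requested", message)
    if match is None:
        raise ValueError("no disk usage figures found in message")
    used, requested = int(match.group(1)), int(match.group(2))
    return used, requested, used - requested


used, requested, overage = disk_overage(WARNING)
print(f"used {used} B, requested {requested} B, over by {overage} B")
```

On this log the job went over its request by 89944 bytes, a tiny fraction of the 1.1 TiB requested, which fits the reading that the disk warning is incidental and that `Expand hash failed` is the real failure.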

Are your input genomes well masked with RepeatMasker? Just how big are these chromosomes?
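One quick way to answer both questions is to count lowercase bases per sequence. The sketch below assumes the inputs are soft-masked FASTA (masked bases in lowercase), which is an assumption about the files rather than something stated in this thread; `softmasked_fraction` is a hypothetical helper, not a cactus function.

```python
import io
from typing import Dict, Iterable


def softmasked_fraction(fasta_lines: Iterable[str]) -> Dict[str, float]:
    """Map each sequence name to its fraction of lowercase (soft-masked) bases."""
    counts: Dict[str, list] = {}  # name -> [masked bases, total bases]
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            counts[name] = [0, 0]
        elif name is not None:
            counts[name][0] += sum(1 for base in line if base.islower())
            counts[name][1] += len(line)
    return {n: (m / t if t else 0.0) for n, (m, t) in counts.items()}


# Tiny in-memory example; for a real genome, pass an open file handle instead.
example = io.StringIO(">chr1\nACGTacgt\nacgtACGT\n>chr2\nACGTACGTac\n")
for seq_name, fraction in softmasked_fraction(example).items():
    print(f"{seq_name}\t{fraction:.1%}")
```

Because the function iterates line by line, it should handle multi-gigabase chromosomes without loading them into memory, for example `with open(path) as handle: softmasked_fraction(handle)` where `path` is your own FASTA file.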

jeramiahsmith commented 1 month ago

This set of alignments is between 5 orthologous salamander chromosomes that range in size from 1.0 to 1.7 Gb. This is the smallest set of cleanly orthologous chromosomes for that group. The k-mer-based masking pipeline in cactus ran and masked between 4 and 72% of the sequence of each chromosome. We can certainly do additional repeat masking and try again. We did alignments between two of these species with an older version of cactus after external repeat masking, and that ran to completion, though those were not the two species being aligned when the pipeline crashed.

-- Jeramiah Smith Professor Department of Biology University of Kentucky Lexington, KY 40506

glennhickey commented 1 month ago

Yeah, I think more masking and/or more memory is the only way forward here. I think this is something I'm going to run into soon as well once we start the VGP alignment. I've been playing with different lastz parameter sets, but still don't have a satisfactory solution.

jeramiahsmith commented 1 month ago

It has certainly gotten further and runs faster with masking (RepeatMasker
