Large variation in run time and data recovery of cactus-hal2maf when switching reference genomes

Hi Glenn!

I'm currently running cactus-hal2maf on a new alignment I've generated using Cactus v2.8.2. I'm running Cactus as a singularity image on a computing cluster. Anyways, I have an alignment for a single chromosome from 62 species, with four of those species having a T2T chromosome. I've written my job script such that I can produce two MAF alignments at one time - one with the human as my reference genome, and one with the cow.

I'm not sure why, but there's a big disparity in run-time between the two (as well as in aligned sequences mapped back to the reference). For the human, hal2maf takes less than 5 minutes to run with the following parameters: --refGenome Homo_sapiens --noAncestors --chunkSize 1000000 --filterGapCausingDupes --dupeMode consensus. When I convert that MAF to a FASTA, it looks great - no issues!

When producing the cow-reference alignment, it timed out after 3 hours (which is unusual - I know that cactus-hal2maf is said to run slow, but I haven't run into this before in earlier runs for different alignments). I've gone through some previous issues on GitHub related to this and saw that you recommended specifying some additional parameters to make the SLURM run itself more efficient, which I've since done ( --refGenome Bos_taurus --noAncestors --chunkSize 1000000 --filterGapCausingDupes --dupeMode consensus --batchCount 4 --batchCores 2 --batchMemory 16GB ). This helped, but it took 6.5hrs. I tried converting this MAF to a FASTA, but I didn't seem to capture nearly as much aligned sequence as I did when using the human as a reference, which was odd based on the fact that both reference species are T2T and it's all the same input data. I also noticed that the cow chromosome in the FASTA alignment was much smaller than what it normally is.

I've looked back at the log files for both runs (the one that timed out and the one where I updated the parameters). It seems that:

a) for both runs, Cactus often gets caught on the following step: [2024-06-09T11:59:01-0500] [MainThread] [I] [toil-rt] First of 59 commands in parallel batch: set -eo pipefail && gzip -dc cactus_run10_BosTau_ref_chunk_0.maf.gz | sed -f genome_to_alnum.sed | /usr/bin/time -vp mafRowOrderer -m - --order-file genome.list 2> 0.sort.time | sed -f alnum_to_genome.sed | /usr/bin/time -vp taffy view 2> 0.m2t.time | /usr/bin/time -vp taffy norm -a cactus_run10.hal -k -d 2> 0.tn.time | sed -f genome_to_alnum.sed | maf_stream merge_dups consensus | mafRowOrderer -m - --order-file genome.list | sed -f alnum_to_genome.sed | bgzip > cactus_run10_BosTau_ref_chunk_0.norm.maf.gz && mv cactus_run10_BosTau_ref_chunk_0.norm.maf.gz cactus_run10_BosTau_ref_chunk_0.maf.gz [2024-06-09T11:59:01-0500] [MainThread] [I] [toil-rt] 2024-06-09 11:59:01.250084: Running the command: "bash -c set -eo pipefail && cat /tmp/job.10499457/toilwf-afa2039f8c3851b0a61b40b9edf79733/9645/f006/tmp7kbo1n_d/taf_cmds.txt | parallel -j 4 '{}'" [2024-06-09T12:57:27-0500] [MainThread] [I] [toil.leader] 1 jobs are running, 0 jobs are issued and waiting to run [2024-06-09T13:57:27-0500] [MainThread] [I] [toil.leader] 1 jobs are running, 0 jobs are issued and waiting to run
b) when adjusting the SLURM parameters, the order that the blocks were made in comparison to the reference sequence was reversed and started at the end - the reference sequence itself is about ~60,000,000 bp in length, and the first blocks to be converted are those starting at 45000000 to 60000000, such that: [2024-06-12T13:28:35-0500] [MainThread] [I] [toil-rt] Successfully ran hal2maf cactus_run10.hal stdout --refGenome Bos_taurus --refSequence X --start 45000000 --length 1000000 --maxBlockLen 10000 --noAncestors in time: 0.64 and memory: 9.6 Mi [2024-06-12T13:28:35-0500] [MainThread] [I] [toil-rt] Successfully ran hal2maf cactus_run10.hal stdout --refGenome Bos_taurus --refSequence X --start 46000000 --length 1000000 --maxBlockLen 10000 --noAncestors in time: 0.6 and memory: 9.7 Mi [2024-06-12T13:28:35-0500] [MainThread] [I] [toil-rt] Successfully ran hal2maf cactus_run10.hal stdout --refGenome Bos_taurus --refSequence X --start 47000000 --length 1000000 --maxBlockLen 10000 --noAncestors in time: 0.33 and memory: 9.8 Mi [2024-06-12T13:28:35-0500] [MainThread] [I] [toil-rt] Successfully ran hal2maf cactus_run10.hal stdout --refGenome Bos_taurus --refSequence X --start 48000000 --length 1000000 --maxBlockLen 10000 --noAncestors in time: 0.32 and memory: 9.7 Mi [2024-06-12T13:28:35-0500] [MainThread] [I] [toil-rt] Successfully ran hal2maf cactus_run10.hal stdout --refGenome Bos_taurus --refSequence X --start 49000000 --length 1000000 --maxBlockLen 10000 --noAncestors in time: 0.32 and memory: 9.6 Mi [2024-06-12T13:28:35-0500] [MainThread] [I] [toil-rt] Successfully ran hal2maf cactus_run10.hal stdout --refGenome Bos_taurus --refSequence X --start 50000000 --length 1000000 --maxBlockLen 10000 --noAncestors in time: 0.34 and memory: 9.7 Mi [2024-06-12T13:28:35-0500] [MainThread] [I] [toil-rt] Successfully ran hal2maf cactus_run10.hal stdout --refGenome Bos_taurus --refSequence X --start 51000000 --length 1000000 --maxBlockLen 10000 --noAncestors in time: 0.34 and memory: 9.7 Mi [2024-06-12T13:28:35-0500] [MainThread] [I] [toil-rt] Successfully ran hal2maf cactus_run10.hal stdout --refGenome Bos_taurus --refSequence X --start 52000000 --length 1000000 --maxBlockLen 10000 --noAncestors in time: 0.32 and memory: 9.8 Mi [2024-06-12T13:28:35-0500] [MainThread] [I] [toil-rt] Successfully ran hal2maf cactus_run10.hal stdout --refGenome Bos_taurus --refSequence X --start 53000000 --length 1000000 --maxBlockLen 10000 --noAncestors in time: 0.32 and memory: 9.6 Mi [2024-06-12T13:28:35-0500] [MainThread] [I] [toil-rt] Successfully ran hal2maf cactus_run10.hal stdout --refGenome Bos_taurus --refSequence X --start 54000000 --length 1000000 --maxBlockLen 10000 --noAncestors in time: 0.35 and memory: 9.8 Mi [2024-06-12T13:28:35-0500] [MainThread] [I] [toil-rt] Successfully ran hal2maf cactus_run10.hal stdout --refGenome Bos_taurus --refSequence X --start 55000000 --length 1000000 --maxBlockLen 10000 --noAncestors in time: 0.33 and memory: 9.6 Mi [2024-06-12T13:28:35-0500] [MainThread] [I] [toil-rt] Successfully ran hal2maf cactus_run10.hal stdout --refGenome Bos_taurus --refSequence X --start 56000000 --length 1000000 --maxBlockLen 10000 --noAncestors in time: 0.34 and memory: 9.7 Mi [2024-06-12T13:28:35-0500] [MainThread] [I] [toil-rt] Successfully ran hal2maf cactus_run10.hal stdout --refGenome Bos_taurus --refSequence X --start 57000000 --length 1000000 --maxBlockLen 10000 --noAncestors in time: 0.34 and memory: 13.5 Mi [2024-06-12T13:28:35-0500] [MainThread] [I] [toil-rt] Successfully ran hal2maf cactus_run10.hal stdout --refGenome Bos_taurus --refSequence X --start 58000000 --length 1476289 --maxBlockLen 10000 --noAncestors in time: 5.04 and memory: 85.9 Mi [2024-06-12T13:28:35-0500] [MainThread] [I] [toil-rt] First of 14 commands in parallel batch: set -eo pipefail && gzip -dc cactus_run10_BosTau_ref_chunk_0.maf.gz | sed -f genome_to_alnum.sed | /usr/bin/time -vp mafRowOrderer -m - --order-file genome.list 2> 0.sort.time | sed -f alnum_to_genome.sed | /usr/bin/time -vp taffy view 2> 0.m2t.time | /usr/bin/time -vp taffy norm -a cactus_run10.hal -k -d 2> 0.tn.time | sed -f genome_to_alnum.sed | maf_stream merge_dups consensus | mafRowOrderer -m - --order-file genome.list | sed -f alnum_to_genome.sed | bgzip > cactus_run10_BosTau_ref_chunk_0.norm.maf.gz && mv cactus_run10_BosTau_ref_chunk_0.norm.maf.gz cactus_run10_BosTau_ref_chunk_0.maf.gz

If you could help advise me on what to do, that would be appreciated. I figured that I could just run the initial command again and allocate more time to it as that created blocks starting from 1 to 60,000,000 but just took a long time to run? But I also wasn't sure if I would get different results and if it was significant that the order of the blocks was not starting from the start of the sequence.

Hope this makes sense, and I'd be happy to clarify any points that don't. Thanks for all your help.

ComparativeGenomicsToolkit / cactus

Large variation in run time and data recovery of cactus-hal2maf when switching reference genomes #1415