ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Large variation in run time and data recovery of cactus-hal2maf when switching reference genomes #1415

Open emistasis opened 2 weeks ago

emistasis commented 2 weeks ago

Hi Glenn!

I'm currently running cactus-hal2maf on a new alignment I've generated using Cactus v2.8.2. I'm running Cactus as a Singularity image on a computing cluster. I have an alignment for a single chromosome from 62 species, four of which have a T2T assembly of that chromosome. I've written my job script so that it produces two MAF alignments at once: one with human as the reference genome, and one with cow.

I'm not sure why, but there's a big disparity in run time between the two (as well as in the amount of aligned sequence mapped back to the reference). For the human, hal2maf takes less than 5 minutes to run with the following parameters: --refGenome Homo_sapiens --noAncestors --chunkSize 1000000 --filterGapCausingDupes --dupeMode consensus. When I convert that MAF to a FASTA, it looks great, with no issues.

When producing the cow-reference alignment, the job timed out after 3 hours. This is unusual: I know that cactus-hal2maf can be slow, but I haven't run into this in earlier runs on different alignments. I went through some previous GitHub issues related to this and saw that you recommended specifying some additional parameters to make the SLURM run more efficient, which I've since done (--refGenome Bos_taurus --noAncestors --chunkSize 1000000 --filterGapCausingDupes --dupeMode consensus --batchCount 4 --batchCores 2 --batchMemory 16GB). This helped, but the run still took 6.5 hours. I tried converting this MAF to a FASTA, but I didn't capture nearly as much aligned sequence as I did when using the human as the reference, which was odd given that both reference species are T2T and the input data are identical. I also noticed that the cow chromosome in the FASTA alignment was much shorter than its normal length.
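The two invocations described above differ only in --refGenome and in the batch options. A minimal Python sketch of assembling them is shown here, assuming the usual cactus-hal2maf positional order (job store, input HAL, output MAF). The job-store, HAL, and output paths are hypothetical placeholders and are not taken from the actual job script; the flags are the ones quoted in this post.

```python
import shlex

# Flags shared by both runs, as quoted in the post.
COMMON = ["--noAncestors", "--chunkSize", "1000000",
          "--filterGapCausingDupes", "--dupeMode", "consensus"]
# Extra batching flags that were added for the second cow run.
BATCH = ["--batchCount", "4", "--batchCores", "2", "--batchMemory", "16GB"]


def hal2maf_command(ref_genome, hal_path, batch=False):
    """Build a cactus-hal2maf argument list for one reference genome.

    The job-store and output names are hypothetical; each run gets its own
    job store so the two references do not collide.
    """
    jobstore = f"./js_{ref_genome}"
    out_maf = f"{ref_genome}.maf.gz"
    cmd = ["cactus-hal2maf", jobstore, hal_path, out_maf,
           "--refGenome", ref_genome, *COMMON]
    if batch:
        cmd += BATCH
    return cmd


if __name__ == "__main__":
    # "chr.hal" is a placeholder for the single-chromosome alignment.
    for ref, batch in [("Homo_sapiens", False), ("Bos_taurus", True)]:
        print(shlex.join(hal2maf_command(ref, "chr.hal", batch=batch)))
```

Printing the commands rather than running them makes it easy to confirm that the only intended differences between the human and cow runs are the reference name and the batch settings.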

I've looked back at the log files for both runs (the one that timed out and the one where I updated the parameters). It seems that:

If you could help advise me on what to do, that would be appreciated. I figured that I could just rerun the initial command with more time allocated, since it created blocks from 1 to 60,000,000 and simply took a long time to run. But I wasn't sure whether I would get different results, or whether it matters that the blocks were not produced in order from the start of the sequence.

Hope this makes sense, and I'd be happy to clarify any points that don't. Thanks for all your help.

glennhickey commented 1 week ago

Hi, it's hard to say exactly. You can double-check the coverage in your alignment using halCoverage: run it on both human and cow and compare. These genomes are diverged enough that if you only have one chromosome in your alignment, depending on the other species, you can have quite different coverages, and this could lead to quite different MAFs.

The step that's taking long, taffy norm, is probably spending its time adding "inserted" sequences into the alignment. For cow, these would be sequences that don't align directly to cow, but align to each other. This would be consistent with cow aligning more poorly overall in your alignment (which should be evident in the coverage stats).

You can also use taffy coverage to look at the coverages in your MAF files. As for the order things are run in: it's effectively random -- toil submits them all at once.
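To make "coverage" concrete, here is a minimal Python sketch of the idea. It is not how halCoverage or taffy coverage are implemented, just an illustration: for each non-reference genome, it counts how many reference bases in a MAF sit in a column where that genome has a non-gap base. It assumes an uncompressed MAF, that the first `s` row of each block is the reference, and that sequence names follow the `genome.sequence` convention.

```python
from collections import defaultdict


def maf_coverage(lines):
    """Return (ref_bases, {genome: ref_bases_aligned_to_genome}).

    Assumes the first 's' row of every block is the reference row.
    """
    ref_bases = 0
    aligned = defaultdict(int)
    block = []  # list of (genome, alignment_text) for the current block

    def flush(rows):
        nonlocal ref_bases
        if not rows:
            return
        ref_text = rows[0][1]
        # Columns where the reference has a base rather than a gap.
        ref_cols = [i for i, c in enumerate(ref_text) if c != "-"]
        ref_bases += len(ref_cols)
        # A genome can appear in several rows (duplications), so each
        # reference column is counted at most once per genome.
        covered = defaultdict(set)
        for genome, text in rows[1:]:
            for i in ref_cols:
                if text[i] != "-":
                    covered[genome].add(i)
        for genome, cols in covered.items():
            aligned[genome] += len(cols)

    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "a":
            # A blank line or a new 'a' line ends the current block.
            flush(block)
            block = []
        elif fields[0] == "s":
            genome = fields[1].split(".", 1)[0]
            block.append((genome, fields[6]))
    flush(block)
    return ref_bases, dict(aligned)


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python maf_cov.py < cow_ref.maf
    total, per_genome = maf_coverage(sys.stdin)
    for genome, n in sorted(per_genome.items()):
        print(genome, n, n / total if total else 0.0)
```

Running this on the human-referenced and cow-referenced MAFs and comparing the per-genome fractions should show whether cow really aligns more poorly to the other species in this alignment.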

To get better alignments in general, and more consistency between different chosen references, I strongly recommend aligning the whole genomes at once, rather than just a chromosome.