Alignment blocks are too short, 1 bp blocks

ekinda commented 2 years ago

Hello, I used cactus to align 10 primate whole genomes (FASTA files of reference genomes downloaded from UCSC, soft repeat-masked). The job took around 12-13 days, consistent with the expected runtime. (I ran cactus jobstore seqfile hal with the max core, memory and disk arguments, on a Linux system with 100 max cores and 500 GB max memory, using the cactus version 2.0.5 local binaries.)

I converted the HAL file (around 30 GB) to MAF with the command:

hal2maf input.hal out.maf --refGenome hg38 --noAncestors --noDupes --onlyOrthologs --maxBlockLen 100000 --maxRefGap 1000

There is not a single block longer than 10000 bp, which is highly unusual for primate alignments (where even 1 Mb blocks are seen with other aligners). Most blocks are very short, and some are only 1 bp long. Some are over 1000 bp, but not by much.
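For anyone who wants to put numbers on the fragmentation described above, here is a minimal sketch that tallies block lengths from a MAF file. It is not part of cactus or HAL; the function names and the bin boundaries are made up for illustration, and it assumes the first "s" line of each block is the reference row (hg38 here).

```python
from collections import Counter
from typing import Iterable, Iterator, List, Sequence


def maf_block_lengths(lines: Iterable[str]) -> Iterator[int]:
    """Yield the reference-row size (in bases) of each MAF block.

    Assumption: the first "s" line after each "a" line is the reference row.
    """
    need_ref = False
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "a":
            # An "a" line opens a new alignment block.
            need_ref = True
        elif fields[0] == "s" and need_ref:
            # s <src> <start> <size> <strand> <srcSize> <text>
            yield int(fields[3])
            need_ref = False


def summarize(lengths: List[int],
              bins: Sequence[int] = (1, 10, 100, 1000, 10000)) -> Counter:
    """Count blocks per length bin; each key is the lower bound of its bin."""
    counts: Counter = Counter()
    for n in lengths:
        lower = max((b for b in bins if b <= n), default=0)
        counts[lower] += 1
    return counts
```

Usage would be something like `with open("out.maf") as fh: print(summarize(list(maf_block_lengths(fh))))`. Since the function consumes the file line by line, only the list of lengths is held in memory, not the MAF itself.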

I am not sure if the issue lies with the input, parameters, hal2maf, or cactus itself.

If you need any other information to diagnose the issue, I will be happy to provide it.

glennhickey commented 2 years ago

Yes, MAF over-fragmentation is a longstanding annoyance. AFAIK MAF fragments are mostly caused by small rearrangements or breaks in ancestral contigs. It seems that the --maxRefGap option has its own issues too (#586 #587 #588 #589 #590 #604). The slow running time is also an issue.

I hope we can find the resources to get this fixed up in the coming months, because it's becoming a bottleneck all over the place. Unfortunately I don't have much to suggest in the meantime.

diekhans commented 2 years ago

Due to the linear nature of each sequence in a block, MAF blocks are not necessarily biologically meaningful. A small inversion in one sequence will break a MAF block.
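To illustrate the point about inversions, here is a toy model. It is not how hal2maf builds blocks, and the names `with_inversion` and `colinear_blocks` are hypothetical. It maps each reference base to a query base and lets a block grow only while both sequences advance colinearly on the same strand.

```python
from typing import List, Optional, Tuple

Anchor = Tuple[int, int, str]  # (ref_pos, query_pos, strand)


def with_inversion(n: int, inv_start: int, inv_end: int) -> List[Anchor]:
    """Map n reference bases to a query that is identical except that
    the interval [inv_start, inv_end) is inverted."""
    anchors: List[Anchor] = []
    for r in range(n):
        if inv_start <= r < inv_end:
            # Inside the inversion the query runs backwards on the minus strand.
            anchors.append((r, inv_start + inv_end - 1 - r, "-"))
        else:
            anchors.append((r, r, "+"))
    return anchors


def colinear_blocks(anchors: List[Anchor]) -> List[int]:
    """Split anchors into maximal colinear runs and return their lengths."""
    lengths: List[int] = []
    run = 0
    prev: Optional[Anchor] = None
    for ref, qry, strand in anchors:
        step = 1 if strand == "+" else -1
        extends = (
            prev is not None
            and strand == prev[2]
            and ref == prev[0] + 1
            and qry == prev[1] + step
        )
        if extends:
            run += 1
        else:
            # Strand switch or coordinate jump: close the current block.
            if run:
                lengths.append(run)
            run = 1
        prev = (ref, qry, strand)
    if run:
        lengths.append(run)
    return lengths
```

With no rearrangement, `colinear_blocks(with_inversion(1000, 0, 0))` gives a single block, `[1000]`. A 10 bp inversion in the middle, `colinear_blocks(with_inversion(1000, 500, 510))`, gives `[500, 10, 490]`: one small event turns one block into three. In a multi-species MAF each additional sequence can contribute breakpoints of its own, so a block can extend only where every row is colinear.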

One tool that might be useful to look at continuity between any two genomes in a HAL is halSynteny.

ComparativeGenomicsToolkit / cactus

Alignment blocks are too short, 1 bp blocks #674