Open ekinda opened 2 years ago
Yes, MAF over-fragmentation is a longstanding annoyance. AFAIK, MAF fragments are mostly caused by small rearrangements or breaks in ancestral contigs. The --maxRefGap
option seems to have its own issues too (#586 #587 #588 #589 #590 #604), and the slow running time is also a problem.
I hope we can find the resources to get this fixed up in the coming months, because it's becoming a bottleneck all over the place. Unfortunately I don't have much to suggest in the meantime.
Because each sequence in a MAF block has to be a single linear (contiguous, same-strand) stretch, MAF blocks are not necessarily biologically meaningful. A small inversion in one sequence is enough to break a MAF block.
One tool that might be useful for looking at continuity between any two genomes in a HAL file is halSynteny.
Ekin Deniz Aksu @.***> writes:
Hello, I used cactus to align 10 primate whole genomes (FASTA files of reference genomes downloaded from UCSC, soft repeat-masked). The job took around 12-13 days, in line with the expected runtime. I ran
cactus jobstore seqfile hal
with max core, memory and disk arguments, on a Linux system with 100 max cores and 500 GB max memory, using cactus version 2.0.5 local binaries. I then converted the HAL file (around 30 GB) to MAF with the command
hal2maf input.hal out.maf --refGenome hg38 --noAncestors --noDupes --onlyOrthologs --maxBlockLen 100000 --maxRefGap 1000
In the resulting MAF, there is not a single block longer than 10000 bp, which is highly unusual for primate alignments (where even 1 Mb blocks are seen with other aligners). Most blocks are very short; there are even blocks of 1 bp. Some are over 1000 bp, but not by much.
I am not sure if the issue lies with the input, parameters, hal2maf, or cactus itself.
If you need any other information to diagnose the issue, I will be happy to provide it.
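For anyone who wants to put numbers on the block-length observations above, the following is a minimal sketch that summarizes reference block sizes in a MAF file. It assumes the standard MAF layout, where each block starts with an "a" line and the first "s" line of the block is the reference row (hg38 here), with the aligned size in the fourth column. The function names and the tiny sample (including the panTro6 rows) are made up for illustration.

```python
import io
from statistics import median


def maf_block_lengths(handle):
    """Yield the size field of the first 's' line (reference row) of each block."""
    awaiting_reference = False
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # blank line between blocks
        if fields[0] == "a":
            awaiting_reference = True  # a new alignment block begins
        elif fields[0] == "s" and awaiting_reference:
            # s <src> <start> <size> <strand> <srcSize> <text>
            yield int(fields[3])
            awaiting_reference = False


def summarize(lengths, thresholds=(1000, 10000)):
    """Return block count, min/median/max, and counts of blocks above each threshold."""
    lengths = sorted(lengths)
    if not lengths:
        return {"blocks": 0}
    summary = {
        "blocks": len(lengths),
        "min": lengths[0],
        "median": median(lengths),
        "max": lengths[-1],
    }
    for t in thresholds:
        summary[f"over_{t}"] = sum(1 for n in lengths if n > t)
    return summary


# Made-up three-block sample, only to show the expected input shape.
SAMPLE = """##maf version=1
a
s hg38.chr1 100 5 + 1000 ACGTA
s panTro6.chr1 200 5 + 1000 ACGTA

a
s hg38.chr1 105 1 + 1000 C
s panTro6.chr1 900 1 - 1000 C

a
s hg38.chr1 106 12 + 1000 ACGTACGTACGT
s panTro6.chr1 300 12 + 1000 ACGTACGTACGT
"""

print(summarize(maf_block_lengths(io.StringIO(SAMPLE))))
```

For a real file, replace the `io.StringIO(SAMPLE)` handle with `open("out.maf")`. The generator streams the file line by line, so it should cope with large MAFs without loading them into memory.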