ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs
Other
481 stars 106 forks source link

cactus-hal2maf conversion skips aligned regions, when --maximumGapLength > 0 #1376

Open evgenyleushkin opened 1 month ago

evgenyleushkin commented 1 month ago

Dear cactus developers,

We recently encountered an issue with cactus-hal2maf conversion, when --maximumGapLength >0 (e.g. 50 or 100) is used to allow for the gaps in reference (hg38) sequence in alignment blocks. Conversion appears to work fine, when the parameter is not used. Is there smth we overlook or some easy way to fix it?

Below you can see the alignment with and without using --maximumGapLength.

Screenshot 2024-05-02 at 23 03 19

Thanks a lot for your help!

glennhickey commented 1 month ago

Not sure I understand -- Are you able to show a small MAF region instead of the browser view?

evgenyleushkin commented 1 month ago

In this example, top alignment is without gaps in reference (hg38) and bottom alignment is with gaps up to 100 in reference allowed. As you can see, we are missing alignment for species #10,11,12,16,17 on the right side of the bottom alignment. I chose browser view, since it's easier to see this way, since we are missing some or other sequences in the alignment blocks in maf representation. Skipped alignments don't depend on the maximumGapLength threshold, just simply appear when > 0.

glennhickey commented 1 month ago

I see it now. That certainly looks like a bug. Which version of Cactus are you using?

evgenyleushkin commented 1 month ago

2.6.13 (sorry for late response). But it's the same among different versions.

glennhickey commented 1 month ago

Are you able to share any data to help me reproduce this? Ideally the HAL file and the problem region. But even a "raw" maf file for the region may be enough.

evgenyleushkin commented 1 month ago

Problem is not for a specific region, but basically everywhere across maf-file. We can share of course. Hal file is 43Gb, and maf is 8.5Gb. Should I upload to some designated location? Otherwise I can upload to our cloud (would need your email address to send details).

glennhickey commented 1 month ago

That's fine, as long as I can get it with wget or something from the terminal. But you posted a screenshot before. If you can just extract the MAF for that region with and without options that you think are causing the problems, and share those two MAFs along with the two cactus-hal2maf command lines, than that should be enough (note that cactus-hal2maf can now process subregions...). Thanks!

evgenyleushkin commented 1 month ago

I used this command with gap_length=0 or gap_length>0 (100 on screenshot, but same is happening with gap_length=1)

cactus-hal2maf jobstore/hal2maf$gap_length final_alignment.hal final_alignment$gap_length.maf --refGenome hg38 --noAncestors --filterGapCausingDupes --chunkSize 1000000 --maximumGapLength $gap_length

which parameter is used for regions subset? Didn't notice subregions in help info... Otherwise I can send full mafs, but would need your email to send info for download.

glennhickey commented 1 month ago

They're in the genome selection section (you may need to upgrade to a newer cactus release if you don't see them)

What I think is happening is that when you turn off normalization (which effectively happens wit h--maximumGapLength 0) then you get very tiny MAF blocks. Then when you filter duplications after, each block is free to choose the duplication with the most matches (the heuristic it uses).

But if you leave the normalization on (using the default --maximumGapLength) then the MAF blocks are much, much bigger. When filtering duplicates on those, only one row can be chosen, and it may have many more gaps than the combination of rows chosen for the fragmented MAF.

evgenyleushkin commented 1 month ago

Not sure if I fully understand. I get the same good result (i.e. no weird gaps in non-reference sequences), if I use --maximumGapLength 0 or if I just leave it (default). But then I don't have gaps in my reference, which is not ideal for our downstream analyses. Weird gaps in non-reference sequences appear only when using --maximumGapLength > 0 (e.g. 100).

What combination of parameters would you recommend then, if we want to have gaps in reference sequence, but avoid this weird incorrect gaps in non-reference?

glennhickey commented 1 month ago

Say I have a MAF block like this

s  dog  0  5  +  1000  AA--AAA
s  cat  0  5  +  1000  AAAA--A
s  cat  50  6  + 1000  A-AAAAA

and I want to filter duplicates, then I'd end up with a gap in the second position

s  dog  0  5  +  1000  AA--AAA
s  cat  50  6  + 1000  A-AAAAA

But If my original MAF block was unmerged, I'd have two blocks

a
s  dog  0  2  +  1000  AA
s  cat  0  2  +  1000  AA
s  cat  50  1  + 1000  A-
a
s  dog  3  3 +   1000  AAA
s  cat  4  1  +  1000  --A
s  cat  53  3  + 1000  AAA

And If I filtered duplicates on them I'd have no gaps (since I don't need to keep any synteny through the cat row`

a
s  dog  0  2  +  1000  AA
s  cat  0  2  +  1000  AA
a
s  dog  3  3 +   1000  AAA
s  cat  53  3  + 1000  AAA

This is to show that the different normalization steps cactus-hal2maf does can combine to affect the coverage you see. If you merge together adjacent blocks, you will have a more syntenic, simpler MAF with probably 10X fewer blocks. But filtering duplicates will lead to more gaps.

Note that while duplicate filtering is off by default, the --maximumGapLength defaults to 30. So if you're not setting it, you will still get reference gaps (at least in recent cactus versions).

Can MAF export / normalization be further improved for better duplicate filtering? Probably -- I'm just explaining what the tool does currently.