Open evgenyleushkin opened 1 month ago
Not sure I understand -- Are you able to show a small MAF region instead of the browser view?
In this example, top alignment is without gaps in reference (hg38) and bottom alignment is with gaps up to 100 in reference allowed. As you can see, we are missing alignment for species #10,11,12,16,17 on the right side of the bottom alignment. I chose browser view, since it's easier to see this way, since we are missing some or other sequences in the alignment blocks in maf representation. Skipped alignments don't depend on the maximumGapLength threshold, just simply appear when > 0.
I see it now. That certainly looks like a bug. Which version of Cactus are you using?
2.6.13 (sorry for late response). But it's the same among different versions.
Are you able to share any data to help me reproduce this? Ideally the HAL file and the problem region. But even a "raw" maf file for the region may be enough.
Problem is not for a specific region, but basically everywhere across maf-file. We can share of course. Hal file is 43Gb, and maf is 8.5Gb. Should I upload to some designated location? Otherwise I can upload to our cloud (would need your email address to send details).
That's fine, as long as I can get it with wget
or something from the terminal. But you posted a screenshot before. If you can just extract the MAF for that region with and without options that you think are causing the problems, and share those two MAFs along with the two cactus-hal2maf
command lines, than that should be enough (note that cactus-hal2maf
can now process subregions...). Thanks!
I used this command with gap_length=0 or gap_length>0 (100 on screenshot, but same is happening with gap_length=1)
cactus-hal2maf jobstore/hal2maf$gap_length final_alignment.hal final_alignment$gap_length.maf --refGenome hg38 --noAncestors --filterGapCausingDupes --chunkSize 1000000 --maximumGapLength $gap_length
which parameter is used for regions subset? Didn't notice subregions in help info... Otherwise I can send full mafs, but would need your email to send info for download.
They're in the genome selection section (you may need to upgrade to a newer cactus release if you don't see them)
--bedRanges
or--refSequence
, --start
and --length
What I think is happening is that when you turn off normalization (which effectively happens wit h--maximumGapLength 0
) then you get very tiny MAF blocks. Then when you filter duplications after, each block is free to choose the duplication with the most matches (the heuristic it uses).
But if you leave the normalization on (using the default --maximumGapLength
) then the MAF blocks are much, much bigger. When filtering duplicates on those, only one row can be chosen, and it may have many more gaps than the combination of rows chosen for the fragmented MAF.
Not sure if I fully understand. I get the same good result (i.e. no weird gaps in non-reference sequences), if I use --maximumGapLength 0 or if I just leave it (default). But then I don't have gaps in my reference, which is not ideal for our downstream analyses. Weird gaps in non-reference sequences appear only when using --maximumGapLength > 0 (e.g. 100).
What combination of parameters would you recommend then, if we want to have gaps in reference sequence, but avoid this weird incorrect gaps in non-reference?
Say I have a MAF block like this
s dog 0 5 + 1000 AA--AAA
s cat 0 5 + 1000 AAAA--A
s cat 50 6 + 1000 A-AAAAA
and I want to filter duplicates, then I'd end up with a gap in the second position
s dog 0 5 + 1000 AA--AAA
s cat 50 6 + 1000 A-AAAAA
But If my original MAF block was unmerged, I'd have two blocks
a
s dog 0 2 + 1000 AA
s cat 0 2 + 1000 AA
s cat 50 1 + 1000 A-
a
s dog 3 3 + 1000 AAA
s cat 4 1 + 1000 --A
s cat 53 3 + 1000 AAA
And If I filtered duplicates on them I'd have no gaps (since I don't need to keep any synteny through the cat
row`
a
s dog 0 2 + 1000 AA
s cat 0 2 + 1000 AA
a
s dog 3 3 + 1000 AAA
s cat 53 3 + 1000 AAA
This is to show that the different normalization steps cactus-hal2maf
does can combine to affect the coverage you see. If you merge together adjacent blocks, you will have a more syntenic, simpler MAF with probably 10X fewer blocks. But filtering duplicates will lead to more gaps.
Note that while duplicate filtering is off by default, the --maximumGapLength
defaults to 30
. So if you're not setting it, you will still get reference gaps (at least in recent cactus versions).
Can MAF export / normalization be further improved for better duplicate filtering? Probably -- I'm just explaining what the tool does currently.
Dear cactus developers,
We recently encountered an issue with cactus-hal2maf conversion, when --maximumGapLength >0 (e.g. 50 or 100) is used to allow for the gaps in reference (hg38) sequence in alignment blocks. Conversion appears to work fine, when the parameter is not used. Is there smth we overlook or some easy way to fix it?
Below you can see the alignment with and without using --maximumGapLength.
Thanks a lot for your help!