baoxingsong / AnchorWave

Sensitive alignment of genomes with high sequence diversity, extensive structural polymorphism and whole-genome duplication variation
MIT License
145 stars 19 forks source link

Too many mismatches and could you please give some advice on alignment parameters? #32

Closed Yutang-ETH closed 1 year ago

Yutang-ETH commented 2 years ago

Hi Baoxing,

First of all, thank you very much again for creating anchorwave, I had a really good experience with it.

Here's the thing, I tried anchorwave to align two homologous chromosomes recently using anchorwave genoAli with -IV and other default parameters. Overall, I got very good whole-chromosome alignment, however there are one or two regions where the alignment doesn't look very nice. I visualized the alignment using Jalview and I could see some strange patters like below, image It looks like the region shown in the plot is very divergent between two haplotypes, I am wondering which parameter I could tune to improve the alignment? Or maybe somehow leave this region unaligned. Could you please give a short answer on how this alignment pattern was caused and if it is possible could you please also give me some advice on how to tune parameters like -B, -O1, -E1, -O2, -E2 and so on? The problem with this kind of alignment is there are too many mismatches (40k SNPs out of 100 Kb).

Thank you very much in advance and best wishes, Yutang

baoxingsong commented 2 years ago

I did not play with your data carefully, so I am not sure I am giving the correct advice.

If you increase the values of -w, you might get better alignments. However, that would significantly increase the CPU and memory costs. I do not think you would like to do that. We are working hard to reduce the computational cost of AnchoWave, but the updated version is still not working.

The easiest way at this moment might be to investigate those suspect regions and remove them from the final alignment.

Yutang-ETH commented 2 years ago

Hi Baoxing,

Thank you very much for your prompt reply. I totally agree that increasing -w might help but yesterday, just after writing to you, I was so curious that I did a test by changing some parameters: -mi2 0.6 -fa3 50000 -B -19 -O1 -39 -O2 -81 -E1 -3 I got the idea from minimap2 to set the gap penalty values (preset asm5), I also decreased the -fa value for looking for novel anchors and I increased the hit identity of new anchor because I thought this might give more solid alignment. I am not sure these changes make sense or not, what do you think? My machine has around 900 Gb memory and the program now is using around 300 Gb. I see 1/5 alignment has finished, so maybe after days, I could report you the results.

Wish you good luck for upgrading anchorwave.

Yutang

Yutang-ETH commented 2 years ago

Hi Baoxing,

Just want to show the results to you. I converted maf to sam and then to paf and made a dotplot with the paf as input. The first plot is the result of genoAli -mi2 0.6 -fa3 50000 -B -19 -O1 -39 -O2 -81 -E1 -3 and the second plot is the result of genoAli default parameters. We could see that in the first plot, a large part of alignment is missing at the right end of the chromosome, I guess the reason is due to my parameter setting, which doesn't allow the alignment to extend. So, I think maybe I should just use the default setting and remove the suspect regions in the final alignment as you suggested. By the way, the two assemblies I compared each has a size of 1.7 Gb with 7 chromosomes, anchorwave took around 4 days to finish the alignment and the peak memory usage is around 200~300 Gb with 7 cores used. On the other hand, the two assemblies are pretty divergent (mash distance > 5%), which I guess makes it more challenging for alignment.

Thank you very much again for offering anchorwave, and wish you a big success improving it.

Yutang

image image

baoxingsong commented 1 year ago

The latest code is ~15-20 times faster than the first version.

Yutang-ETH commented 1 year ago

Thank you very much for your kind reminder. Wow, 15-20 times faster! I thinks this would be definitely helpful for comparing many genomes! I will try the latest code next time.

Great job!

Yutang