ekg / seqwish

alignment to variation graph inducer
MIT License
143 stars 18 forks source link

abnormal alignments #109

Open sdws1983 opened 1 year ago

sdws1983 commented 1 year ago

Hello!

I am using seqwish for pan-genome construction, and I have a question about the alignment result.

Typically, according to the pggb pipeline, the wfmash tool is used for mapping and the resulting PAF file is input into seqwish for construction. However, I have used anchorwave instead of wfmash for mapping, and after some format conversion, I have obtained a PAF-format alignment result. I have input this PAF file into seqwish and followed the pggb pipeline for subsequent steps (including smoothxg and odgi). However, the resulting 2D graph appears to be mostly composed of abnormal alignments, even though the anchorwave alignment results are better than those obtained using wfmash (including longer total alignment length). I suspect that the cause of the abnormal 2D graph is the shorter average alignment length using anchorwave (around 25k versus over 400k with wfmash). However, I am not sure if my understanding is correct and how to modify the seqwish parameters to avoid abnormal 2D graphs. Can anyone provide me with some suggestions? Thank you!

anchorwave graph: 下载

wfmash graph: 下载 (1)

Here are some of the commands I used:

seqwish -s $fa -p $paf -k $k -f 0 -g ${prefix}.seqwish.gfa -B 10000000 -t 48 --temp-dir ./ -P
smoothxg -t 48 -T 48 -g ${prefix}.seqwish.gfa -r 2 --chop-to 100 -I .9000 -R 0 -j 0 -e 0 -l 13033,13177 -P 1,4,6,2,26,1 -O 0.03 -Y 1100 -d 0 -D 0 -S -Q Consensus_ -V -o ${prefix}.smooth.gfa
gfaffix ${prefix}.smooth.gfa -o ${prefix}.smooth.fix.gfa

Thanks!

Yumin

AndreaGuarracino commented 1 year ago

Hi @sdws1983, are you showing the graphs that seqwish emits, right? or the output of smoothxg? Would you please explain more about why anchorwave's alignments are better? The anchorwave graph looks strongly underaligned.

sdws1983 commented 1 year ago

Hi @sdws1983, are you showing the graphs that seqwish emits, right? or the output of smoothxg? Would you please explain more about why anchorwave's alignments are better? The anchorwave graph looks strongly underaligned.

I used the output of smoothxg.

For the alignment quality, I calculated the total Number of residue matches and Alignment block length for both wfmash and anchorwave (The sum of the ninth and tenth columns in the paf file): for anchorwave, the total matches is 42654922 and alignment block length is 202495300, which slightly higher than the wfmash (41712802 and 88998748).

In fact, the anchorwave graph looks strongly underaligned is so strange, because I noticed some perfect alignments (only a few mismatches) in paf were underaligned in the final gfa graph.

AndreaGuarracino commented 1 year ago

Could you show the output of seqwish, instead? For both wfmash's PAF and anchorwave's PAF.

sdws1983 commented 1 year ago

Sure! Here are the outputs:

wfmash: image

anchorwave: image

AndreaGuarracino commented 1 year ago

Thank you! I was not expecting so much white on the top-right! Any chance to share the FASTA input?

ekg commented 1 year ago

What is the length of each graph (from seqwish or smoothxg)? odgi stats -S should provide this.

On Mon, Feb 20, 2023, 20:01 Andrea Guarracino @.***> wrote:

Thank you! I was not expecting so much white on the top-right! Any chance to share the FASTA input?

— Reply to this email directly, view it on GitHub https://github.com/ekg/seqwish/issues/109#issuecomment-1437757748, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEP3Z6ALNK6DGBCN3GTWYQOY5ANCNFSM6AAAAAAVBZPZCU . You are receiving this because you are subscribed to this thread.Message ID: @.***>

sdws1983 commented 1 year ago

By the way, when I merged the adjacent alignment blocks in anchorwave results, it seems OK now:

image

sdws1983 commented 1 year ago

What is the length of each graph (from seqwish or smoothxg)? odgi stats -S should provide this. On Mon, Feb 20, 2023, 20:01 Andrea Guarracino @.> wrote: Thank you! I was not expecting so much white on the top-right! Any chance to share the FASTA input? — Reply to this email directly, view it on GitHub <#109 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEP3Z6ALNK6DGBCN3GTWYQOY5ANCNFSM6AAAAAAVBZPZCU . You are receiving this because you are subscribed to this thread.Message ID: @.>

for anchorwave:

#length nodes   edges   paths   steps
262082885       188761  257675  2       257231

for wfmash:

#length nodes   edges   paths   steps
227029223       2326424 3171930 2       3234227
sdws1983 commented 1 year ago

@AndreaGuarracino @ekg So I wonder why the abnormal graph disappears when I merge the alignment intervals to make them larger. Is there any way that seqwish can handle PAF files with smaller alignment intervals?

ekg commented 1 year ago

Note that the anchorwave graph is 40mbp smaller than the wfmash one.

That suggests a significant amount of difference in the alignments. It's not clear which is true but in general a smaller graph indicates a better input alignment.

On Wed, Feb 22, 2023, 20:22 Yumin Huang @.***> wrote:

@AndreaGuarracino https://github.com/AndreaGuarracino @ekg https://github.com/ekg So I wonder why the abnormal graph disappears when I merge the alignment intervals to make them larger. Is there any way that seqwish can handle PAF files with smaller alignment intervals?

— Reply to this email directly, view it on GitHub https://github.com/ekg/seqwish/issues/109#issuecomment-1441139141, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEPY2KFLQX5BIXLCB43WY3CW3ANCNFSM6AAAAAAVBZPZCU . You are receiving this because you were mentioned.Message ID: @.***>

sdws1983 commented 1 year ago

Note that the anchorwave graph is 40mbp smaller than the wfmash one. That suggests a significant amount of difference in the alignments. It's not clear which is true but in general a smaller graph indicates a better input alignment. On Wed, Feb 22, 2023, 20:22 Yumin Huang @.> wrote: @AndreaGuarracino https://github.com/AndreaGuarracino @ekg https://github.com/ekg So I wonder why the abnormal graph disappears when I merge the alignment intervals to make them larger. Is there any way that seqwish can handle PAF files with smaller alignment intervals? — Reply to this email directly, view it on GitHub <#109 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEPY2KFLQX5BIXLCB43WY3CW3ANCNFSM6AAAAAAVBZPZCU . You are receiving this because you were mentioned.Message ID: @.>

Yes, I agree with your point. I used anchorwave even though it may lose some alignments, it simplifies the graph and makes downstream analysis more convenient.

What I want to say here is that I fetched the PAF file from anchorwave, and although the dot plot shows extensive aligned regions (as shown below), these alignments disappeared after running with seqwish.

image
baozg commented 1 year ago

@sdws1983 Did you keep the CIGAR in the PAF from AnchorWave? What I tried before in the tetraploid potato, the AnchorWave-based graph is similar to yours. But AnchorWave intend to create a large block with a large gap (benefits of gene anchor for repeat-rich genome), but it was not natively supported PAF output (basic format covert). Have you tried you use another CIGAR-based SV caller for AnchorWave alignment (eg. SyRI) to compare the variant callset?

sdws1983 commented 1 year ago

@sdws1983 Did you keep the CIGAR in the PAF from AnchorWave? What I tried before in the tetraploid potato, the AnchorWave-based graph is similar to yours. ButAnchorWave` intend to create a large block with a large gap (benefits of gene anchor for repeat-rich genome), but it was not natively supported PAF output (basic format covert). Have you tried you use another CIGAR-based SV caller for AnchorWave alignment (eg. SyRI) to compare the variant callset?

@baozg I keep the CIGAR from the sam file generated from maf-convert. So you think we should use another aligner for CIGAR generation based on the AnchorWave blocks? I do not fully understand your advice.

baozg commented 1 year ago

For the CIGAR inspection, you could try caller. But I am not sure of the difference between these, I think seqwish also builds graphs relying on the CIGAR. The default output of AnchorWave MAF output, I guess you follow the MAF->SAM-> PAF workflow. Ideally, it should be the same, but I am not sure if all these convert keep all the information or AnchorWave use the same pattern of CIGAR with other aligners (large gap blocks introduced by the gene synteny). I will try AnchorWave again for building graphs and update the information if I have found something new.

sdws1983 commented 1 year ago

For the CIGAR inspection, you could try caller. But I am not sure of the difference between these, I think seqwish also builds graphs relying on the CIGAR. The default output of AnchorWave MAF output, I guess you follow the MAF->SAM-> PAF workflow. Ideally, it should be the same, but I am not sure if all these convert keep all the information or AnchorWave use the same pattern of CIGAR with other aligners (large gap blocks introduced by the gene synteny). I will try AnchorWave again for building graphs and update the information if I have found something new.

Thanks, Zhigui. I have also been thinking about these issues recently. I thought it might be related to the anchorwave's alignment mechanism which induced many gaps in its CIGAR characters. Would this cause seqwish to consider the alignment quality to be low and delete these alignments? To address this, I tried lowering the -k parameter in seqwish to retain all the fine alignments, but the resulting graph still had almost no alignments.

On the other hand, when I used the maf file generated by anchorwave for the whole blocks and converted it to paf, I did not encounter the problem of large-scale mismatches when constructing the graph. This is very strange because the large block is composed of all small alignment results strung together. I have not yet found a reason to explain these issues, but I will continue to try to solve them. I hope we can continue to communicate.

baozg commented 1 year ago

If you want to use AnchorWave for the population-scale assembly variant calling with plant genomes, their solution was converting it into GVCF and merging (https://www.biorxiv.org/content/10.1101/2023.03.02.530873v1).