COMBINE-lab / pufferfish

An efficient index for the colored, compacted, de Bruijn graph
GNU General Public License v3.0
107 stars 19 forks source link

Implementation of soft-clipping and CIGAR computation #25

Open haghshenas opened 3 years ago

haghshenas commented 3 years ago

Here is a short summary of the changes:

With these changes I compared the output of PuffAligner with Minimap2 for single-end reads. This comparison showed that generally, the alignment is computed correctly in most of the cases.

The next essential step: In some cases the alignment is wrong. The reason is that the extension is stopped (ez->stopped==1) and therefore, the score for reaching the end of query is not calculated properly. As a result, it might be the case that score for reaching end of query is incorrectly higher than the maximum scoring prefix alignment. Here is an example:

pufferfish alignment
TCAAAAACATGATTACAGGGACATCTCAGGCTGACTGTGCTGTCCTGATTGTTGCTGCTGGTGTTGGTGAATTTGAAGCTGGTATCTCCAAGAATGGGCN
||||||||||||||||||||||||||||||||||||x|||||||||||||||||||||||||||||||||||||||||x|x|x|||x|||x||||||||x
TCAAAAACATGATTACAGGGACATCTCAGGCTGACTCTGCTGTCCTGATTGTTGCTGCTGGTGTTGGTGAATTTGAAGGTAGCATCCCCAGGAATGGGCA

minimap alignment
TCAAAAACATGATTACAGGGACATCTCAGGCTGACTGTGCTGTCCTGATTGTTGCTGCTGGTGTTGGTGAATTTGAAG----------------------
||||||||||||||||||||||||||||||||||||x|||||||||||||||||||||||||||||||||||||||||
TCAAAAACATGATTACAGGGACATCTCAGGCTGACTCTGCTGTCCTGATTGTTGCTGCTGGTGTTGGTGAATTTGAAGGTAGCATCCCCAGGAATGGGCA

It seems that extension stopping was not written with soft-clipping and computation of CIGAR strings in mind, so this requires a fix. I will check if the original KSW2 extension fixes this issue and if it does, we can discuss about whether it's worth to have the stopping feature or not.

So next items:

mohsenzakeri commented 3 years ago

I looked at the four commits and all the changes look great!

I tested the soft-clipping with a couple of small examples and I found a case with the end2end mode where we don't expect to see any soft-clipped alignment, but we do. Here is the example:

query:
TCAAAAACATGATTACAGGGACATCTCAGGCTGACTGTGCTGTCCTGATTGTTGCTGCTGGTGTTGGTGAATTTGAAGCTGGTATCTCCAAGAATGGGCTGTGTGTGTGTGTCCCCCCC
target:
TCAAAAACATGATTACAGGGACATCTCAGGCTGACTGTGCTGTCCTGATTGTTGCTGCTGGTGTTGGTGAATTTGAAGCTGGTATCTCCAAGAATGGGCTGTGTGTGTGTGT
or
TCAAAAACATGATTACAGGGACATCTCAGGCTGACTGTGCTGTCCTGATTGTTGCTGCTGGTGTTGGTGAATTTGAAGCTGGTATCTCCAAGAATGGGCTGTGTGTGTGTGTT

For these targets, when running the command pufferfish align -i new.index/ --read read.fa -o out.sam --end2end results in the aligning the read with soft-clipped bases: 113M6S and 112M7S Maybe, in this case, we wanna report insertions (I) instead of (S).

Also in the case of running without the end2end flag, the alignment scores seem wrong.

I have some fixes which I can commit, but we can also have a discussion before going forward.

mohsenzakeri commented 3 years ago

Based on my review and the conversations we had about the changes before, everything looks good to me.