galaxyproject / dunovo

Reference-free duplex sequencing pipeline.

Suboptimal alignment between strands #32

Closed NickSto closed 2 years ago

NickSto commented 2 years ago

Documenting another fixed bug here post facto.

There was an issue with the pairwise sequence aligner Du Novo was using to align the two single-strand consensus sequences of a duplex with each other.

The core issue was that it treated Ns the same as actual, non-N bases. So it would consider two aligned Ns to be a match just as valuable as two aligned Cs, which can make it worth opening gaps just to line up Ns with each other. The end effect was an increased number of indels in regions with high error rates.
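To make the effect concrete, here is a minimal sketch that scores two candidate alignments of the same pair of sequences under both schemes. The sequences, the function name `score_alignment`, and the scoring values (match +1, mismatch -1, gap -1, N scored as 0) are made up for illustration; they are not Du Novo's actual code or parameters.

```python
def score_alignment(row1, row2, n_aware, match=1.0, mismatch=-1.0, gap=-1.0):
    """Sum column scores for two equal-length alignment rows ('-' marks a gap).

    Illustrative scoring only. With n_aware=False, N is treated like any
    other base (the buggy behavior). With n_aware=True, any column that
    contains an N contributes 0, so Ns neither reward nor penalize.
    """
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have the same length")
    total = 0.0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            total += gap
        elif n_aware and "N" in (x, y):
            total += 0.0
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


# Two ways to align ACGNNTTA with ACGTNNTA (hypothetical sequences).
ungapped = ("ACGNNTTA", "ACGTNNTA")
gapped = ("ACG-NNTTA", "ACGTNN-TA")  # gaps inserted so the Ns line up

for n_aware in (False, True):
    label = "N-aware" if n_aware else "naive"
    print(label,
          "ungapped:", score_alignment(*ungapped, n_aware=n_aware),
          "gapped:", score_alignment(*gapped, n_aware=n_aware))
```

With these values the naive scheme scores the gapped alignment higher (5 vs. 4), so an aligner maximizing that score would introduce two indels. Once Ns are scored as neutral, the ungapped alignment wins (5 vs. 3).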

Here are some examples: https://docs.google.com/presentation/d/1qeB27_3FfjSN31r9kGQhydx7TiVCNfh3s6-84Iu6O4A/edit?usp=sharing

This was fixed in 272bb4939c (between 2.16 and 3.0) by adding the option to use another pairwise aligner: BioPython with a custom substitution matrix. Later, before the 3.0 release, this was made the default aligner.