Alignment does not take into account quality data

Issue by GoogleCodeExporter Tuesday Mar 29, 2016 at 22:33 GMT Originally opened as https://github.com/catdesk/seqtrace-a/issues/5

Because the alignment does not take quality data into account, it introduces 
errors into the final sequence that would be dismissed during manual 
inspection. All settings were left at the install defaults except the minimum 
quality, which was set to 20 for the purpose of showing Example 2. 

Example 1: Insertion errors (insertions.png)
Trace 1 is high quality at this position and reads CC. Trace 2 is just 
beginning and contributes a low-quality N, which is inserted into the 
sequence. The final base call becomes CNC, which is clearly wrong. 

Example 2: Bayesian poisoning due to misalignments (starting.png)
Trace 1 has a low-quality start that is misaligned. It has a C with a quality 
of 23. The misalignment pairs it with a G with a quality of 28, and the 
position is called N because of the disagreement, throwing off the Bayesian 
base caller. The preceding bases (the A and C), which have lower qualities, 
are called correctly. 

To suppress errors of the first kind, code could be added to look for 
insertions of N within a high-quality (above the minimum threshold) region 
and automatically remove them. 
To suppress errors of the second kind, a "trace-trimming" feature could be 
implemented, reusing the code that trims the final sequence, in order to 
remove misaligned starts and ends of traces. 
Since traces also suffer from clusters (~10 bases) of low-quality data points 
in the 20-160 base range, these should also be treated appropriately during 
alignment. 
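
A minimal Python sketch of the two stopgaps described above, not SeqTrace's actual code. The column representation, the function names `drop_n_insertions` and `trim_trace_ends`, and the window parameters are all assumptions made for illustration. The trimming rule used here (keep the span between the first and last windows in which enough bases meet the minimum quality) is one common approach and may differ from how SeqTrace trims the final sequence.

```python
from typing import List, Tuple

# Hypothetical representation: each alignment column holds one call per trace.
Call = Tuple[str, int]        # (base, Phred quality); a gap is ("-", 0)
Column = Tuple[Call, Call]    # (call from trace 1, call from trace 2)
GAP = "-"


def _is_confident(call: Call, min_quality: int) -> bool:
    """True for a real base (not a gap or N) at or above the threshold."""
    base, qual = call
    return base not in (GAP, "N") and qual >= min_quality


def drop_n_insertions(columns: List[Column], min_quality: int = 20) -> List[Column]:
    """Drop columns where one trace inserts a low-quality N against a gap
    in the other trace, and the gapped trace is confident on both sides."""
    kept = []
    for i, col in enumerate(columns):
        spurious = False
        for t in (0, 1):  # t indexes the trace that carries the gap
            gap_call, ins_call = col[t], col[1 - t]
            if gap_call[0] != GAP or ins_call[0] != "N" or ins_call[1] >= min_quality:
                continue
            left_ok = i > 0 and _is_confident(columns[i - 1][t], min_quality)
            right_ok = i + 1 < len(columns) and _is_confident(columns[i + 1][t], min_quality)
            if left_ok and right_ok:
                spurious = True
        if not spurious:
            kept.append(col)
    return kept


def trim_trace_ends(quals: List[int], min_quality: int = 20,
                    window: int = 10, required: int = 8) -> Tuple[int, int]:
    """Return (start, end) so that quals[start:end] begins and ends with a
    window in which at least `required` bases meet `min_quality`.
    Returns (0, 0) if no such window exists."""
    good = [q >= min_quality for q in quals]

    def first_good_window(flags: List[bool]):
        for start in range(0, len(flags) - window + 1):
            if sum(flags[start:start + window]) >= required:
                return start
        return None

    start = first_good_window(good)
    if start is None:
        return (0, 0)
    tail = first_good_window(good[::-1])  # same search from the other end
    return (start, len(quals) - tail)
```

Applied to a single trace before alignment, `trim_trace_ends` would discard an isolated base such as the quality-23 C in Example 2, because the bases around it fall below the threshold even though the C itself does not.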

The ideal solution would be an alignment algorithm that takes quality into 
account, but lacking that, the measures above would be good stopgaps.
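
As a rough illustration of what "taking quality into account" could mean, the sketch below scales a substitution score by the probability that both base calls are correct, using the standard Phred relation Q = -10 log10(p). The function names and the match/mismatch values are assumptions, and this is only the scoring piece that an aligner would call, not a full alignment algorithm or anything SeqTrace currently does.

```python
def error_probability(phred_quality: float) -> float:
    """Convert a Phred quality score to the probability the call is wrong."""
    return 10 ** (-phred_quality / 10)


def weighted_score(base1: str, qual1: float, base2: str, qual2: float,
                   match: float = 1.0, mismatch: float = -1.0) -> float:
    """Match/mismatch score scaled by the chance that both calls are correct.

    Low-quality positions contribute little either way, so a noisy trace
    start neither earns much for chance matches nor costs much for mismatches.
    """
    confidence = (1 - error_probability(qual1)) * (1 - error_probability(qual2))
    raw = match if base1 == base2 else mismatch
    return raw * confidence


# Usage: the mismatch from Example 2 (C at quality 23 against G at quality 28)
example_2_penalty = weighted_score("C", 23, "G", 28)
```

With this weighting, a disagreement between two high-quality calls is penalized almost fully, while a disagreement involving a very low-quality call is nearly free.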

Original issue reported on code.google.com by linyiers...@gmail.com on 14 Aug 2014 at 8:33
