bioforensics / MicroHapulator

Tools for empirical microhaplotype calling, forensic interpretation, and simulation.
https://microhapulator.readthedocs.io/

Add initial N-filtering step #164

Closed (standage closed this issue 8 months ago)

standage commented 10 months ago

Over the last couple of days I've been looking at reads from the NimaGen kit, trying to figure out what is causing reads not to merge and, when they do merge, what is causing them to fail to map.

We've long suspected primer dimer to be the main source of unmapped reads, and manual examination has borne this out. (Interestingly, the alignment algorithms I tried weren't too helpful—I'd probably need to spend more time tuning parameters.) So that's that.

It appears that reads composed entirely of Ns constitute the bulk of the unmerged reads. If we can remove these prior to running FLASH, it should help us determine if there are any other significant contributors to merge failures.

Accordingly, I propose we add a step to the beginning of the workflow that checks reads for N content: if the number or percentage of bases in a read that are Ns exceeds some threshold, the read pair should be discarded. We should also track whether the offender is R1, R2, or both. And when only one mate is the offender, the other mate should be written to a separate file for subsequent examination.
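
To make the proposal concrete, here is a minimal sketch of what such a filter might look like. It is not MicroHapulator code: the function names, the return structure, and the `max_n_fraction` default of 0.25 are all hypothetical, and it assumes plain 4-line FASTQ records with R1 and R2 in the same order.

```python
from collections import Counter
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    lines = iter(handle)
    while True:
        record = [line.rstrip("\n") for line in islice(lines, 4)]
        if len(record) < 4:  # end of input (or a truncated final record)
            return
        header, seq, _, qual = record
        yield header, seq, qual


def n_fraction(seq):
    """Fraction of bases in the read that are N (0.0 for an empty read)."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


def filter_n_pairs(r1_records, r2_records, max_n_fraction=0.25):
    """Split read pairs into kept pairs and orphaned mates, tallying offenders.

    max_n_fraction is a hypothetical threshold, not a value from this issue.
    A read is an "offender" if its N fraction is strictly above the threshold.
    """
    kept, orphans = [], []
    counts = Counter()
    for rec1, rec2 in zip(r1_records, r2_records):
        bad1 = n_fraction(rec1[1]) > max_n_fraction
        bad2 = n_fraction(rec2[1]) > max_n_fraction
        if not bad1 and not bad2:
            kept.append((rec1, rec2))
            counts["kept"] += 1
        elif bad1 and bad2:
            counts["both"] += 1
        elif bad1:
            counts["r1_only"] += 1
            orphans.append(rec2)  # R1 failed; keep its mate for examination
        else:
            counts["r2_only"] += 1
            orphans.append(rec1)  # R2 failed; keep its mate for examination
    return kept, orphans, counts
```

The kept pairs would go on to FLASH, the orphaned mates to a separate file, and the counts into whatever summary the workflow reports. A count-based threshold could replace the fraction if that turns out to be more useful.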