Open dereneaton opened 8 years ago
The goal of mindepth is to help identify sequencing error, if I understand correctly. If the Illumina error rate is < 0.1%, I guess the question is: what are the chances there is a bad read in the non-overlapping regions, and is this worth trying to account for?
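To put a rough number on that question, here is a back-of-the-envelope sketch. It assumes errors are independent across bases and uses 0.1% as an upper bound on the per-base error rate. The read length (100 bp) and overlap (20 bp) are made-up illustrative values, and `p_any_error` is a hypothetical helper, not something from the codebase.

```python
def p_any_error(per_base_error: float, n_bases: int) -> float:
    """Probability that at least one of n_bases is miscalled,
    assuming each base errs independently at the same rate."""
    return 1.0 - (1.0 - per_base_error) ** n_bases


# Hypothetical example: two 100 bp reads that overlap by 20 bp.
# Each read contributes (100 - 20) bases that are covered only once.
read_len = 100
overlap = 20
flank_bases = 2 * (read_len - overlap)  # 160 single-covered bases

print(f"single-covered bases: {flank_bases}")
print(f"P(at least one error in flanks): {p_any_error(0.001, flank_bases):.3f}")
```

With these assumed numbers the chance of at least one error somewhere in the flanks comes out around 15%, and it shrinks as the overlap grows or the real error rate drops below the 0.1% bound.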
I don't have a good sense of how many reads typically merge and what the distribution of overlap lengths is, so yeah this seems like a tricky problem. It's probably "safest" to count incomplete merges as 1x, but it's probably easiest to count them as 2x.
My belief is that if we count them as 2x, then the sequencing errors in the flanking regions will wash out in the mix, just by virtue of the volume of data we're processing. I'd be willing to hear other arguments though.
Paired reads that are completely merged should count as 2 instances of that read, since every base was sequenced twice, right? This should be taken into account when applying mindepth.
But what do we do for merged reads that were only partially overlapping? We could err on the side of counting them as double, since we require reads to overlap by a fair amount before merging (minovlen=20).
Not sure what to do here...
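For the sake of discussion, the two options could be sketched like this. This is only an illustration of the choice, not how the pipeline currently works: `merged_read_depth` and its `policy` argument are hypothetical names, and the only value taken from above is minovlen=20.

```python
def merged_read_depth(len_r1, len_r2, overlap, policy="double", minovlen=20):
    """Return how many times a merged read pair counts toward depth.

    policy="double": partially overlapping merges count as 2 (the easy option).
    policy="single": partially overlapping merges count as 1 (the safe option).
    Fully overlapping merges always count as 2.
    """
    if overlap < minovlen:
        raise ValueError("overlap shorter than minovlen; pair would not be merged")
    if overlap > min(len_r1, len_r2):
        raise ValueError("overlap cannot exceed the shorter read")
    if policy not in ("single", "double"):
        raise ValueError("policy must be 'single' or 'double'")

    merged_len = len_r1 + len_r2 - overlap
    # Fully merged: the overlap spans the whole merged read,
    # so every base was sequenced twice.
    if overlap == merged_len:
        return 2
    return 2 if policy == "double" else 1


print(merged_read_depth(100, 100, 100))                  # fully merged
print(merged_read_depth(100, 100, 20, policy="single"))  # partial, counted once
print(merged_read_depth(100, 100, 20, policy="double"))  # partial, counted twice
```

Only the fully merged case is unambiguous. For partial merges, the policy switch is exactly the 1x versus 2x decision being asked about here.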