dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

double de-replicate count for completely merged reads #113

Open dereneaton opened 8 years ago

dereneaton commented 8 years ago

Paired reads that are completely merged should count as 2 instances of that read, since every base was sequenced twice, right? This should be taken into account when counting depth against the mindepth settings.

But what do we do for merged reads that only partially overlap? We could err on the side of counting them as double, since we already require a fairly long overlap before merging (minovlen=20).
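A minimal sketch of the two options, to make the choice concrete. This is not ipyrad code: `merged_weight`, `derep_counts`, and the `policy` argument are hypothetical names, and the input is assumed to be a list of `(sequence, overlap_length)` tuples for already-merged pairs.

```python
from collections import Counter

def merged_weight(overlap_len, merged_len, policy="double"):
    """Depth weight for one merged read pair (hypothetical helper)."""
    # Completely merged: every base was covered by both R1 and R2.
    if overlap_len >= merged_len:
        return 2
    # Partially merged: the open question in this issue.
    if policy == "double":
        return 2   # easiest: treat any merged pair as two observations
    if policy == "single":
        return 1   # most conservative: only full merges count twice
    raise ValueError(f"unknown policy: {policy}")

def derep_counts(reads, policy="double"):
    """Dereplicate (sequence, overlap_len) tuples into weighted counts."""
    counts = Counter()
    for seq, overlap_len in reads:
        counts[seq] += merged_weight(overlap_len, len(seq), policy)
    return counts

# Toy data: two fully merged copies of one read, one partial merge.
reads = [("ACGT", 4), ("ACGT", 4), ("ACGTAC", 2)]
print(derep_counts(reads, policy="single"))
print(derep_counts(reads, policy="double"))
```

Under either policy the fully merged read reaches a depth of 4 from two pairs; only the depth of the partially merged read differs (1 vs. 2).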

Not sure what to do here...

isaacovercast commented 8 years ago

The goal of mindepth is to help identify sequencing error, if I understand correctly. If the Illumina error rate is < 0.1%, I guess the question is: what are the chances there is a bad read in the non-overlapping regions, and is this worth trying to account for?
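A rough way to put a number on that question, assuming errors are independent across bases and using 0.1% as the per-base rate (an upper bound taken from the figure above). The function name and the example length of 100 non-overlapping bases are made up for illustration.

```python
def prob_any_error(n_bases, error_rate=0.001):
    """Chance of at least one error among n_bases bases that were each
    sequenced only once, assuming independent per-base errors."""
    return 1.0 - (1.0 - error_rate) ** n_bases

# Hypothetical example: 100 bases of a merged read lie outside the overlap.
print(round(prob_any_error(100), 4))
```

With these assumed numbers the chance comes out to roughly 0.095, i.e. a bit under 1 in 10 partially merged reads would carry an error somewhere in the singly covered flanks.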

I don't have a good sense of how many reads typically merge or what the distribution of overlap lengths is, so yeah, this seems like a tricky problem. It's probably "safest" to count incomplete merges as 1x, but it's probably easiest to count them as 2x.

My belief is that if we count them as 2x, the sequencing errors in the flanking regions will wash out in the mix, just by virtue of the volume of data we're processing. I'd be willing to hear other arguments, though.