double de-replicate count for completely merged reads

dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets

GNU General Public License v3.0

The goal of mindepth is to help identify sequencing error, if I understand correctly. If the Illumina error rate is < 0.1%, I guess the question is: what are the chances there is a bad read in the non-overlapping regions, and is this worth trying to account for?
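To put a rough number on that question, here is a back-of-the-envelope sketch. It assumes errors are independent across bases and uses 0.1% as the per-base rate (the upper bound mentioned above). The function name and the 100 bp flank length are made up for illustration and are not anything from ipyrad.

```python
def prob_any_error(n_bases, per_base_error=0.001):
    """Chance of at least one miscalled base among n_bases bases.

    Assumes independent errors at a constant per-base rate, which is
    a simplification (real error rates rise toward the 3' end).
    """
    return 1.0 - (1.0 - per_base_error) ** n_bases


# Hypothetical example: 100 bases total that are covered by only one
# read of the pair (the non-overlapping flanks).
flank_bases = 100
print(f"P(>=1 error in {flank_bases} bp) = {prob_any_error(flank_bases):.3f}")
# prints roughly 0.095
```

So under these assumptions a bit under one in ten partially merged reads would carry an error somewhere in 100 bp of flank, and the chance scales roughly linearly with flank length while it stays small.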

I don't have a good sense of how many reads typically merge or what the distribution of overlap lengths is, so yeah, this seems like a tricky problem. It's probably "safest" to count incomplete merges as 1x, but it's probably easiest to count them as 2x.

My belief is that if we count them as 2x then the sequencing errors in the flanking regions will wash out in the mix, just by virtue of the volume of data we're processing. I'd be willing to hear other arguments though.
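For concreteness, the two options (incomplete merges as 1x vs. 2x) could be sketched like this. This is not ipyrad's actual dereplication code; `read_weight`, `dereplicate`, and the `(sequence, overlap_len)` input format are all hypothetical, and "completely merged" is assumed here to mean the overlap spans the whole merged sequence.

```python
from collections import Counter


def read_weight(merged_len, overlap_len, partial_weight=2):
    """Depth contribution of one merged read pair (hypothetical rule).

    A complete merge (overlap covers the whole merged sequence) counts
    as 2x, since every base was sequenced twice. An incomplete merge
    counts as partial_weight: 1 is the "safest" option, 2 the "easiest".
    """
    if overlap_len >= merged_len:
        return 2
    return partial_weight


def dereplicate(merged_reads, partial_weight=2):
    """Collapse identical sequences, summing their depth contributions.

    merged_reads: iterable of (sequence, overlap_len) tuples.
    """
    counts = Counter()
    for seq, overlap_len in merged_reads:
        counts[seq] += read_weight(len(seq), overlap_len, partial_weight)
    return counts


# Two complete merges of the same fragment plus one incomplete merge.
reads = [("ACGT", 4), ("ACGT", 4), ("ACGTAA", 3)]
print(dereplicate(reads, partial_weight=1))  # incomplete merge adds 1
print(dereplicate(reads, partial_weight=2))  # incomplete merge adds 2
```

With a single switch like `partial_weight`, it would be easy to run both policies on the same data and compare how many clusters pass mindepth under each.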

double de-replicate count for completely merged reads #113