Open averagehat opened 9 years ago
We've already tried this. It massacres, or has massacred, our data in the past, which is why we didn't use it. You are welcome to run some benchmarking to see how it deals with our MiSeq datasets. @necrolyte2 and I have also considered removing primer sequences via primer files, which would reduce primer-induced mutational changes that could find their way into variant analysis at low thresholds (<5%).
Feel free to try as much as you'd like with optimizing our datasets. However, this all needs to be optional/configurable, not the default for now. Know that these additions become less useful the more data they massacre, so we have to find that balance. There is a limit to the data we can pull out of MiSeq and a limit to how far the wet lab can 'clean' things for the time being, so we'll be responsible for balancing good-quality datasets with still obtaining enough data for analysis.
Thanks!
@necrolyte2, do you recall how you were removing duplicates? Was it with samtools?
I've removed dups with rmdup in samtools. Not sure if @necrolyte2 has played with this. I've also played with Picard's tools for removing dups and two other programs whose names I'm blanking on at the moment; I will look. That's how I know it slaughtered our data, but I also didn't run it on high-depth MiSeq. I ran it on lower-depth MiSeq and 454. USAMRIID's nGen pipeline also removed dups. It is proprietary, so I don't know what tool is used, but it slaughtered our data too.
Probably samtools, but I honestly don't remember what we did as it was a long time ago. I'd just take a dataset that you run through the pipeline without any filtering, compare its VCF mutations to the same dataset run through with samtools rmdup, and then also compare with Picard.
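A rough sketch of that comparison, in case it helps anyone running the benchmark. It assumes the pipeline output can be read as standard tab-separated VCF records (CHROM, POS, ID, REF, ALT, ...); the function names and the file names in the usage note are made up for illustration and are not part of ngs_mapper.

```python
def parse_variants(lines):
    """Collect (chrom, pos, ref, alt) tuples from an iterable of VCF lines.

    Assumes standard VCF column order; header lines starting with '#'
    and blank lines are skipped, as are records with no ALT allele ('.').
    """
    variants = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):
            if alt != ".":
                variants.add((chrom, pos, ref, alt))
    return variants


def load_variants(vcf_path):
    """Read one VCF file from disk and return its variant set."""
    with open(vcf_path) as handle:
        return parse_variants(handle)


def compare_variants(baseline, filtered):
    """Split calls into shared, lost after filtering, and gained after filtering."""
    return {
        "shared": baseline & filtered,
        "lost": baseline - filtered,
        "gained": filtered - baseline,
    }
```

For example, `compare_variants(load_variants("unfiltered.vcf"), load_variants("rmdup.vcf"))` (hypothetical file names) would list which calls disappear once duplicates are removed. Running it a second time against the Picard output gives the three-way picture.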
I think GATK includes a utility to mark (not remove) PCR duplicates; that might be a good solution for now.
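If we go the marking route, the reads stay in the file and are only tagged, so it is easy to check how much data would be affected before anything is thrown away. In the SAM format, duplicates are flagged with bit 0x400 in the FLAG column. Below is a minimal sketch that counts them from SAM text (for instance the output of `samtools view`); the function name is hypothetical.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit for "PCR or optical duplicate"


def count_duplicates(sam_lines):
    """Return (total_records, duplicate_records) for an iterable of SAM lines.

    Header lines starting with '@' and blank lines are ignored.
    """
    total = 0
    duplicates = 0
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        total += 1
        if flag & DUPLICATE_FLAG:
            duplicates += 1
    return total, duplicates
```

Passing `sys.stdin` to `count_duplicates` while piping in `samtools view` output would give the fraction of reads at stake for a given sample.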
@mmelendrez, @InaMBerry suggested we try using rmdup with current datasets and see how that works with ngs_mapper. When I have time I will work on adding it as an optional step. Let me know if you have any thoughts in the meantime.
Sure, that's fine. Assign me an issue when more testing is needed/ready to be done.
pinging @figueroakl
@mmelendrez samtools (as well as some other tools like Picard) includes facilities for removing duplicate reads (e.g. `samtools rmdup`). This can help prevent PCR duplicates from masquerading as high-depth SNPs. Does/should the pipeline handle this?
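To make the idea concrete, here is a simplified illustration of position-based duplicate removal. It is not what `samtools rmdup` or Picard actually implement (they work on read pairs and consider clipping); it only assumes that reads sharing a reference, start position, and strand are duplicates, and it keeps the one with the highest mapping quality. The function name is made up.

```python
def dedup_by_position(sam_lines):
    """Keep one read per (reference, position, strand), preferring higher MAPQ.

    Simplification: uses the leftmost mapped position from the POS column
    for both strands and ignores mate information and soft clipping.
    """
    best = {}
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:  # unmapped reads have no position to compare
            continue
        strand = "-" if flag & 0x10 else "+"
        key = (fields[2], int(fields[3]), strand)
        mapq = int(fields[4])
        # On a MAPQ tie the first read seen is kept.
        if key not in best or mapq > best[key][0]:
            best[key] = (mapq, line)
    return [line for _, line in best.values()]
```

Note that under this kind of rule, every start position contributes at most one read per strand, so depth is capped wherever many reads begin at the same coordinate. That may be worth keeping in mind when benchmarking on our datasets.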