Single-end/Paired-end results discrepancies

OLC-Bioinformatics / ConFindr

Intra-species bacterial contamination detection

https://olc-bioinformatics.github.io/ConFindr/

MIT License


Single-end/Paired-end results discrepancies #22

Closed fredericQC closed 1 year ago

fredericQC commented 3 years ago

Hello there, We sequenced 643 bacterial samples with PE250. I used ConFindr to look for cross-contamination between them, and about half of them (319) were flagged as contaminated (with a mean estimated contamination level of 10%). To understand why, one of my tests was to run ConFindr on each read file individually (R1 and R2 separately instead of both together). This gave me 556 clean samples and 87 with R1 and/or R2 contaminated (only 27 with both reads contaminated). The samples have a mean coverage of 50X.

I ran ConFindr as follows: confindr.py -i $inDir -o $outDir -d $ConFindrDir -t 20 -q 30 -bf 0.05

Would you know why using "single-end" reads gives such different results?

Thanks!

Fred

fredericQC commented 3 years ago

Looking at the *contamination.csv files, most contamination calls were made using only 2 reads (the minimum threshold I think), so it might just be noise (hopefully).

adamkoziol commented 3 years ago

Hi Fred,

That is a very large discrepancy. I believe you're correct that the calls were due to noise. I'm currently working on updates to ConFindr that integrate read overlap, coverage, and quality scores to determine appropriate cutoffs, which should (hopefully) address this particular issue.

In the meantime, are you willing to downsample a few of the reads to confirm that it is, in fact, noise causing the high contamination results?

Best, A
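For anyone wanting to try the downsampling check suggested above, here is a minimal sketch in Python. It is not part of ConFindr: the function names are hypothetical, and it assumes standard 4-line FASTQ records with R1 and R2 in the same order. It keeps a fixed number of read pairs chosen uniformly at random via reservoir sampling, so the mates stay in sync.

```python
import gzip
import random
from itertools import islice, zip_longest


def open_text(path, mode="rt"):
    """Open a plain or gzip-compressed file in text mode."""
    return gzip.open(path, mode) if str(path).endswith(".gz") else open(path, mode)


def fastq_records(handle):
    """Yield FASTQ records as 4-line tuples (assumes unwrapped 4-line records)."""
    while True:
        # Normalize line endings so records can be written back in any order.
        record = tuple(line.rstrip("\n") + "\n" for line in islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def downsample_pairs(r1_in, r2_in, r1_out, r2_out, n_pairs, seed=0):
    """Write n_pairs randomly chosen read pairs; return how many were written."""
    rng = random.Random(seed)
    reservoir = []
    with open_text(r1_in) as f1, open_text(r2_in) as f2:
        pairs = zip_longest(fastq_records(f1), fastq_records(f2))
        for i, (rec1, rec2) in enumerate(pairs):
            if rec1 is None or rec2 is None:
                raise ValueError("R1 and R2 have different numbers of reads")
            if i < n_pairs:
                reservoir.append((rec1, rec2))
            else:
                # Reservoir sampling: pair i replaces a kept pair
                # with probability n_pairs / (i + 1).
                j = rng.randint(0, i)
                if j < n_pairs:
                    reservoir[j] = (rec1, rec2)
    with open_text(r1_out, "wt") as o1, open_text(r2_out, "wt") as o2:
        for rec1, rec2 in reservoir:
            o1.writelines(rec1)
            o2.writelines(rec2)
    return len(reservoir)
```

Running ConFindr on the downsampled pairs and on the full set, then comparing the calls, would show whether the low-support calls persist at reduced depth. Note that the whole sample is held in memory, so `n_pairs` should stay modest.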

fredericQC commented 3 years ago

Hi Adam, Thanks for the support and for this software. I will think about it, but I'm not sure it's worth it. I say this because I just tried using both reads (PE) with '-b 3', so raising the limit to 3 instead of 2, and now only 9 samples are flagged as contaminated (instead of 319). Also, when I look at the *contamination.csv files for those 9, they have higher numbers of contaminant reads (>4), more loci affected, and there is overlap between the R1 and R2 results (when run individually).
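The overlap check described above (paired-end calls versus the separate R1 and R2 runs) can be sketched in Python as follows. This is only an illustration: the helper names are hypothetical, and the report column names `Sample` and `ContamStatus` and the positive value `True` are assumptions that should be checked against the actual report file from each run.

```python
import csv


def contaminated_samples(report_path, sample_col="Sample",
                         status_col="ContamStatus", positive="True"):
    """Return the set of samples flagged as contaminated in one report.

    Column names and the positive value are assumptions; adjust to your file.
    """
    with open(report_path, newline="") as handle:
        return {row[sample_col] for row in csv.DictReader(handle)
                if row[status_col].strip() == positive}


def compare_runs(paired, r1_only, r2_only):
    """Summarize how contamination calls from the three runs overlap."""
    return {
        "paired": len(paired),
        "r1_or_r2": len(r1_only | r2_only),
        "r1_and_r2": len(r1_only & r2_only),
        "paired_supported_by_both": len(paired & r1_only & r2_only),
        "paired_not_seen_single_end": len(paired - (r1_only | r2_only)),
    }
```

Calling `contaminated_samples` once per run (paired, R1 only, R2 only) and passing the three sets to `compare_runs` gives a quick count of which paired-end calls are also supported by both single-end runs, which are the calls least likely to be noise.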