benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Difference between samples at merging step #1737

Closed: QGaelle closed this issue 4 months ago

QGaelle commented 1 year ago

Hi @benjjneb, hi everyone,

I am currently looking for the best filtering parameters for my data and have been running some tests on a few samples. We sent samples for sequencing in several batches, 6 runs in total. The idea is to process each run separately and then merge the runs before the chimera removal step.
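For reference, a minimal sketch of the per-run strategy I have in mind (seqtab.run1 ... seqtab.run6 are placeholder names for the per-run sequence tables produced by makeSequenceTable()):

library(dada2)

# Combine the hypothetical per-run sequence tables into one table
seqtab.all <- mergeSequenceTables(seqtab.run1, seqtab.run2, seqtab.run3,
                                  seqtab.run4, seqtab.run5, seqtab.run6)

# Remove chimeras once, on the merged table
seqtab.nochim <- removeBimeraDenovo(seqtab.all, method="consensus",
                                    multithread=TRUE, verbose=TRUE)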

I work with coral microbiome data and have run a test on four samples from the same run.

Primers used: 16S bacterial primers (Parada et al. 2016) - 411 bp (universal V4-V5)
16Sbac-515F: GTGYCAGCMGCCGCGGTAA
16Sbac-926R: CCGYCAATTYMTTTRAGTTT

Parameters I chose:

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft=c(10,5), truncLen=c(240,230), maxN=0, maxEE=c(2,5), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE)

I used trimLeft to remove the remaining bases of the primers. I first had to remove the heterogeneity spacers using the Trimming_HS_Metagenomics script, which worked great, as discussed in #1651.
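As a sanity check that no primer bases remain after trimming, I count reads that still contain a primer sequence, following the primer-counting idea from the dada2 ITS workflow (just a sketch, with filtFs[1]/filtRs[1] standing in for one pair of filtered files):

library(ShortRead)
library(Biostrings)

FWD <- "GTGYCAGCMGCCGCGGTAA"   # 16Sbac-515F
REV <- "CCGYCAATTYMTTTRAGTTT"  # 16Sbac-926R

# Number of reads in a fastq file that still contain the given primer
primerHits <- function(primer, fn) {
  nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
  sum(nhits > 0)
}

primerHits(FWD, filtFs[1])  # should be ~0 after trimLeft
primerHits(REV, filtRs[1])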

I have two questions about the output of this test. I would like to make sure I have optimized the parameters before I start processing the whole dataset.

Output of the test:

           input filtered denoisedF denoisedR merged
ERM-10     16611     9013      8966      8938   8404
PETS-10    18188     8493      8457      8439   3525
Sed-ERM-A  21558     6052      5515      5486   4023
Sw-ETS2-3u 31449    10272     10245     10189   9824
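(For reference, this kind of tracking table can be assembled with the usual tutorial-style snippet; dadaFs, dadaRs and mergers are placeholders for the per-sample denoising and merging results:)

getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged")
rownames(track) <- sample.names
track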

Q1: I know you’ve said that losing up to ~25% of reads to filtering and trimming is "acceptable" (#1133), but what about when I don’t have many thousands of reads left, like here with fewer than 10,000 reads per sample? Is it still ok?

Q2: I have made sure that my reads still overlap after truncation, and yet there seems to be quite a high percentage of loss at the merging step for the second sample. What could explain the high % of loss for this sample, and what could explain the differences between samples at this step?

I have attached the full output document in case it could be useful: output.txt

Thanks for your help, Gaëlle

benjjneb commented 1 year ago

Q1: I know you’ve said that losing up to ~25% of reads to filtering and trimming is "acceptable" (https://github.com/benjjneb/dada2/issues/1133), but what about when I don’t have many thousands of reads left, like here with fewer than 10,000 reads per sample? Is it still ok?

This always depends on the questions you are trying to answer, but yes it's probably fine. Lower library sizes just mean less sensitivity to relatively rare members of the community.

Q2: I have made sure that my reads still overlap after truncation, and yet there seems to be quite a high percentage of loss at the merging step for the second sample. What could explain the high % of loss for this sample, and what could explain the differences between samples at this step?

In my experience, the most common reason for large variation between samples at this step is some sort of off-target amplification, or unexpected amplicon sizes, in some of the samples. This is something you could investigate in your test samples: mergePairs(..., returnRejects=TRUE) will return a data.frame that also includes the pairs of ASVs that failed to merge. In the samples with low merge performance, is there something different about those ASVs (considering perhaps just the forward ASVs) compared to the ASVs that do merge? I usually poke at that by BLAST-ing some of the high-abundance merging and non-merging pairs.
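A minimal sketch of that inspection, assuming ddF/ddR are the dada() results and filtF/filtR the filtered fastqs for one low-merging sample (object names are just placeholders):

mm <- mergePairs(ddF, filtF, ddR, filtR, returnRejects=TRUE, verbose=TRUE)

# Pairs that failed to merge (accept == FALSE), most abundant first
rejects <- mm[!mm$accept, ]
rejects <- rejects[order(rejects$abundance, decreasing=TRUE), ]
head(rejects[, c("abundance", "nmatch", "nmismatch", "nindel")])

# Forward ASV sequences of the top non-merging pairs, e.g. to BLAST
getSequences(ddF)[head(rejects$forward)]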