Unexplained batch effect

Fedorov113 commented 3 years ago

Hello and thank you for your great work, I've been using dada2 for more than 2 years now and it's amazing.

Currently I am working with human gut 16s V3-V4 250+250 sequencing data. The dataset includes 4 sequencing run, B1 - Hiseq, B3, B4, B5 - Miseq. I filter the reads this way:

Removing nextera adapters with trimmomatic
Removing primers with cutadapt
Filtering with filter and trim from dada2, maxEE=1.2, trunLen=(227,225)

Starting profiles:

Final R1 R1 qual Final R2

The distribution of total read counts is a bit different between runs:

After filtering I run the dada2 pipeline individually for each run with nbases=401912944, randomize=TRUE, MAX_CONSIST=20, B3, B4, B5 have very similar error profiles, B1 is a bit different which is expected, considering different platforms:

I then infer variants for each run individually using true pooling strategy, but the runs yield strikingly different number of ASVs: B1 36628 B3 54613 B4 60895 B5 34236

The distribution of ASV lengths looks ok (I shifted them a bit so the diffs are better visible) .

Then I combine sequence tables using mergeSequenceTables and run a chimera removal step with pooling, which results in 8266 ASVs.

Then I filter out ASVs with only 1 read and apply prevalence filter of 3%, remove too short ASVs(they are Mitochondrion sequences, I observed ASVs very similar (by BLASTn) to Human, boar, elk and trout) and I get 1921 ASVs at this point.

From the heatmap it is obvious that there are run B3 and B5 specific variants.

I blasted some of them and they don't look like contamination. Some of these unique features are pretty abundant in their's runs. I also applied LULU ASV curation algorithm with rather loose parameters and this sequences can be merged into very similar and more abundant parents, or, I still need to test this, become parents in their own runs. I also applied tip_glom from phyloseq to ASVs, agglomerating them based on the tree produced by tree insertion made by picrust2 (reference tree is based on 20000 full length 16s rRNA gene sequences taken from full hand-curated genomes from Integrated Microbial Genomes), and with h=0.2 it helps to resolve the issue.

So my questions are:

May I observe this extra ASVs due to the sequencing depths differences?
What are your thoughts on curation algorithms like LULU (https://www.nature.com/articles/s41467-017-01312-x)? -- Is there any sense in trying to remove chimeras in each run individually, then curate result with LULU, then merge and remove chimeras once more?
What are current best practices to combat this effect?
Maybe I am missing something? I don't really know what else I can debug/try, do you have any suggestions?

Thank you!

Fedorov113 commented 3 years ago

Well, I think I have 2 ideas:

Given that the error profiles are very similar for Miseq runs I could run them together. If there is contamination we will see it as it is now.
I could use a priors feature and supply DADA with the sequences I got during this first processing.

benjjneb commented 3 years ago

Yes I clearly see the B3 and B5 specific variants in your heat map, and I think it is great that you are taking these potential batch effects seriously.

May I observe this extra ASVs due to the sequencing depths differences?

To some degree, but not something as systematic as this where these batch specific ASVs are appearing exclusively but broadly within batches.

What are your thoughts on curation algorithms like LULU (https://www.nature.com/articles/s41467-017-01312-x)? -- Is there any sense in trying to remove chimeras in each run individually, then curate result with LULU, then merge and remove chimeras once more?

While they can be useful, I don't think that will be the answer to the pattern you are seeing here.

What are current best practices to combat this effect?

There are batch correction methods out there that have been used in microbiome data, e.g ComBAT and https://journals.plos.org/ploscompbiol/article?rev=2&id=10.1371/journal.pcbi.1006102. There are more general purpose methods for dealing with batch effects in multi-dimensional data as well (e.g. RUV), but I am not an expert in this area.

Maybe I am missing something? I don't really know what else I can debug/try, do you have any suggestions?

I would not throw out the idea of contamination just yet. Could you explain more about why they "don't look like contamination"? What are the taxonomic identities (e.g. genus) of these batch-specific ASVs? Are you working in a lower biomass environment?

I could use a priors feature and supply DADA with the sequences I got during this first processing.

That's probably the thing I would try in terms of messing the the DADA2 denoising bit.

benjjneb / dada2

Unexplained batch effect #1305