benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Dual amplicon processing and filtering #1889

Closed: AnishMalde closed this issue 4 months ago

AnishMalde commented 8 months ago

Hello all. We have extracted DNA from tomato leaves and roots. The DNA was amplified using the 16S primers 515f/806r and the ITS primers gITS7/ITS4 (ITS2 region). We used PNA clamps with the 16S PCRs to prevent amplification of mitochondria (mPNA) and chloroplast (pPNA), because we are interested in profiling the microbial communities. The amplicons were indexed together and sequenced using an Illumina 500 cycle (250 PE) v2 flow cell.

I started processing the data using the dada2 pipeline in R, separately for 16S and ITS, using cutadapt to remove the respective primer sequences for each gene.

Filtering parameters for 16S:

filterAndTrim(cutFs, filtFs, cutRs, filtRs, maxN = 0, maxEE = c(2, 5), truncQ = 2, minLen = 200, maxLen = 350, rm.phix = TRUE, compress = TRUE, multithread = FALSE)

Filtering parameters for ITS:

filterAndTrim(cutFs, filtFs, cutRs, filtRs, maxN = 0, maxEE = c(2, 2), truncQ = 2, minLen = 240, maxLen = 340, rm.phix = TRUE, compress = TRUE, multithread = FALSE)

I'm using the Silva database (silva_nr99_v138.1_train_set.fa and silva_species_assignment_v138.1.fa) for bacterial taxonomy and UNITE (sh_general_release_dynamic_s_25.07.2023.fasta) for fungal taxonomy.

Issues:

1) For the 16S data, the top 8 ASVs (out of ~5,000) are assigned to Solanum lycopersicum (tomato) and account for 50% of total reads (length = 353 bp). How can I filter these out using a size filtration method? Am I getting these ASVs because the file still contains the ITS sequences?

2) For the ITS data, the top 2 ASVs are assigned to Solanum lycopersicum (tomato) and account for 60% of total reads (length = 313 bp). Same questions as above.

3) Is there a better/easier method to deal with dual amplicon processing in R using the dada2 pipeline(s)?

Please let me know if you need more info to help me with this query. Appreciate your time and effort.

Output of the sequence table, after chimera removal using the 16S workflow. As you can see, a majority of the ASVs are 253 bp in length, i.e. the expected length of the target bacterial 16S V4 amplicon. How do I get rid of the non-specific sequences?

seqtab.nochim = removeBimeraDenovo(seqtab, method="consensus", multithread=FALSE, verbose=TRUE)
Identified 973 bimeras out of 6219 input sequences.
dim(seqtab.nochim)
[1]   48 5246
sum(seqtab.nochim)/sum(seqtab)
[1] 0.96331
table(nchar(getSequences(seqtab.nochim)))

[image: output of table(nchar(getSequences(seqtab.nochim))) showing the ASV length distribution]

benjjneb commented 8 months ago

A few answers below:

0) This wasn't asked, but -- for 16S I would not recommend using cutadapt here, as there should not be any primers on your reads. I would recommend using truncLen to make all reads a constant length to improve sensitivity. See the dada2 tutorial for an example of analyzing data from the same primer set. For ITS you will need to use cutadapt since the amplicon lengths are so variable. But be careful with your minLen/maxLen parameters. Is it possible that there are legitimate ITS amplicons that are shorter than 240 nts? Your current approach is throwing all of those away.
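A minimal sketch of what those two filtering calls might look like, assuming variable names in the style of the tutorials (rawFs/filtFs etc. are placeholders here, and the truncation lengths are illustrative; choose them from your own plotQualityProfile() output):

# 16S: skip cutadapt, truncate both reads to fixed lengths chosen from the quality profiles;
# the 240/160 values below are placeholders, not a recommendation
out16S <- filterAndTrim(rawFs, filtFs, rawRs, filtRs,
                        truncLen = c(240, 160),
                        maxN = 0, maxEE = c(2, 5), truncQ = 2,
                        rm.phix = TRUE, compress = TRUE, multithread = FALSE)

# ITS: keep the cutadapt-trimmed reads, but avoid a hard minLen that discards
# legitimately short ITS2 amplicons; minLen = 50 (as in the ITS workflow) mainly
# removes very short junk reads
outITS <- filterAndTrim(cutFs, filtFs.its, cutRs, filtRs.its,
                        maxN = 0, maxEE = c(2, 2), truncQ = 2, minLen = 50,
                        rm.phix = TRUE, compress = TRUE, multithread = FALSE)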

1) See example in the "Considerations for your own data" box in the dada2 tutorial section on constructing the sequence table: https://benjjneb.github.io/dada2/tutorial.html#construct-sequence-table
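For reference, the filter in that tutorial box is an in-silico size selection on the sequence table, along these lines (the 250:256 range here is illustrative for a 2x250 V4 run, not a prescription):

seqtab.len <- seqtab.nochim[, nchar(colnames(seqtab.nochim)) %in% 250:256]  # keep only ASVs in the expected 16S V4 length range
table(nchar(getSequences(seqtab.len)))  # confirm the 353 bp off-target ASVs are gone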

You are probably getting those sequences because while PNA clamps help, they aren't perfect, and so a significant amount of mt amplification is still happening (but probably much less than without PNA clamps).

2) As above. You can also just filter out based on taxonomic assignment using base R commands to remove columns of the sequence table.
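A rough sketch of that taxonomy-based filtering, assuming taxa is the matrix returned by assignTaxonomy(seqtab.nochim, ...) so its rows line up with the sequence-table columns; in SILVA v138 chloroplast hits appear at the Order level and mitochondria at the Family level, but adjust the ranks and labels to whatever your assignments actually show:

# flag ASVs assigned to chloroplast, mitochondria, or non-bacterial kingdoms
offtarget <- taxa[, "Order"] %in% "Chloroplast" |
             taxa[, "Family"] %in% "Mitochondria" |
             taxa[, "Kingdom"] %in% c("Eukaryota", NA)
seqtab.clean <- seqtab.nochim[, !offtarget]
taxa.clean <- taxa[!offtarget, ]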

3) From what you've posted it seems like everything is working pretty well actually.

AnishMalde commented 8 months ago

@benjjneb Thank you very much for your prompt reply and suggestions. I should have clarified that each sample file contains mixed amplicon data, i.e. both 16S and ITS. A colleague recommended processing each file twice, once with the 16S pipeline and once with the ITS pipeline (as per your dada2 tutorials).

I should also have mentioned that we used the Zymo positive control (DNA from 10 bacterial and 2 fungal/yeast strains). When I processed the samples using the 16S pipeline and classified against the SILVA database, I still got ASVs associated with the fungal strains in my Zymo sample!

Further, the PNA clamps have definitely helped a lot: no ASVs associated with chloroplasts and only a handful for mitochondria (accounting for a minority of reads). After filtering out chloroplast and mitochondria in silico, the remaining 'Solanum' ASVs are neither chloroplast nor mitochondrial.

I guess the most confusing part for me is how to effectively separate my 16S and ITS sequences from the same sample file. Thanks again for your time.

benjjneb commented 8 months ago

I guess the most confusing part for me is how to effectively separate my 16S and ITS sequences from the same sample file.

I know this isn't helpful for dealing with this data, but the most effective answer here is: Don't use the same barcodes for different amplicons. We recently went through this process with another group we are collaborating with (they actually put 5 amplicons on the same sample barcodes) -- while you can mostly fix it bioinformatically, it is not worth whatever minor cost saving doubling up barcodes gets you, and using unique barcodes for every sample/amplicon is more effective than anything you could achieve bioinformatically.

As for what to do here: Are your primer sequences at the start of the reads? If they are, then cutadapt (or other methods that detect the primer sequences and bin reads accordingly) works well. If they aren't, it's harder, but it can be done via classification methods.
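If it helps, a rough sketch of primer-based binning done directly in R (running cutadapt per primer with --discard-untrimmed is probably the easier route). The 515F and gITS7 sequences below are the commonly published ones, so confirm them against your own protocol; the file names are placeholders, and this assumes all reads are at least 25 nt long:

library(ShortRead)  # also loads Biostrings

fq <- readFastq("sample1_R1.fastq.gz")            # one mixed forward-read file (placeholder name)
p16S <- DNAString("GTGYCAGCMGCCGCGGTAA")          # 515F (commonly published sequence)
pITS <- DNAString("GTGARTCATCGARTCTTTG")          # gITS7 (commonly published sequence)
starts <- narrow(sread(fq), start = 1, end = 25)  # only look at the first 25 bases of each read
hit16S <- vcountPattern(p16S, starts, max.mismatch = 1, fixed = FALSE) > 0
hitITS <- vcountPattern(pITS, starts, max.mismatch = 1, fixed = FALSE) > 0

writeFastq(fq[hit16S & !hitITS], "sample1_16S_R1.fastq.gz")
writeFastq(fq[hitITS & !hit16S], "sample1_ITS_R1.fastq.gz")
# apply the same logical vectors to the matching R2 file so read pairs stay in sync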

AnishMalde commented 8 months ago

Thank you Ben. You're a legend! I'll check where the primer sequences are - i.e. whether they are at the start or not (I'm a novice at this, so still figuring things out). I guess we'll persevere with these separate pipelines and split them using classification methods (at the risk of losing reads, but hopefully maintaining some decent level of accuracy). Truly appreciate your time and help.

cjfields commented 7 months ago

Just to add: we have the very same issue with another project, which is essentially bleed-over from barcode hopping (these were COX1 amplicons). The only option we had was to try to estimate the expected frequency of hopping, so that we could perhaps set a lower confidence threshold for the data.