benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
466 stars 142 forks source link

Problem with taxa assignation #1971

Open martina-nasuelli opened 3 months ago

martina-nasuelli commented 3 months ago

Hello Ben and team, I am having trouble with the taxa assignation using dada2 and the MIDORI reference database. I've sequenced a Metazoan COI library based on Leray et al. 2013 primers targeting an amplicon of 313 bp. Based on the quality profile plots (attached), I've set these filters and parameters to get the taxa assignation: out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft = c(0,26), truncLen = c(240,180), truncQ = 5, maxN = 0, maxEE = c(2,2), rm.phix=TRUE, compress=TRUE, multithread=TRUE)

errF <- learnErrors(filtFs, multithread=TRUE) errR <- learnErrors(filtRs, multithread=TRUE)

mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose=TRUE, minOverlap = 12, returnRejects = TRUE)

seqtabfilt.nochim <- removeBimeraDenovo(seqtabfilt, method="consensus", allowOneOff = TRUE, multithread=TRUE, verbose=TRUE)

head(track) input filtered denoisedF denoisedR merged nonchim R1.0A 77811 53902 51110 52772 50672 33272 R1.0C 82972 58557 55671 57493 55300 40178 R1.0E 52498 45084 42800 44256 42565 24863 R1.0H 58371 43242 40987 42617 40833 24986 R1.0L 39734 33325 31069 32586 30850 21042 R1.1A 55957 41674 38200 39987 37869 26640

sum(seqtabfilt.nochim)/sum(seqtabfilt) [1] 0.7844993

taxa.MIDORICOI <- assignTaxonomy(seqtabfilt.nochim, "../MIDORI2_UNIQ_NUC_GB259_CO1_DADA2.fasta.gz", tryRC = TRUE, taxLevels = c("Phylum","Class","Order","Family","Genus","Species"), minBoot = 50, multithread=TRUE)

After these steps, the taxa assignation I've got is mostly NA or no assignation lower than Order. In the attached files you can find the quality profile plots of the R1 and R2 fastqs and the taxa and count csv I've obtained. What I am missing here? Is there something I have done wrong in the library preparation or I am analyzing the data incorrectly? COI_analysis.zip

Thank you in advance,

Martina Nasuelli

benjjneb commented 3 months ago

I would be a bit concerned about the fraction of reads you are losing in the chimera removal step. I'm not familiar with these primers and library preparation -- is it true that the reverse primer is 26nts long and is sequenced, but that the forward primer is not sequenced? That is what it looks like based on your filterAndTrim(..., trimLeft=c(0,26)) choice.

In addition, why have you chosen mergePairs(..., returnRejects=TRUE)? This is really only intended for troubleshooting, not as part of a normal workflow. removeBimeraDenovo(..., allowOneOff=TRUE) is also not a recommended option anymore.

I am not an expert on CO1 metabarcoding, but I would start by BLAST-ing some of your top ASVs against a broad database like nt, and comparing the best hits to what you are getting from taxonomic assignment against MIDORI. Do they mostly match up? Or is there an obvious mismatch even among the abundant ASVs?

martina-nasuelli commented 3 months ago

Hello Benjamin, I've rerun the dada2 script based on your suggestions, adding the Trimleft option for the forward primer (26bp). I've obtained a higher percentage of non-chimeras (0.9928853), but still got no taxonomy for my ASV. Yes, I've blasted the generated sequences. The primers I used are meant to amplify the CO1 gene for Metazoan; most of the best matches I obtain from BLAST, in terms of query over and similarity, are Fungi. I am referring to 98% query cover and 90% similarity, which is very low. Some Metazoan are returned with 87-88% of similarity.

Thank you again

Martina Nasuelli, PhD student Università del Piemonte Orientale Dipartimento per lo Sviluppo Sostenibile e la Transizione Ecologica (DiSSTE) Piazza Sant'Eusebio 5, 13100 - Vercelli, Italy

Google Scholar https://scholar.google.com/citations?hl=it&view_op=list_works&gmla=AJsN-F7yLKtw4EQIsjnfDMRSWzFmyF4E48O6LKKYsKLNV08MRAaezbKyZIVeZwdEvgWSd4-IHaEipDdykvftRsRxTOoB9XBWScIzsMYs5xaa-k21Fme9Sotb2Efe7Dt6_RfK1K9ey5WisciHR7W4GvHjF8Ct-gPvbA&user=NsdvzOIAAAAJ ResearchGate https://www.researchgate.net/profile/Martina-Nasuelli-2 Twitter https://twitter.com/MNasuelli

Il giorno gio 13 giu 2024 alle ore 22:15 Benjamin Callahan < @.***> ha scritto:

I would be a bit concerned about the fraction of reads you are losing in the chimera removal step. I'm not familiar with these primers and library preparation -- is it true that the reverse primer is 26nts long and is sequenced, but that the forward primer is not sequenced? That is what it looks like based on your filterAndTrim(..., trimLeft=c(0,26)) choice.

In addition, why have you chosen mergePairs(..., returnRejects=TRUE)? This is really only intended for troubleshooting, not as part of a normal workflow. removeBimeraDenovo(..., allowOneOff=TRUE) is also not a recommended option anymore.

I am not an expert on CO1 metabarcoding, but I would start by BLAST-ing some of your top ASVs against a broad database like nt, and comparing the best hits to what you are getting from taxonomic assignment against MIDORI. Do they mostly match up? Or is there an obvious mismatch even among the abundant ASVs?

— Reply to this email directly, view it on GitHub https://github.com/benjjneb/dada2/issues/1971#issuecomment-2166686311, or unsubscribe https://github.com/notifications/unsubscribe-auth/BJEVCRCDKICUZHZTR3AO5CLZHH4UHAVCNFSM6AAAAABJHZL2QWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNRWGY4DMMZRGE . You are receiving this because you authored the thread.Message ID: @.***>

benjjneb commented 3 months ago

The primers I used are meant to amplify the CO1 gene for Metazoan; most of the best matches I obtain from BLAST, in terms of query over and similarity, are Fungi. I am referring to 98% query cover and 90% similarity, which is very low. Some Metazoan are returned with 87-88% of similarity.

One possibility is that these primers are not performing as expected, and so you are getting lots of off-target amplification. That would be my first guess based on what you describe here. It may also be worth looking at the alignments for your top hits --- are the mismatches clustered to a specific part of the reads? If so, it could be some technical issue that is solvable. If not, it is consistent with the non-specific priming idea.