benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
471 stars 143 forks source link

nonchim percentage #1921

Closed murheisa closed 5 months ago

murheisa commented 7 months ago

Hello, I'm working with some illumina 16s V3V4 sequences, and I'm not sure how can I improve the non chimeric sequences. I have used the next for filter and trim

out<-filterAndTrim(fnFs,filtFs,fnRs,filtRs, truncLen=c(270,210),maxN=0,trimLeft=c(10,10), maxEE=c(2,2), truncQ=2, rm.phix=TRUE,compress=TRUE, multithread=TRUE)

In some samples, more than 50% of the sequences where chimeras, what can I do?

image

I have also tried truncLen (290, 190) maxEE (2,4), some samples whit less chimeras, but in some others more chimeras

image

and also truncLen (290, 180) maxEE= c(2,2) which is even worst

image

Any suggestions?

benjjneb commented 7 months ago

You need to keep at least 480 total nucleotides to get enough overlap between R1 and R2 for proper merging of this amplicon. The first two truncLen setting do that, but not the last one, which is why there is a huge drop at the merging step in that third table.

As for chimera, as noted in the tutorial this is almost always because there is unremoved primer sequence. The "Illumina" V3V4 16S protocol sequences the forward and reverse primers at the start of R1/R2 respectively. They need to be removed. The forward primer is 17 nts and the reverse primer is 21 nts (assuming this is the protocol you are using). So, trimLeft=c(17,21) will remove the full primers, whereas trimLeft=c(10,10) did not.

murheisa commented 7 months ago

You need to keep at least 480 total nucleotides to get enough overlap between R1 and R2 for proper merging of this amplicon. The first two truncLen setting do that, but not the last one, which is why there is a huge drop at the merging step in that third table.

As for chimera, as noted in the tutorial this is almost always because there is unremoved primer sequence. The "Illumina" V3V4 16S protocol sequences the forward and reverse primers at the start of R1/R2 respectively. They need to be removed. The forward primer is 17 nts and the reverse primer is 21 nts (assuming this is the protocol you are using). So, trimLeft=c(17,21) will remove the full primers, whereas trimLeft=c(10,10) did not.

Hello, thank you for your answer, I´m currently trying to use your suggestion, but it's getting very difficult to do so, because R is constantly crashing with a segfault error every time multithread appears in the pipeline. In some parts I´ve used multithread FALSE and worked, but in some other cases TRUE or FALSE is the same. I use a MacBook Air m1 with ventura 13.3, R version 4.3.3, dada2 1.28 and phyloseq 1.44, is there something I can do to avoid this? I´m stuck in assign taxonomy and removeBimera had to be done with the method per-sample, because consensus crashed too, even with multithread TRUE and FALSE.

benjjneb commented 7 months ago

Can you be more specific about where things are crashing? At the same step every time? If so, what step and what is the exact command?

murheisa commented 6 months ago

Hello again, I'm still struggling with these samples. I´ve tried this:

out<-filterAndTrim(fnFs,filtFs,fnRs,filtRs, truncLen=c(270,210),maxN=0,trimLeft=c(17,21), maxEE=c(2,2), truncQ=2, rm.phix=TRUE,compress=TRUE, multithread=TRUE)

But the chimera percentage still very high

             input filtered denoisedF denoisedR merged nonchim

L01 137801 125627 118146 121853 52716 42240 L40 177618 161932 157310 159551 19444 16767 M138 326862 297453 297296 297332 53815 42885 M139 131766 124244 123934 124104 15866 3492 M2Rx 228070 202583 201391 201288 96602 18316 M11 279430 258187 200398 227768 83872 62564

An I noticed that the distribution for the sequence length is like this

hist

And the rarefaction plot looks like this rarf.pdf

Considering that those small sequences, are used for the ASVs, I thought about removing those sequences shorter than the V3V4 length seqtab2 <- seqtab[,nchar(colnames(seqtab)) %in% 400:480]

But the rarefaction curve looks pretty much the same

rarf.pdf

Considerening the previous example, I decided to also try the 18S procedure for the primer removal, thinking that primers weren't totally removed in the previous, but it didn't improve

filterAndTrim(cutFs, filtFs, cutRs, filtRs, maxN = 0, maxEE = c(2, 2), truncQ = 2,minLen = 50, rm.phix = TRUE, compress = TRUE, multithread = TRUE)

             Forward Complement Reverse RevComp

FWD.ForwardReads 27082 0 0 0 FWD.ReverseReads 0 0 0 1 REV.ForwardReads 6 0 0 187 REV.ReverseReads 130766 0 0 0

             Forward Complement Reverse RevComp

FWD.ForwardReads 0 0 0 0 FWD.ReverseReads 0 0 0 0 REV.ForwardReads 6 0 0 0 REV.ReverseReads 0 0 0 0

input filtered denoisedF denoisedR merged nonchim

L01 137766 124396 112251 119705 92495 59240 L40 177285 159617 151081 156676 133345 82650 M138 326784 285565 285159 285247 282598 203921 M139 131678 115197 114659 114941 113778 84513 M2R 227893 142105 139900 139123 129774 106867 M11 277261 254071 157838 219487 89124 69075

I don't know how to proceed in order to decrease the chimeras, as well as the huge amount of ASVs in one sample. These are marine mammal skin microbiome, which I have compared with sediment and sea water samples, and that one sample has too many ASVs , which doesn't look biological and don't know how to filter that.

Thank you in advance for your help

benjjneb commented 6 months ago

Can you confirm visually, by inspecting an example fastq file, that the forward primer always appears at the start of the forward reads? Or is there any variability in where the forward primer shows up in the reads? (e.g. incomplete forward primers, or forward primers that sometimes start after the first nucleotide position)

murheisa commented 6 months ago

This is the forward sequence of one of the samples:

Captura de pantalla 2024-05-01 a la(s) 22 34 05

And reverse of the same sample

Captura de pantalla 2024-05-01 a la(s) 22 35 37

I think the primers look good

murheisa commented 6 months ago

Hello again, I tried tightening the expected errors parameter (maxEE),

truncLen=c(270,210),maxN=0,trimLeft=c(17,21), maxEE=c(1,1), truncQ=2, rm.phix=TRUE,compress=TRUE, multithread=TRUE)

I thought It would improve the amount of non chimeric sequences that are formed, but its very similar to the maxEE(2,2 )

image

and the rarefaction curve in one of the samples is still too high, in which the ASVs are barely identified to something else than phylum. rarf.pdf

I think that one sample with too many asvs, has so many chimeras that can´t be removed by filter and trim, so they're entering to the error modeling, and forming ASVs that can´t be identified by silva. I would like to know if I can filter this in the next part of the pipeline, or what do you recommend, this one sample has 30% of non chimeric, while some other samples have 50%, and the rest nearly 90%. Should I process the samples with too low non chimeric sequences on a separate pipeline from the ones with 90% and then merge the phyloseq object?

benjjneb commented 6 months ago

Are all of these samples being amplified and sequenced using the same methods? Or are some of the samples being processed in different ways (e.g. different PCR protocol)?

In general, DADA2 should be run on batches of samples that were all processed in the same way especially at the PCR/sequencing stage.