benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Quality profile plot reference for filterAndTrim truncLen values #1936

Closed · luigallucci closed 6 months ago

luigallucci commented 6 months ago

Hi!

I'm working on 16S sequences from Illumina 2x300, targeting the V3-V4 region (341F-785R)... Forward:

[screenshot: forward read quality profile]

Reverse:

[screenshot: reverse read quality profile]

These are the quality profiles after cutadapt.
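For reference, plots like these come from dada2's plotQualityProfile; a minimal sketch, assuming cutFs and cutRs are the cutadapt-trimmed fastq files used in the filterAndTrim call below:

library(dada2)
# Inspect per-cycle quality for the first couple of samples
plotQualityProfile(cutFs[1:2])  # forward reads
plotQualityProfile(cutRs[1:2])  # reverse reads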

We were trying to target a length of 443+25 nt as the reference for the truncLen values, with the following settings:

out <- filterAndTrim(cutFs, filtpathF, cutRs, filtpathR,
    truncLen = c(240, 230),  # truncate forward/reverse reads at 240/230 nt
    maxEE = c(2, 3),         # max expected errors per forward/reverse read
    maxN = 0,                # discard reads containing any Ns
    # truncQ = 2,
    rm.phix = TRUE,          # remove phiX spike-in reads
    compress = TRUE,
    verbose = TRUE,
    multithread = 20
)

but the best retention after this step is 62% of reads... do you have any suggestions on how to choose the best values?
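A minimal sketch for checking per-sample retention from the matrix that filterAndTrim returns (it has columns reads.in and reads.out):

# Fraction of reads surviving the filter, per sample
retained <- out[, "reads.out"] / out[, "reads.in"]
summary(retained)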

luigallucci commented 6 months ago

I forgot to add the primers:

[image: primer sequences]

benjjneb commented 6 months ago

Maximally optimizing the number of reads passing filtering is not necessary. What you want is something "good enough" to keep a reasonable fraction of the reads and preserve overlap for later merging. You are probably fine with what you have.

If you want to try to bump the numbers up a bit: after primer removal the expected length of this amplicon is only ~400-420 nts, so you could drop the total truncation length by 20 and still be fine for merging (although double-check that this holds up in the read tracking at the end of the dada2 workflow).
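As a back-of-the-envelope sketch of the overlap arithmetic (the ~420 nt figure is the upper end of the estimate above; mergePairs requires 12 nt of overlap by default):

# Overlap left for merging = total truncation length - amplicon length
trunc_total <- 240 + 230   # current truncLen settings
amplicon    <- 420         # expected post-primer amplicon length (upper end)
trunc_total - amplicon     # ~50 nt overlap; dropping 20 nt still leaves ~30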

luigallucci commented 6 months ago

Fine, got it, thank you!

Anyway, I have another question...

I was dealing with chimera removal, and my read counts stay more or less at the same level, but the number of ASVs drops significantly after this step, e.g.

> seqtab.nochim1 <- removeBimeraDenovo(seqtab1, method="consensus", multithread= 20, verbose=TRUE)
Identified 2223 bimeras out of 3201 input sequences.
> seqtab.nochim1 <- removeBimeraDenovo(seqtab1, minFoldParentOverAbundance = 8, method="consensus", multithread= 20, verbose=TRUE)
Identified 1952 bimeras out of 3201 input sequences.

Same samples, same process; the only difference is the minFoldParentOverAbundance argument.

Do you have a guess as to why this is happening?

I'm using cutadapt to remove the primers:

Command line parameters: -g CCTACGGGNGGCWGCAG -a GGATTAGATACCCBDGTAGTC --discard-untrimmed -m 50 -G GACTACHVGGGTATCTAATCC -A CTGCWGCCNCCCGTAGG --discard-untrimmed -m 50 --revcomp -j 20 -n 2 -o /bioinf/home/lgallucc/M190_analysis/A_pool1/Renamed//cutadapt_path/9_R1.fastq.gz -p /bioinf/home/lgallucc/M190_analysis/A_pool1/Renamed//cutadapt_path/9_R2.fastq.gz /bioinf/home/lgallucc/M190_analysis/A_pool1/Renamed//filtN/9_R1.fastq.gz /bioinf/home/lgallucc/M190_analysis/A_pool1/Renamed//filtN/9_R2.fastq.gz

                 Forward Complement Reverse RevComp
FWD.ForwardReads       2          0       0       0
FWD.ReverseReads       0          0       0       0
REV.ForwardReads       0          0       0       0
REV.ReverseReads       0          0       0       0

FWD.ForwardReads       0          0       0       8
FWD.ReverseReads      15          0       0       0
REV.ForwardReads       0          0       0       0
REV.ReverseReads       0          0       0       0

FWD.ForwardReads       1          0       0       2
FWD.ReverseReads       6          0       0       0
REV.ForwardReads       0          0       0       0
REV.ReverseReads       1          0       0       0

FWD.ForwardReads       2          0       0       7
FWD.ReverseReads      15          0       0       0
REV.ForwardReads       0          0       0       0
REV.ReverseReads       1          0       0       0

FWD.ForwardReads       0          0       0       0
FWD.ReverseReads       0          0       0       0
REV.ForwardReads       0          0       0       0

Even so, it seems that some primers are not removed. I found that when increasing the minimum length these non-zero values disappear, probably because there is variability in the length of the sequences produced by the sequencing center (n.b. these are new NextSeq data, unlike the first example that I showed you).
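For reference, tables like the one above can be generated with the primer-check snippet from the dada2 ITS workflow; a minimal sketch, assuming fnFs.cut and fnRs.cut are the cutadapt output files (the primer sequences are taken from the cutadapt call above):

library(Biostrings)
library(ShortRead)

# All orientations in which a primer could appear in a read
allOrients <- function(primer) {
    dna <- DNAString(primer)
    sapply(list(Forward = dna, Complement = complement(dna),
                Reverse = reverse(dna), RevComp = reverseComplement(dna)),
           toString)
}

# Number of reads in a fastq file with at least one hit of the given pattern
primerHits <- function(primer, fn) {
    nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
    sum(nhits > 0)
}

FWD.orients <- allOrients("CCTACGGGNGGCWGCAG")     # 341F
REV.orients <- allOrients("GACTACHVGGGTATCTAATCC") # 785R

rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[1]]),
      FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.cut[[1]]),
      REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.cut[[1]]),
      REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.cut[[1]]))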

benjjneb commented 6 months ago

I was dealing with chimera removal, and my read counts stay more or less at the same level, but the number of ASVs drops significantly after this step

This is normal (as mentioned in the dada2 tutorial "Remove chimeras" section).

Chimeras tend to be very diverse, as you can imagine, since they can come from combinations of any pair of real ASVs, with different breakpoints between them; and they tend to appear at low abundances, since they are most often formed after the first few PCR cycles. So in typical data they can make up a substantial fraction of ASVs, but should be a low fraction of reads.
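A quick sanity check along these lines, as a minimal sketch using the objects from the session above:

# Reads are counted by abundance, ASVs by column: chimera removal should
# leave the read fraction high even when many ASVs are flagged
sum(seqtab.nochim1) / sum(seqtab1)    # fraction of reads kept
ncol(seqtab.nochim1) / ncol(seqtab1)  # fraction of ASVs kept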

luigallucci commented 6 months ago

@benjjneb thank you for your reply!