benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
469 stars 142 forks source link

Strange pattern of the seqtab #1350

Closed xunwenchen closed 4 months ago

xunwenchen commented 3 years ago

Dear Benjamin and team members, I found that my seqtab has a regular and strange pattern as shown in the attached capture, i.e., for each ASV, its counts are zero for every three samples, which seems a regular pattern. I found this when I tried to plot NMDS using phyloseq, generating a plot with an uncommon pattern (only 3 dots can be seen). I used "vegan" to plot NMDS using the same seqtab, the NMDS plot was the same as plotted by phyloseq. If I use the usearch+vegan, the NMDS plot is normal (several clusters of different treatments).

I wonder if any silly mistake in my code? It is grateful to have your suggestions.

seqtab_capture

Command #########

library(dada2); packageVersion("dada2") path <- "data/16s"

fnFs <- sort(list.files(path, pattern="_1.fq.gz", full.names = TRUE)) fnRs <- sort(list.files(path, pattern="_2.fq.gz", full.names = TRUE))

sample.names <- sapply(strsplit(basename(fnFs), "_"), [, 1)

plotQualityProfile(fnFs[1:2]) plotQualityProfile(fnRs[1:2])

filtFs <- file.path(path, "filtered", paste0(sample.names, "_1_filt.fq.gz")) filtRs <- file.path(path, "filtered", paste0(sample.names, "_2_filt.fq.gz")) names(filtFs) <- sample.names names(filtRs) <- sample.names

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft = c(32, 20), truncLen=c(180, 150), maxN=0, maxEE=c(1,1), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=FALSE) head(out) reads.in reads.out A601.1_split_1.fq.gz 92753 87849 A601.2_split_1.fq.gz 73055 69326 A601.3_split_1.fq.gz 69958 65949 A602.1_split_1.fq.gz 85154 80487 A602.2_split_1.fq.gz 108292 102136 A602.3_split_1.fq.gz 79428 75064

errF <- learnErrors(filtFs, multithread=TRUE) errR <- learnErrors(filtRs, multithread=TRUE)

plotErrors(errF, nominalQ=TRUE)

derepFs <- derepFastq(filtFs) derepRs <- derepFastq(filtRs)

names(derepFs) <- sample.names names(derepRs) <- sample.names

dadaFs <- dada(derepFs, err=errF, multithread=TRUE) dadaRs <- dada(derepRs, err=errR, multithread=TRUE)

mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)

seqtab <- makeSequenceTable(mergers)

View(seqtab)

xunwenchen commented 3 years ago

Some more info, seqtab2 is with non-target length removed. Thank you

head(track) input filtered denoisedF denoisedR merged seqtab nonchim seqtab2 nonchim2 A601.1 92753 87849 86321 84929 75497 75497 71051 75187 70935 A601.2 73055 69326 67957 66810 59157 59157 56102 58964 56088 A601.3 69958 65949 64545 63614 55599 55599 52787 55425 52743 A602.1 85154 80487 78115 76844 61788 61788 59575 61583 59502 A602.2 108292 102136 99657 97720 79398 79398 76489 79121 76443 A602.3 79428 75064 73228 71715 58323 58323 56781 58182 56754

benjjneb commented 3 years ago

Can you say a little bit more about the different runs(?) that you are merging together here. In other words, what is the differnce at the sample level between A601.1, A601.2 and A601.3?

In addition, how were these samples processed before entering into this final sequence table?

xunwenchen commented 3 years ago

Hi Ben, A601.1, A601.2, and A601.3 are repeated samples of the same sampling point. A601, A602, and A603 are replicates of the same treatment. It is expected that they are similar. A6301 is from a different treatment.

What I've got are paired-end reads with unremoved barcodes and primers. I have "A601.1_split_1.fq.gz" and "A601.1_split_2.fq.gz", etc. The target was the V4 region about 250 bp. One of the sequence in "A601.1_split_1.fq":

@HWI-D00477:954:HYF3GBCX2:1:1101:6022:2245 1:N:0:GGCGGAAT AGCTGTCGGTAAGGACTACAAGGGTTTCTAATCCCGTTCGCTCCCCATGCTTTCGCACTCCAGCGTCGGTAGGGACCCAGAGAGCTGCCTTCGCTTTTGGCGTTCCTTCGTAGATCTGCGGATTTCACCCCTACACACGAAATTCCACTCTCCTCTGTCTCACTCAAGTGAATTGGTTTCGAGAGCATTCCACCAGTTTTTGGCGACTTTCACTTTCAACCCGATTCACCGCCTACGTGCCCTTTACGCC + DDCDDIIIIIIIIIHIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIHIIIIIIIIIIIIIIIIIIIIGIIHIIIIIIIIIHIIIIIIIIIIIIIIHIHHIIIIHIHIIHIIHIHIHIIIIIIHIEHH

Please let me know if you need more information. Thank you for your suggestions.

benjjneb commented 3 years ago

What I don't understand right now is why there would be an ASV consistently appearing in ".1" samples, and not others. Is there a commonality between ".1" samples? For example, were they al processed in the same run? Or using a common library prep? This looks like a technical artifact to me, stemming from some similarity between the ways in which the .1 samples were created, which is why I am asking.

Can you also provide the primer sequences you are using?

xunwenchen commented 3 years ago

Hi Ben,

What I don't understand right now is why there would be an ASV consistently appearing in ".1" samples, and not others.

Sometimes it could appear in ".3" sample, see the last line of the ASV table capture, and two more captures below for the whole pattern.

were they al processed in the same run? Or using a common library prep?

I am not sure about these...what I know is that all the samples should be processed at the same time with the same lib prep (same batch). But let me try to confirm later.

Can you also provide the primer sequences you are using?

F: GGACTACHVGGGTWTCTAAT R: GTGCCAGCMGCCGCGGTAA

Appreciate your help!

2 3

benjjneb commented 3 years ago

Would it be possible to share the sequence table with me so that I can inspect these patterns more? It would be helpful to get a sense of how general these are and some summary statistics on their patterns when they show up, which would be easier to do myself.

Sharing via email would be fine, preferably at an RDS object. benjamin DOT j DOT calllahan AT gmail DOT com

xunwenchen commented 3 years ago

Hi Ben, thank you for spending time on this. The seqtab has been sent to your email address.

benjjneb commented 3 years ago

What I am seeing is that most of the ASVs in the table come in three different "flavors", one base sequences, when with an additional T at the start of the sequence, and one with an additional AT at the start of the sequence. I used this code to quickly explore this pattern:

library(dada2); packageVersion("dada2")
setwd("~/Desktop/statchin/")
st <- readRDS("seqtab_to_Ben.rds")
# Basic stats
dim(st) # [1]    54 14531
st[1:5,1:5]
plot(colSums(st))
# First few have majority of reads
unname(head(colSums(st), 10))
# First 6, 3 at 300-400k, then 3 at 100-150k
sq <- getSequences(st)
head(nchar(sq)) # [1] 253 254 255 251 252 253
# Length differences
outer(sq[1:3], sq[1:3], nwhamming, vec=TRUE)
# All zeros, identical up to length difference
outer(sq[4:6], sq[4:6], nwhamming, vec=TRUE)
# All zeros, identical up to length difference
outer(sq[1:6], sq[1:6], nwhamming, vec=TRUE)
# ASVs 1-3 and ASVs 4-6 represent two different sequences, with underlying length variation
nwalign(sq[[1]], sq[[2]]) # Sq 2 has an extra T at start
nwalign(sq[[1]], sq[[3]]) # Sq 3 has an extra AT at staert
nwalign(sq[[4]], sq[[5]]) # Sq 5 has an extra T at start
nwalign(sq[[4]], sq[[6]]) # Sq 6 has an extra AT at start

A quick visualization of the start of these length variations of the same sequence:

library(DECIPHER)
aln123 <- AlignSeqs(DNAStringSet(sq[1:3]))
aln123
Screen Shot 2021-06-14 at 12 38 32 PM

That pattern is repeated for basically all the abundant ASVs at least. This suggests that spurious length variation is either being introduced somewhere along the line. This looks to me like your library might have been prepared using "heterogeneity spacers", which intentionally introduce length variation into amplicon data. Alternatively, there might be different length barcodes for different samples. Either way, this spurious length variation is what is causing the non-overlaps in the sequence table you observed.

xunwenchen commented 3 years ago

Hi, thank you for the investigation. I am still trying to digest it.....meanwhile, I wonder if any suggestion? Thanks

benjjneb commented 3 years ago

You can use the collapseNoMismatch function to "collapse" the length variants into a single ASV. That will mostly fix this problem, but I'll admit the function is not very well optimized and so may take a while to run on your full sequence table.

The ideal solution is to figure out exactly what's going on with the library preparation (my guess) that is causing this length variation, and fix it upstream. That will help with the detection of rare variants that may be missed if a small number of reads gets split into these different length unique sequences during the dada denoising step. But collapseNoMismatch may well be good enough.