benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Duplication of Species column when using addSpecies function #1539

Closed emankhalaf closed 4 months ago

emankhalaf commented 2 years ago

Hi @benjjneb

To assign taxonomy to V4 sequences, I use two consecutive steps, following the DADA2 tutorials:

taxa <- assignTaxonomy(seqtab.nochim2, "/home/silva_nr99_v138.1_wSpecies_train_set.fa.gz", multithread=TRUE)

taxa <- addSpecies(taxa, "/home/silva_species_assignment_v138.fa.gz")

Then I noticed that the taxonomy table has a duplicated Species column: the first comes from assignTaxonomy and the second from addSpecies. The species-level classification falls into one of the following cases:

  1. For some sequences/ASVs, both steps agree on the same species-level taxonomy, which is fine.
  2. Some sequences were assigned to the species level by the first step only (the first Species column has an assignment and the second is NA).
  3. Some sequences were assigned to the species level by the second step only (the first Species column is NA and the second has an assignment). I am wondering how I can merge the two columns before generating the phyloseq object, for better taxonomic resolution (see the sketch below).
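
A minimal sketch of such a merge, purely hypothetical and assuming the duplicated columns are both named "Species" (the reply below recommends avoiding the duplication in the first place, so this is illustration only):

# Hypothetical: coalesce the two Species columns, preferring the assignTaxonomy
# result and falling back to addSpecies where it is NA. Column names are assumed.
sp.cols <- which(colnames(taxa) == "Species")
sp1 <- taxa[, sp.cols[1]]   # from assignTaxonomy (wSpecies training set)
sp2 <- taxa[, sp.cols[2]]   # from addSpecies
taxa.merged <- cbind(taxa[, -sp.cols, drop = FALSE],
                     Species = ifelse(is.na(sp1), sp2, sp1))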

Thanks! Eman

benjjneb commented 2 years ago

Our recommendations for species-level assignment from 16S data are:

(1) If using short-read 16S, use assignTaxonomy down to the genus level only, and then addSpecies.
(2) If using long-read 16S, use assignTaxonomy down to the species level (i.e. use the wSpecies training fasta). Don't use addSpecies at all.
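
A minimal sketch of the two patterns, assuming the DADA2-formatted Silva v138.1 reference files (local paths are placeholders) and the sequence tables used elsewhere in this thread:

# Short-read 16S (e.g. MiSeq V4): genus-level training set, then addSpecies
taxa <- assignTaxonomy(seqtab.nochim2, "/home/silva_nr99_v138.1_train_set.fa.gz", multithread=TRUE)
taxa <- addSpecies(taxa, "/home/silva_species_assignment_v138.1.fa.gz")

# Long-read 16S (e.g. PacBio full-length): wSpecies training set, no addSpecies
tax <- assignTaxonomy(st.nochim, "/home/silva_nr99_v138.1_wSpecies_train_set.fa.gz", multithread=TRUE)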

emankhalaf commented 2 years ago

@benjjneb Yes, this is what I did. The problem is with the short reads, and I do not know how to resolve it. Your help is much appreciated!

benjjneb commented 2 years ago

If it's short reads, you should be using the other Silva reference file, the one that does not have wSpecies in the file name. That will naturally coordinate with the addSpecies function.

emankhalaf commented 2 years ago

@benjjneb Oh, sorry! Will do that.

Thanks so much!

emankhalaf commented 2 years ago

@benjjneb

One more question, about selecting a proper V4 amplicon length. The length distribution from table(nchar(getSequences(seqtab.nochim))) is:

226 227 228 229 230 231 232 233 243 245 246 249 250 251 252 253 254 255 256 257 258 259 260 261 262 276 422
  2   2   3   3   3   2   2   1   2   1   2   1   1   2   8 206 221 384 332 209 136  22   2   1   2   1   1

Then, I used

plotLengthDistro <- function(st) {
  require(ggplot2)
  tot.svs <- table(nchar(colnames(st)))
  tot.reads <- tapply(colSums(st), nchar(colnames(st)), sum)
  df <- data.frame(Length=as.integer(c(names(tot.svs), names(tot.reads))),
                   Count=c(tot.svs, tot.reads),
                   Type=rep(c("SVs", "Reads"), times=c(length(tot.svs), length(tot.reads))))
  p <- ggplot(data=df, aes(x=Length, y=Count, color=Type)) + geom_point() + facet_wrap(~Type, scales="free_y") + theme_bw() + xlab("Amplicon Length")
  p
}
# Use on your sequence table
plotLengthDistro(seqtab.nochim)
plotLengthDistro(seqtab.nochim) + scale_y_log10()

[Plot: number of SVs and total reads per amplicon length, from plotLengthDistro(seqtab.nochim)]

I selected this range:

seqtab2 <- seqtab[,nchar(colnames(seqtab)) %in% 251:259]

However, I found a minor shift in taxa counts after removing organelles, compared to the phyloseq object built without excluding amplicons outside the V4 length range. Is that okay? I mean, there is no big difference, but I would just like to select sequences/amplicons of the expected V4 length.

benjjneb commented 2 years ago

I like your plot!

However, I found a minor shift in taxa counts after removing organelles, compared to the phyloseq object built without excluding amplicons outside the V4 length range.

I don't really know what that means, but it doesn't sound too concerning. Often amplicons outside the expected length range come from off-target amplification of things like organelles.

Is that okay? I mean, there is no big difference, but I would just like to select sequences/amplicons of the expected V4 length.

Yes, restricting to an expected length window is OK. No different than cutting a band out of a gel.

emankhalaf commented 2 years ago

@benjjneb Much thanks!

emankhalaf commented 2 years ago

@benjjneb I have a question regarding the taxonomy assignment. I noticed that there is a difference in the taxonomy of the genus Leuconostoc between the trained datasets, although both are from the same release (v138). In other words, in silva_nr99_v138.1_wSpecies_train_set.fa.gz Leuconostoc belongs to Lactobacillaceae, but in silva_nr_v138_train_set.fa.gz it belongs to Leuconostocaceae. This is confusing when comparing the microbiome taxonomy from MiSeq vs PacBio at different taxonomic levels. Just to confirm, and as recommended, for MiSeq I am using:

taxa <- assignTaxonomy(seqtab.nochim2, "/home/silva_nr_v138_train_set.fa.gz", multithread=TRUE)
taxa <- addSpecies(taxa, "/home/silva_species_assignment_v138.fa.gz")

and for PacBio:

tax <- assignTaxonomy(st.nochim, "/home/silva_nr99_v138.1_wSpecies_train_set.fa.gz", tryRC = TRUE, minBoot = 80, multithread=TRUE)

Any recommendations here?

Thanks!

benjjneb commented 2 years ago

You have version 138 in one case, and 138.1 in the other case. You'll want to make sure you are using 138.1 consistently. You can download the 138.1 versions here: https://zenodo.org/record/4587955#.YnAUypPMKrM

emankhalaf commented 2 years ago

@benjjneb Thank you very much!

emankhalaf commented 2 years ago

@benjjneb There are two available tutorials for DADA2 (1.16 and 1.8). The latter (1.8) has a dereplication step:

derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)
# Name the derep-class objects by the sample names
names(derepFs) <- sample.names
names(derepRs) <- sample.names

This step is not included in the 1.16 workflow. I ran both workflows and found the difference is minor: including the dereplication step changes the results by 11 taxa and 22 reads. Anyway, I am sticking with the dereplication step. Any explanation here, please? Thanks!

benjjneb commented 2 years ago

Dereplication still happens, it just happens "on the fly" now, without the user having to run derepFastq themselves.
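
In other words, in the 1.16 workflow the filtered fastq files can be passed straight to dada, which dereplicates them internally; roughly:

dadaFs <- dada(filtFs, err=errF, multithread=TRUE)
dadaRs <- dada(filtRs, err=errR, multithread=TRUE)
# Equivalent in outcome to running derepFastq() first and passing the derep objects to dada()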

emankhalaf commented 2 years ago

@benjjneb This means both approaches are okay (whether or not derepFastq is run explicitly), correct?

benjjneb commented 2 years ago

Yes.

emankhalaf commented 2 years ago

Much thanks!

emankhalaf commented 2 years ago

@benjjneb

I am interested in using the "Pooling for sample inference" method, but I am a little confused about the order of the steps and need to double-check whether what I am doing is correct. Please see below for the MiSeq sequences:

errF <- learnErrors(filtFs, multithread=TRUE)
errR <- learnErrors(filtRs, multithread=TRUE)
derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)

The new step "Pooling for sample inference"

system.time(dd.poolF <- dada(derepFs, err=errF, pool=TRUE, DETECT_SINGLETONS = FALSE, multithread=TRUE))
system.time(dd.poolR <- dada(derepRs, err=errR, pool=TRUE, DETECT_SINGLETONS = FALSE, multithread=TRUE))
mergers <- mergePairs(dd.poolF, derepFs, dd.poolR, derepRs, verbose=TRUE)

seqtab <- makeSequenceTable(mergers)
dim(seqtab)

seqtab2 <- seqtab[,nchar(colnames(seqtab)) %in% 251:259]

seqtab.nochim2 <- removeBimeraDenovo(seqtab2, method="consensus", multithread=TRUE, verbose=TRUE)
dim(seqtab.nochim2)

Then,

getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dd.poolF, getN), sapply(dd.poolR, getN), sapply(mergers, getN), rowSums(seqtab.nochim2))

colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names2
head(track)

Update: using this code, I assigned the taxonomy with assignTaxonomy and addSpecies and got 684 taxa, whereas I got 1520 taxa without pooling. Any comment here? Also, the difference in total read count between the two methods is minor.

Another update: I re-ran the code with DETECT_SINGLETONS = TRUE and got 1169 taxa, which is still fewer than the taxa count (1520) without pooling. Any recommendations here?

Also, I tried the pooling method with the PacBio reads as follows:

dd <- dada(drp, err=err, pool=TRUE, BAND_SIZE=32, OMEGA_A=1e-10, DETECT_SINGLETONS=FALSE, multithread=TRUE)

I noticed that it takes longer than the default method (more than 2 hours). Anyway, I got a higher taxa count (approximately 200 more taxa) compared to the default workflow. However, with the MiSeq reads the taxa count was dramatically reduced by pooling, while the read counts did not change much for either MiSeq or PacBio. Does this mean that we have more contaminants/chimeras from MiSeq than from PacBio? For clarification, my samples represent different genotypes, so what would be the best method to infer ASVs? Is independent sample processing (the default workflow) more suitable for my samples?

To conclude, my samples were sequenced using both MiSeq and PacBio. When using the default workflow without detecting singletons (DETECT_SINGLETONS = FALSE), MiSeq generated more than double the taxa count of PacBio, whereas when using the pooling method (still without singletons) MiSeq generated a lower taxa count than PacBio. Still, the read counts from MiSeq are far greater than PacBio's.

Your recommendations are much appreciated!

Thank you!

emankhalaf commented 2 years ago

Hi @benjjneb I checked the microbiome after pooling and noticed that one of the dominant taxa (dominant in both MiSeq and PacBio) disappeared from MiSeq but is still there in PacBio after pooling. I do not know why. Also, when using alpha diversity to compare the two microbiomes, is it acceptable to use pooling with PacBio and independent sample processing with MiSeq? I mean, should I use the same workflow/parameters to infer ASVs from both technologies so that they can be compared?

benjjneb commented 2 years ago

I checked the microbiome after pooling and noticed that one of the dominant taxa (dominant in both MiSeq and PacBio) disappeared from MiSeq but is still there in PacBio after pooling. I do not know why.

I don't know. You could try to track the ASV associated with that taxon at different steps in the workflow to see what might have caused the change. Is it there before chimera removal, for example?
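
For example, a minimal sketch of such a check (asv.seq is a hypothetical variable holding the full sequence of the ASV in question; table names follow the workflow above):

# Is the ASV present before and after chimera removal?
asv.seq %in% colnames(seqtab2)         # before removeBimeraDenovo
asv.seq %in% colnames(seqtab.nochim2)  # after removeBimeraDenovo

# Per-sequence chimera calls from the consensus method
bim <- isBimeraDenovoTable(seqtab2, multithread=TRUE)
bim[which(colnames(seqtab2) == asv.seq)]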

Also, when using alpha diversity to compare both microbiomes, is it acceptable to use pooling with PacBio and independent sample processing with MiSeq? I mean should I use the same workflow/parameters to infer ASVs from both technologies to be able to compare between them?

You are already using different methodologies. Changing aspects of the bioinformatic processing could make those methodologies less comparable, but it could also make them more comparable if you are tailoring the bioinformatics to the specifics of each technology.

emankhalaf commented 2 years ago

Hi @benjjneb

I don't know. You could try to track the ASV associated with that taxon at different steps in the workflow to see what might have caused the change. Is it there before chimera removal, for example?

Sorry, I did not explain this issue well. This taxon is dominant in the PacBio data (in terms of abundance and prevalence) but much less dominant in the MiSeq data when using independent sample processing. When I used pooling, it disappeared from MiSeq but remained dominant in PacBio. I looked into this and found it is due to a primer bias against specific bacterial taxa, including this taxon, which is poorly captured by the 515F-806R primer set but well identified with the 27F-1492R primers. As you recommended, I will check its presence before chimera removal and keep the forum updated.

You are already using different methodologies. Changing aspects of the bioinformatic processing could make those methodologies less comparable, but it could also make them more comparable if you are tailoring the bioinformatics to the specifics of each technology.

I am using alpha diversity metrics to compare these technologies, and I found that what was significant (for example, observed richness) when using independent sample processing is no longer significant after pooling, since I have a dramatic reduction in taxa count from MiSeq, whereas the count increased for PacBio (please see my question posted 6 days ago in this issue; I copied part of it below).

I got a higher taxa count (approximately 200 more taxa) compared to the default workflow. However, with the MiSeq reads the taxa count was dramatically reduced by pooling, while the read counts did not change much for either MiSeq or PacBio. Does this mean that we have more contaminants/chimeras from MiSeq than from PacBio? For clarification, my samples represent different genotypes, so what would be the best method to infer ASVs? Is independent sample processing (the default workflow) more suitable for my samples?

To conclude, my samples were sequenced using both MiSeq and PacBio. When using the default workflow without detecting singletons (DETECT_SINGLETONS = FALSE), MiSeq generated more than double the taxa count of PacBio, whereas when using the pooling method (still without singletons) MiSeq generated a lower taxa count than PacBio. Still, the read counts from MiSeq are far greater than PacBio's.

In my project, I think pooling worked well with PacBio and helped identify more rare taxa; however, it is not the best choice with MiSeq, since it causes the disappearance of some important taxa.

Update: I checked the ASVs without removing chimeras, and I found the taxon, so this means the taxon was removed as a chimera. Also, the significant reduction in ASVs from the MiSeq pipeline means those taxa were called chimeras (they are mostly rare taxa, apart from the taxon that suffered from the primer bias) and removed when using pool = TRUE. Any comment here?
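
One knob that exists for this situation, shown only as a hedged illustration rather than a recommendation from this thread: removeBimeraDenovo passes extra arguments through to isBimeraDenovoTable, so the "parents" of a putative chimera can be required to be much more abundant before a sequence is flagged, e.g.:

# minFoldParentOverAbundance is a parameter of the underlying isBimeraDenovoTable;
# the value 8 is illustrative only, not a recommendation from this thread
seqtab.nochim2 <- removeBimeraDenovo(seqtab2, method="consensus",
                                     minFoldParentOverAbundance=8,
                                     multithread=TRUE, verbose=TRUE)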

Thank you so much and sorry if I bothered you with my questions.