Missing reads when transitioning from Dada2 to phyloseq

GingerMATE commented 2 months ago

Hello,

im encountering a similar issue to related to my sample reads. After completing the Dada2 pipeline according to the tutorial, i created a phyloseq object, but in doing so i loose reads in almost all my samples. Some of my samples loose over 50%.

Here is an excerpt from my track_reads file with added entry for when i create my phyloseq object:

sample | input | filtered | dadaF | dadaR | merged | nonchim | finalPercReadsKept % | reads physeq obj ArcB1 | 82650 | 47038 | 45576 | 46265 | 43855 | 42770 | 51.7 | 42770 AuB1 | 90559 | 56066 | 54249 | 55233 | 52257 | 50278 | 55.5 | 32479 AuB2 | 100718 | 62615 | 59976 | 60933 | 57438 | 55009 | 54.6 | 53549 AuB3 | 137090 | 81540 | 80356 | 80983 | 79213 | 76150 | 55.5 | 27221

There is a large iscrepancy between the seqtab. nochim reads and the phyloseq object.

This is the code i use to create my seqtab.nochim file in the pipeline:

seqtab <- makeSequenceTable(mergers)

output_directory="/Analysis/Pipeline"

saveRDS(mergers, "/Analysis/RData/mergers.rds")

save.image(file.path(output_directory, "mergers.RData"))

#Inspect the merger data.frame from the first sample

head(mergers)

class(mergers)

dim(seqtab)

#Inspect distribution of sequence lengths

table(nchar(getSequences(seqtab)))

seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE) # minFoldParentOverAbundance = 8, multithread= parallel::detectCores(),

table(nchar(getSequences(seqtab.nochim)))

dim(seqtab.nochim)

sum(seqtab.nochim)/sum(seqtab)

saveRDS(seqtab, "/Analysis/RData/seqtab.rds")

saveRDS(seqtab.nochim, "/Analysis/RData/seqtab_nochim.rds")

And here is the code i use to create my phyloseq object:

Read sample Metadata and sample names

sample_metadata <- read.table("dna_metadata.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)
rownames(sample_metadata) <- sample_metadata$sample_id
head(sample_metadata)
Read asv and taxa files

asv_counts_table <- readRDS("seqtab_nochim.rds")
asv_seqs <- colnames(asv_counts_table)
asv_headers <- vector(dim(asv_counts_table)[2], mode="character")
for (i in 1:dim(asv_counts_table)[2]) {
asv_headers[i] <- paste("ASV", i, sep="_")
}
colnames(asv_counts_table) <- asv_headers
head(asv_counts_table)
class(asv_counts_table)
sample_id <- sample_metadata$sample_id
rownames(asv_counts_table) <- sample_id
load taxa file

taxa_with_species <- read.table("taxa_.csv", header=TRUE, row.names=1, sep=",", stringsAsFactors=FALSE)
head(taxa_with_species)
class(taxa_with_species)
taxa_with_species <- as.matrix(taxa_with_species)
Create phyloseq object

physeq_ps137 <- phyloseq(otu_table(asv_counts_table, taxa_are_rows=FALSE),
tax_table(taxa_with_species), sample_data(sample_metadata))

physeq_ps137
readcount(physeq_ps137)

Any idea what could cause the reads to disappear? Or am i using the readcount function wrong? Thanks in advance!

benjjneb commented 2 months ago

I'm not sure what is defining the readcount function, but it is not a phyloseq function.

If you are trying to get the total abundances (numbers of reads) in each sample from a phyloseq object, the appropriate function to use is sample_sums.

GingerMATE commented 2 months ago

I tried sample_sums to check the reads, but it gave me the same result.

Create phyloseq object

physeq_ps137 <- phyloseq(otu_table(asv_counts_table, taxa_are_rows=TRUE), 
                         tax_table(taxa_with_species), sample_data(sample_metadata)) 

physeq_ps137
sample_sums(physeq_ps137)
readcount(physeq_ps137)

ArcB1 AuP1 AuB1 AuB2 AuB3
42770 45601 32479 53549 27221

The problematic step is between the seqtab.nochim and the physeq object. I assigend my taxa with the tryRC=TRUE argument in the assignTaxonomy function. Do you think that this could have an affect on the number of reads?

Here is the code i used to assign the taxonomy:

Assign taxonomy

unite.ref <- "DNA_results/Dada2/silva_nr99_v138.1_train_set.fa.gz"  # CHANGE ME to location on your machine

seqtab.nochim <- readRDS("DNA_results/seqtab_nochim.rds")

taxa <- assignTaxonomy(seqtab.nochim, unite.ref, multithread = parallel::detectCores(), tryRC = TRUE)

taxa <- addSpecies(taxa, "DNA_results/Dada2/silva_species_assignment_v138.1.fa.gz")

taxa.print <- taxa  # Removing sequence rownames for display only

rownames(taxa.print) <- NULL

head(taxa.print)

write.table (taxa, file ="DNA_results/Dada2/taxa_PS137_general_def_tryRCTRUE.csv", row.names = T, sep = "\t")

asv_seqs <- rownames(taxa_table)

asv_names <- character(nrow(taxa_table))

for (i in 1:nrow(taxa_table)) {
  asv_names[i] <- paste("ASV", i, sep="_")
}

rownames(taxa_table) <- asv_names

write.table(taxa_table, "DNA_results/Dada2/taxa_ps137_tryRCTRUE", sep="\t", quote=F, col.names=T)

benjjneb commented 2 months ago

I would suggest trying to construct the phyloseq object directly from the original sequence table and tax table from DADA2 (i.e. that contain the full DNA sequences as IDs), and checking if that solves the change in read counts issue.

It is simpler and less error prone to then switch to "ASV1" style IDs in the phyloseq object. Code for this is included in the DADA2 tutorial "Handoff to phyloseq" section:

dna <- Biostrings::DNAStringSet(taxa_names(ps))
names(dna) <- taxa_names(ps)
ps <- merge_phyloseq(ps, dna)
taxa_names(ps) <- paste0("ASV", seq(ntaxa(ps)))
ps

benjjneb / dada2

Missing reads when transitioning from Dada2 to phyloseq #1933

Create phyloseq object

Assign taxonomy