Closed goodguynickpt closed 3 months ago
@goodguynickpt have you tried running these on your original raw data (prefiltered, pretrimmed)? Same for your reverse reads. The error profiles are missing all values below Q30 which is likely the issue.
The general consensus I believe is to perform standard primer sequence trimming, but to generally avoid quality trimming in favor of filtering based on expected errors.
hello @cjfields thank you for your reply! I was following the DADA2 tutorial I found online and so I assumed the error rate plots came right after I ran the filter and trim part.
I agree with you on the Q30 part because it doesn't make sense for me to trim something with a high QScore, only to check the expected errors of that very same filtered data. it should be the raw data.
I ran it again, using the raw data (pretty much going against the tutorial) and the result is almost the same (not sure I understand why).
@goodguynickpt sorry, you are correct, I was thinking of the read quality plots and not the error profiles; I needed more caffeine before I answered initilally.
But even so, it's still very strange; your read quality plots look like they have a full range of quality scores, but the error profiles in either case suggest few reads and a skewed quality score distribution (nothing below Q30). How are you running the initial trimming steps (code)?
So, initially, I was running this code (and it "worked" - but I used a QScore of 10 while trimming just so I could merge the reads so it was not viable):
packages <- c("ggplot2", "dada2")
installed_packages <- packages %in% rownames(installed.packages()) if (any(installed_packages == FALSE)) { install.packages(packages[!installed_packages]) }
invisible(lapply(packages, library, character.only = TRUE))
setwd("C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/BURSA_tissues_16S_microbioma_infravec/ZIP Rbursa INFRAVEC/220620-Infravec2-8115/1/Raw_Data")
path = "C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/BURSA_tissues_16S_microbioma_infravec/ZIP Rbursa INFRAVEC/220620-Infravec2-8115/1/Raw_Data"
list.files(path)
fnFs <- sort(list.files(path, pattern="_R1_001.fastq", full.names = TRUE)) fnRs <- sort(list.files(path, pattern="_R2_001.fastq", full.names = TRUE))
sample.names <- sapply(strsplit(basename(fnFs), "_"), [
, 1); sample.names
plotQualityProfile(fnFs[1:12]) plotQualityProfile(fnRs[1:12])
filtFs <- file.path(path, "filtered", paste0(sample.names, "_R1_filt.fastq.gz")) filtRs <- file.path(path, "filtered", paste0(sample.names, "_R2_filt.fastq.gz")) names(filtFs) <- sample.names names(filtRs) <- sample.names
out_filtpath <- "C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/BURSA_tissues_16S_microbioma_infravec/ZIP Rbursa INFRAVEC/220620-Infravec2-8115/1/Raw_Data/filtered" out <- sort(list.files(out_filtpath, pattern="_filt.fastq.gz", full.names = TRUE)) out
errF <- learnErrors(filtFs, multithread=TRUE)
errR <- learnErrors(filtRs, multithread=TRUE)
plotErrors(errF, nominalQ=TRUE) plotErrors(errR, nominalQ=TRUE)
dadaFs <- dada(filtFs, err=errF, multithread=TRUE) dadaRs <- dada(filtRs, err=errR, multithread=TRUE) dadaFs[[1:4]] dadaRs[[1:4]]
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose=TRUE)
head(mergers[[1]])
seqtab <- makeSequenceTable(mergers)
dim(seqtab) # Check the number of samples and ASVs
table(nchar(getSequences(seqtab)))
target_length <- 465 # Can be adjusted to expected length seqtab2 <- seqtab[, nchar(colnames(seqtab)) %in% (target_length - 25):(target_length + 1)] # Adjust tolerance as needed
seqtab2.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
dim(seqtab2.nochim)
sum(seqtab2.nochim)/sum(seqtab)
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim") rownames(track) <- sample.names
head(track)
library(dada2)
taxonomy_path <- "C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/BURSA_tissues_16S_microbioma_infravec/ZIP Rbursa INFRAVEC/220620-Infravec2-8115/1/Raw_Data" ref_file <- file.path(taxonomy_path, "silva_nr99_v138.1_train_set.fa.gz")
if (!file.exists(ref_file)) { stop("Reference file not found. Check the path and filename.") }
taxa <- assignTaxonomy(seqtab2.nochim, ref_file, multithread = TRUE)
taxa.print <- taxa # Removing sequence rownames for display only rownames(taxa.print) <- NULL head(taxa.print) taxa.print
if (!require(phyloseq)) { install.packages("physeq") } library(phyloseq)
library(ggplot2)
library(readr) OTU_table <- read.csv("OTU_table.csv") View(OTU_table)
samples.out <- rownames(seqtab2.nochim); samples.out
metadata <- read.csv("tick.csv", header=T) metadata
samples.out <- rownames(seqtab2.nochim); samples.out rownames(metadata) = samples.out; metadata
ps <- phyloseq(otu_table(seqtab2.nochim, taxa_are_rows=FALSE), sample_data(metadata), tax_table(taxa))
plot_richness(ps, x="Infection", measures=c("Shannon", "Simpson"), color="Tissue")
ps.prop <- transform_sample_counts(ps, function(otu) otu/sum(otu)) ord.nmds.bray <- ordinate(ps.prop, method="NMDS", distance="bray")
plot_ordination(ps.prop, ord.nmds.bray, color="Tissue", title="Bray NMDS")
plot_ordination(ps.prop, ord.nmds.bray, color = "Infection", title = "Bray NMDS") + geom_text(aes(label = samples.out), hjust = -1.5, vjust = 0.5, size = 3)
top <- names(sort(taxa_sums(ps), decreasing=TRUE))[1:600] ps.top <- transform_sample_counts(ps, function(OTU) OTU/sum(OTU)) ps.top <- prune_taxa(top, ps.top) plot_bar(ps.top, x="Tissue", fill="Family") + geom_bar(stat="identity", width=0.8, color=NA) +
facet_wrap(~Infection, scales="free_x")
After deciding to just go with the forward reads, I am currently running this:
packages <- c("ggplot2", "dada2")
installed_packages <- packages %in% rownames(installed.packages()) if (any(installed_packages == FALSE)) { install.packages(packages[!installed_packages]) }
invisible(lapply(packages, library, character.only = TRUE))
setwd("C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/1/Raw_Data") # Updated path
setwd("C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/1/Raw_Data")
path = "C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/1/Raw_Data"
list.files(path)
fnFs <- sort(list.files(path, pattern="_R1_001.fastq", full.names = TRUE)) fnRs <- sort(list.files(path, pattern="_R2_001.fastq", full.names = TRUE))
sample.names <- sapply(strsplit(basename(fnFs), "_"), [
, 1); sample.names
plotQualityProfile(fnFs[1:12]) plotQualityProfile(fnRs[1:12])
raw_forward_filenames <- sort(list.files(path, pattern="_R1_001.fastq", full.names = TRUE))
errF <- learnErrors(raw_forward_filenames, multithread=TRUE)
plotErrors(errF, nominalQ=TRUE)
filtFs <- file.path(path, "filtered", paste0(sample.names, "_R1_filt.fastq.gz")) filtRs <- file.path(path, "filtered", paste0(sample.names, "_R2_filt.fastq.gz")) names(filtFs) <- sample.names names(filtRs) <- sample.names
out_filtpath <- "C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/1/Raw_Data/filtered" out <- sort(list.files(out_filtpath, pattern="_R1_001.fastq.gz", full.names = TRUE)) out
forward_filenames <- sort(list.files(out_filtpath, pattern="_R1_001.fastq.gz", full.names = TRUE))
errF <- learnErrors(forward_filenames, multithread=TRUE)
plotErrors(errF, nominalQ=TRUE)
dadaFs <- dada(forward_filenames, err=errF, multithread=TRUE)
dadaFs[[1:4]]
seqtab <- makeSequenceTable(dadaFs)
dim(seqtab) # Check the number of samples and ASVs
table(nchar(getSequences(seqtab)))
target_length <- 250 # Can be adjusted to expected length seqtab2 <- seqtab[, nchar(colnames(seqtab)) %in% (target_length - 27):(target_length + 1)] # Adjust tolerance as needed
seqtab2.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
dim(seqtab2.nochim)
sum(seqtab2.nochim)/sum(seqtab)
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), rowSums(seqtab2.nochim))
colnames(track) <- c("input", "filtered", "dadaFs", "denoisedF", "nonchim") rownames(track) <- sample.names
head(track)
all_files <- sort(list.files("C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/1/Raw_Data/filtered", pattern="_R1_001.fastq.gz")) filenames <- gsub("\\.*$", "", all_files) # Extract sample names from filenames read_counts <- sapply(all_files, function(x) sum(grepl(pattern = "[ATCG]", x))) # Count reads based on sequence characters
out <- data.frame(filenames, read_counts)
track <- cbind(out, sapply(dadaFs, getN), rowSums(seqtab2.nochim)) colnames(track) <- c("SampleID", "Input", "Filtered", "Denoised") rownames(track) <- out$filenames # Use filenames as row names print(track)
library(dada2)
taxonomy_path <- "C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/1/Raw_Data" ref_file <- file.path(taxonomy_path, "silva_nr99_v138.1_train_set.fa.gz")
if (!file.exists(ref_file)) { stop("Reference file not found. Check the path and filename.") }
taxa <- assignTaxonomy(seqtab2.nochim, ref_file, multithread = TRUE)
taxa.print <- taxa # Removing sequence rownames for display only rownames(taxa.print) <- NULL head(taxa.print) taxa.print
if (!require(phyloseq)) { install.packages("physeq") } library(phyloseq)
library(ggplot2)
library(readr) OTU_table <- read.csv("OTU_table.csv") View(OTU_table)
samples.out <- rownames(seqtab2.nochim); samples.out
metadata <- read.csv("tick.csv", header=T) metadata
samples.out <- rownames(seqtab2.nochim); samples.out rownames(metadata) = samples.out; metadata
ps <- phyloseq(otu_table(seqtab2.nochim, taxa_are_rows=FALSE), sample_data(metadata), tax_table(taxa))
plot_richness(ps, x="Infection", measures=c("Shannon", "Simpson"), color="Tissue")
ps.prop <- transform_sample_counts(ps, function(otu) otu/sum(otu)) ord.nmds.bray <- ordinate(ps.prop, method="NMDS", distance="bray")
plot_ordination(ps.prop, ord.nmds.bray, color="Tissue", title="Bray NMDS")
plot_ordination(ps.prop, ord.nmds.bray, color = "Infection", title = "Bray NMDS") + geom_text(aes(label = samples.out), hjust = -1.5, vjust = 0.5, size = 3)
top <- names(sort(taxa_sums(ps), decreasing=TRUE))[1:4470] ps.top <- transform_sample_counts(ps, function(OTU) OTU/sum(OTU)) ps.top <- prune_taxa(top, ps.top) plot_bar(ps.top, x="Tissue", fill="Family") + geom_bar(stat="identity", width=0.8, color=NA) +
facet_wrap(~Infection, scales="free_x")
top <- names(sort(taxa_sums(ps), decreasing=TRUE))[1:4470] ps.top <- transform_sample_counts(ps, function(OTU) OTU/sum(OTU)) ps.top <- prune_taxa(top, ps.top)
plot_bar(ps.top, x="Tissue", fill="Family") + geom_bar(stat="identity", width=0.8, color=NA) +
I apologize for the lack of organization on my script, I do want to tidy it up a bit but I am currently abroad and just trying to get it to work to compare it to my qiime2 analysis (my thesis is basically creating a generalized pipeline with dada2 for further projects).
I can also add that I tried trimming by length and by quality score, initially, but I could never merge properly. hence the decision to trim with a quality score of 30 and picking the forward reads, discarding the reverse reads.
I can't see anything obvious apart from that odd error profile with nothing below Q30. It might not be easy to diagnose w/o seeing example data. Do you see the same issue using 1-2 samples versus all of them? And is it expected that you have 500,000 reads in one of your samples?
Both of your error rate plots look fine, just not typical because you have previously truncated all reads at Q30. The second set of error plots is identical to the first.
We do not recommend using truncQ
at all. maxEE
is the recommended way to control for overall read quality. Previously you probably had issues with merging and chimera removal is because you aren't removing your primers (which usually occur at the start of the reads with V3V4 sequencing on Illumina) and you may also have been truncating your reads too short to merge: The V3V4 amplicon with primers sequences is 440-462nts long, so the two truncation lengths must add up to at least 462 + 12 nts to get proper merging.
So, I think your problems are likely to be resolved just by fixing the filterAndTrim
step by adding trimLeft=c(17,21)
assuming you are using the "Illumina" V3V4 sequencing protocol, removing truncQ
, including maxEE
and picking truncLen=c(TRUNC_F, TRUNC_R)
so that TRUNC_F + TRUNC_R >= 474.
The dada2 tutorial covers much of this.
Ah yes, I missed that the primers weren't removed in filterAndTrim
!
I'd also suggest to do a clean run from scratch (remove data from any prior runs). The input data is 2x250 based on the plotQualityProfile
output (which don't look pre-filtered/trimmed), which is read from the original data path. But maybe this step is picking up the original Q30-filtered data from the prior run, so getting the same plots twice?
filtFs <- file.path(path, "filtered", paste0(sample.names, "_R1_filt.fastq.gz"))
filtRs <- file.path(path, "filtered", paste0(sample.names, "_R2_filt.fastq.gz"))
names(filtFs) <- sample.names
names(filtRs) <- sample.names
With MiSeq V3-V4, in a number of cases we've had to resort to using only R1 due to issues with poor R2 quality and/or poor merging results, especially with 2x250 reads.
I apoligize, gentlemen - I got this dataset with not much context and I am learning as I go, which often leads to mistakes from sheer lack of knowledge. I am often required to make decisions and I am never sure whether or not they are viable. I will attempt your recommendations now and will update you as soon as it is done.
I can also add that I was instructed to follow this tutorial (at least until I got all of it to work) but I will now try to integrate everything from this one as well.
Thank you so much for your input and help, it is invaluable.
Hello, again, gentlemen. After applying the changes you recommended, I believe the merging rate is too low, which leads me the conclusion that it is preferable to only use my forward reads, as @cjfields mentioned.
Here is the console output:
Merging Forward and Reverse Reads
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose=TRUE) 1219 paired-reads (in 14 unique pairings) successfully merged out of 606555 (in 984 pairings) input. 522 paired-reads (in 10 unique pairings) successfully merged out of 672822 (in 3882 pairings) input. 2989 paired-reads (in 20 unique pairings) successfully merged out of 517173 (in 2946 pairings) input. 530 paired-reads (in 13 unique pairings) successfully merged out of 700359 (in 3234 pairings) input. 17 paired-reads (in 2 unique pairings) successfully merged out of 662699 (in 3300 pairings) input. 12181 paired-reads (in 47 unique pairings) successfully merged out of 618513 (in 3381 pairings) input. 732 paired-reads (in 13 unique pairings) successfully merged out of 684288 (in 3316 pairings) input. 18 paired-reads (in 2 unique pairings) successfully merged out of 626960 (in 3273 pairings) input. 9477 paired-reads (in 44 unique pairings) successfully merged out of 560239 (in 1471 pairings) input. 9528 paired-reads (in 39 unique pairings) successfully merged out of 711136 (in 1451 pairings) input. 5491 paired-reads (in 16 unique pairings) successfully merged out of 599226 (in 1019 pairings) input. 1790 paired-reads (in 7 unique pairings) successfully merged out of 710098 (in 3356 pairings) input.
What exactly is the filterAndTrim
command you are using? The very low merging rates suggest that the truncation are too short.
I can also add that I was instructed to follow this tutorial (at least until I got all of it to work) but I will now try to integrate everything from this one as well.
I'll note that the verion 1.8 tutorial you linked is deprecated, the current version is here: https://benjjneb.github.io/dada2/tutorial.html
This is the command I ran this morning - attempting to use both the forward and reverse reads:
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(240,234), maxN=0, maxEE=c(2,2), trimLeft=c(17,21), rm.phix=TRUE, compress=TRUE, multithread=FALSE) # On Windows set multithread=FALSE
I was unsure about the truncLen numbers but I decided to not increase the initial value for the forward reads of 240.
Thank you for the link, I have used this one as well. I believe the one I followed until the assign taxonomy part was 1.16.
Huh, I don't understand why your merging would be so low then, as 474 total nts should be enough to merge. Maybe try increasing both truncation lenths by 5? But if that doesn't work, I concur with @cjfields about considering using just R1.
I am currently attempting to use the forward reads only but I will attempt increasing both truncation lengths by 5 once it is done! I will keep you posted.
The company report did not mention the low quality of the reverse reads, but using an automatized microbiome script (which used both dada2 and qiime2) on my trip to France, I was told that the reverse reads seemed unusable.
Again, I will give it one last try tonight and I will let you know!
Thank you so much!
Hello again, gentlemen. I have managed to get my code to work - thank you so much for your input! I have used the forward reads and the results are rather promising.
Glad to hear that it worked out @goodguynickpt !
Hey, everyone.
I am using DADA2 to analyse tick microbiome and I've come across a few roadblocks.
After trying everything to use both forward and reverse reads, trying to merge them and being unsuccessful (unless I trimmed with a quality score of 10 - which is less than ideal), I came to the conclusion that the reverse reads are just unusable and I've decided to proceed with the forward reads only.
Anyways, everything seemed fine until I ran the code to get the error rate plots - they look pretty awful. Here they are:
I have gone for a quality score of 30 when it comes to filtering and trimming, which seemed ideal.
Is there something I can/should do to fix this? I guess I could up the quality score threshold but literature says 30 is the standard so that is what I went for. My other question is why would my error rates would be so all over the place if I trimmed for a Q30 score - shouldn't the read quality be, by design, high and, therefore, have lower error rates? Or am I missing something?
I can also add that the V3 and V4 regions of the 16S rRNA gene of DNA samples were sequenced with high throughput NGS technology using the MiSeq platform (2x250 Paired Ends). And here are the quality score plots:
Forward reads
Reverse reads
Thank you for your help. João Lucas