benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

DADA2 Error Rate Plots #1963

Closed goodguynickpt closed 3 months ago

goodguynickpt commented 4 months ago

Hey, everyone.

I am using DADA2 to analyse tick microbiome and I've come across a few roadblocks.

After trying everything to merge the forward and reverse reads and being unsuccessful (unless I trimmed with a quality score of 10, which is less than ideal), I came to the conclusion that the reverse reads are just unusable, and I've decided to proceed with the forward reads only.

Anyways, everything seemed fine until I ran the code to get the error rate plots - they look pretty awful. Here they are: Rplot - Error Rates Plot Q30

I have gone for a quality score of 30 when it comes to filtering and trimming, which seemed ideal.

Is there something I can/should do to fix this? I guess I could raise the quality score threshold, but the literature says 30 is the standard, so that is what I went for. My other question is why my error rates would be so all over the place if I trimmed for a Q30 score - shouldn't the read quality be, by design, high and therefore have lower error rates? Or am I missing something?

I can also add that the V3 and V4 regions of the 16S rRNA gene of DNA samples were sequenced with high throughput NGS technology using the MiSeq platform (2x250 Paired Ends). And here are the quality score plots:

Forward reads Rplot - FQualityScores

Reverse reads Rplot - RQualityScores

Thank you for your help. João Lucas

cjfields commented 4 months ago

@goodguynickpt have you tried running these on your original raw data (i.e., before any filtering or trimming)? Same for your reverse reads. The error profiles are missing all values below Q30, which is likely the issue.

The general consensus I believe is to perform standard primer sequence trimming, but to generally avoid quality trimming in favor of filtering based on expected errors.
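In code, that general approach looks roughly like this (a sketch only, assuming fnFs/fnRs and filtFs/filtRs are the raw and filtered file paths as in the tutorial; the trimLeft values are placeholders that must match your actual primer lengths):

fwd_primer_len <- 17  # placeholder: length of your forward primer
rev_primer_len <- 21  # placeholder: length of your reverse primer
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     trimLeft=c(fwd_primer_len, rev_primer_len), # remove primers from the start of each read
                     maxN=0, maxEE=c(2, 2),                      # expected-error filtering instead of truncQ
                     rm.phix=TRUE, compress=TRUE, multithread=FALSE)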

goodguynickpt commented 4 months ago

Hello @cjfields, thank you for your reply! I was following the DADA2 tutorial I found online, so I assumed the error rate plots came right after the filter and trim step.

I agree with you on the Q30 part, because it doesn't make sense to trim to a high QScore and then check the expected errors of that very same filtered data. It should be the raw data.

I ran it again using the raw data (pretty much going against the tutorial) and the result is almost the same (not sure I understand why). Rplot - raw data error rates

cjfields commented 4 months ago

@goodguynickpt sorry, you are correct, I was thinking of the read quality plots and not the error profiles; I needed more caffeine before I answered initially.

But even so, it's still very strange; your read quality plots look like they have a full range of quality scores, but the error profiles in either case suggest few reads and a skewed quality score distribution (nothing below Q30). How are you running the initial trimming steps (code)?

goodguynickpt commented 4 months ago

So, initially, I was running this code (and it "worked", but only because I used a QScore of 10 while trimming just so I could merge the reads, so it was not really viable):

# Package names
packages <- c("ggplot2", "dada2")

# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

# Packages loading
invisible(lapply(packages, library, character.only = TRUE))

# -------------------------------
# 3. Filter and Trim
# -------------------------------

# Define working directory
setwd("C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/BURSA_tissues_16S_microbioma_infravec/ZIP Rbursa INFRAVEC/220620-Infravec2-8115/1/Raw_Data")

# CHANGE ME to the directory containing the fastq files after unzipping.
path = "C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/BURSA_tissues_16S_microbioma_infravec/ZIP Rbursa INFRAVEC/220620-Infravec2-8115/1/Raw_Data"
list.files(path)

# Forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq and SAMPLENAME_R2_001.fastq
fnFs <- sort(list.files(path, pattern="_R1_001.fastq", full.names = TRUE))
fnRs <- sort(list.files(path, pattern="_R2_001.fastq", full.names = TRUE))

# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1); sample.names

plotQualityProfile(fnFs[1:12])
plotQualityProfile(fnRs[1:12])

# Place filtered files in filtered/ subdirectory
filtFs <- file.path(path, "filtered", paste0(sample.names, "_R1_filt.fastq.gz"))
filtRs <- file.path(path, "filtered", paste0(sample.names, "_R2_filt.fastq.gz"))
names(filtFs) <- sample.names
names(filtRs) <- sample.names

# Comment after doing it once
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     maxN=0, maxEE=c(2,5), truncQ=10, rm.phix=TRUE,
                     compress=TRUE, multithread=FALSE) # On Windows set multithread=FALSE

# This should be run after filterAndTrim has been commented out
out_filtpath <- "C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/BURSA_tissues_16S_microbioma_infravec/ZIP Rbursa INFRAVEC/220620-Infravec2-8115/1/Raw_Data/filtered"
out <- sort(list.files(out_filtpath, pattern="_filt.fastq.gz", full.names = TRUE))
out

# Learn error rates
errF <- learnErrors(filtFs, multithread=TRUE)
# 139089237 total bases in 577362 reads from 1 samples will be used for learning the error rates.

errR <- learnErrors(filtRs, multithread=TRUE)
# 144720498 total bases in 577362 reads from 1 samples will be used for learning the error rates.

# Sanity check: visualize error rates
plotErrors(errF, nominalQ=TRUE)
plotErrors(errR, nominalQ=TRUE)

dadaFs <- dada(filtFs, err=errF, multithread=TRUE)
dadaRs <- dada(filtRs, err=errR, multithread=TRUE)
dadaFs[[1:4]]
dadaRs[[1:4]]

# Merging forward and reverse reads
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose=TRUE)

# Inspecting merged sequences
head(mergers[[1]])

# Construct sequence table (ASV table)
seqtab <- makeSequenceTable(mergers)

# Inspect dimensions
dim(seqtab) # Check the number of samples and ASVs

# Inspect distribution of sequence lengths
table(nchar(getSequences(seqtab)))

# Optional filtering based on length
target_length <- 465 # Can be adjusted to expected length
seqtab2 <- seqtab[, nchar(colnames(seqtab)) %in% (target_length - 25):(target_length + 1)] # Adjust tolerance as needed

# Remove chimeras
seqtab2.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)

# Inspect dimensions of filtered table
dim(seqtab2.nochim)

# Calculate proportion of sequences retained
sum(seqtab2.nochim)/sum(seqtab)

getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab2.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
head(track)

# Load necessary library for assigning taxonomy
library(dada2)

# Define the path to the SILVA taxonomy file
taxonomy_path <- "C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/BURSA_tissues_16S_microbioma_infravec/ZIP Rbursa INFRAVEC/220620-Infravec2-8115/1/Raw_Data"
ref_file <- file.path(taxonomy_path, "silva_nr99_v138.1_train_set.fa.gz")

# Check if the file exists to avoid errors during execution
if (!file.exists(ref_file)) {
  stop("Reference file not found. Check the path and filename.")
}

# Run assignTaxonomy
taxa <- assignTaxonomy(seqtab2.nochim, ref_file, multithread = TRUE)

taxa.print <- taxa # Removing sequence rownames for display only
rownames(taxa.print) <- NULL
head(taxa.print)
taxa.print

# IT WORKS UNTIL THIS POINT

# Evaluate accuracy (mock community)
unqs.mock <- seqtab2.nochim["Mock",]
unqs.mock <- sort(unqs.mock[unqs.mock>0], decreasing=TRUE) # Drop ASVs absent in the Mock
cat("DADA2 inferred", length(unqs.mock), "sample sequences present in the Mock community.\n")
mock.ref <- getSequences(file.path(path, "HMP_MOCK.v35.fasta"))
match.ref <- sum(sapply(names(unqs.mock), function(x) any(grepl(x, mock.ref))))
cat("Of those,", sum(match.ref), "were exact matches to the expected reference sequences.\n")

# Check if phyloseq is already installed
# (note: phyloseq is a Bioconductor package, installed with BiocManager::install("phyloseq"), not from CRAN)
if (!require(phyloseq)) { install.packages("phyloseq") }
library(phyloseq)

# Optional: include ggplot2 for future visualizations
library(ggplot2)

library(readr)
OTU_table <- read.csv("OTU_table.csv")
View(OTU_table)

samples.out <- rownames(seqtab2.nochim); samples.out

# Import metadata (sample.id in rows, make sure the order by sample.id is the same) ====
metadata <- read.csv("tick.csv", header=T)
metadata

# Why on earth do I have to do this step? italian hand
samples.out <- rownames(seqtab2.nochim); samples.out
rownames(metadata) <- samples.out; metadata

ps <- phyloseq(otu_table(seqtab2.nochim, taxa_are_rows=FALSE), sample_data(metadata), tax_table(taxa))

# Maybe one day :)
taxa <- addSpecies(taxa, "rdp_species_assignment_14.fa.gz")

# PLOTS!

# Alpha diversity
plot_richness(ps, x="Infection", measures=c("Shannon", "Simpson"), color="Tissue")

# Transform data to proportions as appropriate for Bray-Curtis distances
ps.prop <- transform_sample_counts(ps, function(otu) otu/sum(otu))
ord.nmds.bray <- ordinate(ps.prop, method="NMDS", distance="bray")

# Bray-Curtis: Tissue
plot_ordination(ps.prop, ord.nmds.bray, color="Tissue", title="Bray NMDS")

# Bray-Curtis: Infection
plot_ordination(ps.prop, ord.nmds.bray, color = "Infection", title = "Bray NMDS") +
  geom_text(aes(label = samples.out), hjust = -1.5, vjust = 0.5, size = 3)

# BAR PLOT: Tissue
top <- names(sort(taxa_sums(ps), decreasing=TRUE))[1:600]
ps.top <- transform_sample_counts(ps, function(OTU) OTU/sum(OTU))
ps.top <- prune_taxa(top, ps.top)
plot_bar(ps.top, x="Tissue", fill="Family") +
  geom_bar(stat="identity", width=0.8, color=NA) +
  geom_text(aes(label = Abundance),                 # Uncomment these 3 lines for abundance
            position = position_stack(vjust = 0.5), # labels on plot
            size=3) +
  facet_wrap(~Infection, scales="free_x")

After deciding to just go with the forward reads, I am currently running this:

# Package names
packages <- c("ggplot2", "dada2")

# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

# Packages loading
invisible(lapply(packages, library, character.only = TRUE))

# -------------------------------
# 1. Define Working Directory
# -------------------------------

# Set working directory (replace with your actual directory)
setwd("C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/1/Raw_Data") # Updated path

# -------------------------------
# 2. Specify File Paths (replace with your actual paths)
# -------------------------------

# Define working directory
setwd("C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/1/Raw_Data")

# CHANGE ME to the directory containing the fastq files after unzipping.
path = "C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/1/Raw_Data"
list.files(path)

# Forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq and SAMPLENAME_R2_001.fastq
fnFs <- sort(list.files(path, pattern="_R1_001.fastq", full.names = TRUE))
fnRs <- sort(list.files(path, pattern="_R2_001.fastq", full.names = TRUE))

# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1); sample.names

plotQualityProfile(fnFs[1:12])
plotQualityProfile(fnRs[1:12])

# Assuming filenames have format "SAMPLENAME_R1_001.fastq" for forward reads (unfiltered)
raw_forward_filenames <- sort(list.files(path, pattern="_R1_001.fastq", full.names = TRUE))

# Learn error rates from the raw forward reads
errF <- learnErrors(raw_forward_filenames, multithread=TRUE)

# Error rates for the forward reads
plotErrors(errF, nominalQ=TRUE)

# -------------------------------
# 3. Filter and Trim Forward Reads
# -------------------------------

# Place filtered files in filtered/ subdirectory
filtFs <- file.path(path, "filtered", paste0(sample.names, "_R1_filt.fastq.gz"))
filtRs <- file.path(path, "filtered", paste0(sample.names, "_R2_filt.fastq.gz"))
names(filtFs) <- sample.names
names(filtRs) <- sample.names

# Assuming filenames have format "SAMPLENAME_R1_001.fastq" for forward reads
raw_forward_filenames <- sort(list.files(path, pattern="_R1_001.fastq", full.names = TRUE))

forward_filenames <- filterAndTrim(raw_forward_filenames, # Input raw fastq filenames
                                   filtered_reads_path,   # Output directory for filtered reads
                                   truncQ=30,             # Truncate each read at the first base with quality score <= 30
                                   compress=TRUE,         # Compress output files
                                   multithread=FALSE)     # Set to FALSE for Windows

# If filter and trim is done, run this instead
out_filtpath <- "C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/1/Raw_Data/filtered"
out <- sort(list.files(out_filtpath, pattern="_R1_001.fastq.gz", full.names = TRUE))
out

# Assuming filenames have format "SAMPLENAME_R1_filt.fastq.gz" in the filtered directory
forward_filenames <- sort(list.files(out_filtpath, pattern="_R1_001.fastq.gz", full.names = TRUE))

# -------------------------------
# 4. Learn Error Rates for Forward Reads
# -------------------------------

# Learn error rates from the filtered forward reads
errF <- learnErrors(forward_filenames, multithread=TRUE)
# 149790575 total bases in 913111 reads from 2 samples will be used for learning the error rates.

# Error rates for the forward reads
plotErrors(errF, nominalQ=TRUE)

# -------------------------------
# 5. Denoising with dada2 for Forward Reads
# -------------------------------

# Denoise the forward reads using dada2 (takes a while)
dadaFs <- dada(forward_filenames, err=errF, multithread=TRUE)

# Inspect the first few sequences (optional)
dadaFs[[1:4]]

# Construct sequence table (ASV table) for forward reads
seqtab <- makeSequenceTable(dadaFs)

# Inspect dimensions
dim(seqtab) # Check the number of samples and ASVs

# Inspect distribution of sequence lengths (optional)
table(nchar(getSequences(seqtab)))

# Optional filtering based on length
target_length <- 250 # Can be adjusted to expected length
seqtab2 <- seqtab[, nchar(colnames(seqtab)) %in% (target_length - 27):(target_length + 1)] # Adjust tolerance as needed

# Remove chimeras
seqtab2.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)

# Inspect dimensions of filtered table
dim(seqtab2.nochim)

# Calculate proportion of sequences retained
sum(seqtab2.nochim)/sum(seqtab)

This whole thing is a mess X____X

getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), rowSums(seqtab2.nochim))
colnames(track) <- c("input", "filtered", "dadaFs", "denoisedF", "nonchim")
rownames(track) <- sample.names
head(track)

# Assuming filenames have format "SAMPLENAME_R1_001.fastq.gz" in the filtered directory
all_files <- sort(list.files("C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/1/Raw_Data/filtered", pattern="_R1_001.fastq.gz"))
filenames <- gsub("\\.*$", "", all_files) # Intended to extract sample names (note: this regex only removes trailing dots)
read_counts <- sapply(all_files, function(x) sum(grepl(pattern = "[ATCG]", x))) # Note: this matches against the filename string, not the reads in the file

# Combine filenames and read counts
out <- data.frame(filenames, read_counts)

track <- cbind(out, sapply(dadaFs, getN), rowSums(seqtab2.nochim))
colnames(track) <- c("SampleID", "Input", "Filtered", "Denoised")
rownames(track) <- out$filenames # Use filenames as row names
print(track)

# Load necessary library for assigning taxonomy
library(dada2)

# Define the path to the SILVA taxonomy file
taxonomy_path <- "C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/1/Raw_Data"
ref_file <- file.path(taxonomy_path, "silva_nr99_v138.1_train_set.fa.gz")

# Check if the file exists to avoid errors during execution
if (!file.exists(ref_file)) {
  stop("Reference file not found. Check the path and filename.")
}

# Run assignTaxonomy
taxa <- assignTaxonomy(seqtab2.nochim, ref_file, multithread = TRUE)

taxa.print <- taxa # Removing sequence rownames for display only
rownames(taxa.print) <- NULL
head(taxa.print)
taxa.print

# IT WORKS UNTIL THIS POINT

# Evaluate accuracy (mock community)
unqs.mock <- seqtab2.nochim["Mock",]
unqs.mock <- sort(unqs.mock[unqs.mock>0], decreasing=TRUE) # Drop ASVs absent in the Mock
cat("DADA2 inferred", length(unqs.mock), "sample sequences present in the Mock community.\n")
mock.ref <- getSequences(file.path(path, "HMP_MOCK.v35.fasta"))
match.ref <- sum(sapply(names(unqs.mock), function(x) any(grepl(x, mock.ref))))
cat("Of those,", sum(match.ref), "were exact matches to the expected reference sequences.\n")

# Check if phyloseq is already installed
# (note: phyloseq is a Bioconductor package, installed with BiocManager::install("phyloseq"), not from CRAN)
if (!require(phyloseq)) { install.packages("phyloseq") }
library(phyloseq)

# Optional: include ggplot2 for future visualizations
library(ggplot2)

library(readr)
OTU_table <- read.csv("OTU_table.csv")
View(OTU_table)

samples.out <- rownames(seqtab2.nochim); samples.out

# Import metadata (sample.id in rows, make sure the order by sample.id is the same) ====
metadata <- read.csv("tick.csv", header=T)
metadata

# I don't understand the need to do this again but here we are
samples.out <- rownames(seqtab2.nochim); samples.out
rownames(metadata) <- samples.out; metadata

ps <- phyloseq(otu_table(seqtab2.nochim, taxa_are_rows=FALSE), sample_data(metadata), tax_table(taxa))

# Maybe one day :) again
taxa <- addSpecies(taxa, "rdp_species_assignment_14.fa.gz")

# PLOTS!

# Alpha diversity
plot_richness(ps, x="Infection", measures=c("Shannon", "Simpson"), color="Tissue")

# Transform data to proportions as appropriate for Bray-Curtis distances
ps.prop <- transform_sample_counts(ps, function(otu) otu/sum(otu))
ord.nmds.bray <- ordinate(ps.prop, method="NMDS", distance="bray")

# Bray-Curtis: Tissue
plot_ordination(ps.prop, ord.nmds.bray, color="Tissue", title="Bray NMDS")

# Bray-Curtis: Infection
plot_ordination(ps.prop, ord.nmds.bray, color = "Infection", title = "Bray NMDS") +
  geom_text(aes(label = samples.out), hjust = -1.5, vjust = 0.5, size = 3)

# BAR PLOT: Tissue
top <- names(sort(taxa_sums(ps), decreasing=TRUE))[1:4470]
ps.top <- transform_sample_counts(ps, function(OTU) OTU/sum(OTU))
ps.top <- prune_taxa(top, ps.top)
plot_bar(ps.top, x="Tissue", fill="Family") +
  geom_bar(stat="identity", width=0.8, color=NA) +
  geom_text(aes(label = Abundance),                 # Uncomment these 3 lines for abundance
            position = position_stack(vjust = 0.5), # labels on plot
            size=3) +
  facet_wrap(~Infection, scales="free_x")

top <- names(sort(taxa_sums(ps), decreasing=TRUE))[1:4470]
ps.top <- transform_sample_counts(ps, function(OTU) OTU/sum(OTU))
ps.top <- prune_taxa(top, ps.top)

# Generate 72 distinct colors from the "viridis" palette
library(viridis) # needed for viridis()
colors <- viridis(n = 72, option = "cividis") # Adjust option if desired
plot_bar(ps.top, x = "Tissue", fill = "Family", color = colors) +
  geom_bar(stat = "identity", width = 0.8, color = NA) +

plot_bar(ps.top, x="Tissue", fill="Family") +
  geom_bar(stat="identity", width=0.8, color=NA) +
  geom_text(aes(label = Abundance),  # Uncomment these 3 lines for abundance

I apologize for the lack of organization of my script. I do want to tidy it up a bit, but I am currently abroad and just trying to get it to work so I can compare it to my qiime2 analysis (my thesis is basically creating a generalized pipeline with dada2 for further projects).

goodguynickpt commented 4 months ago

I can also add that I initially tried trimming by length and by quality score, but I could never merge properly - hence the decision to trim with a quality score of 30, keep the forward reads, and discard the reverse reads.

cjfields commented 4 months ago

I can't see anything obvious apart from that odd error profile with nothing below Q30. It might not be easy to diagnose w/o seeing example data. Do you see the same issue using 1-2 samples versus all of them? And is it expected that you have 500,000 reads in one of your samples?

benjjneb commented 4 months ago

Both of your error rate plots look fine, just not typical because you have previously truncated all reads at Q30. The second set of error plots is identical to the first.

We do not recommend using truncQ at all. maxEE is the recommended way to control for overall read quality. The problems you previously had with merging and chimera removal are probably because you aren't removing your primers (which usually occur at the start of the reads with V3V4 sequencing on Illumina), and you may also have been truncating your reads too short to merge: the V3V4 amplicon with primer sequences is 440-462 nt long, so the two truncation lengths must add up to at least 462 + 12 nt to get proper merging.

So I think your problems are likely to be resolved just by fixing the filterAndTrim step: add trimLeft=c(17,21) (assuming you are using the "Illumina" V3V4 sequencing protocol), remove truncQ, include maxEE, and pick truncLen=c(TRUNC_F, TRUNC_R) so that TRUNC_F + TRUNC_R >= 474.
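For example, a sketch under those assumptions (choose the actual truncation lengths from your quality profiles, keeping TRUNC_F + TRUNC_R >= 474):

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     trimLeft=c(17, 21),    # remove the V3V4 primers ("Illumina" protocol assumed)
                     truncLen=c(245, 235),  # placeholder lengths: 245 + 235 = 480 >= 474
                     maxN=0, maxEE=c(2, 2), rm.phix=TRUE,
                     compress=TRUE, multithread=FALSE)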

The dada2 tutorial covers much of this.

cjfields commented 4 months ago

Ah yes, I missed that the primers weren't removed in filterAndTrim!

I'd also suggest doing a clean run from scratch (remove data from any prior runs). The input data is 2x250 based on the plotQualityProfile output (which doesn't look pre-filtered/trimmed), and that is read from the original data path. But maybe this step is picking up the original Q30-filtered data from the prior run, so you're getting the same plots twice?

filtFs <- file.path(path, "filtered", paste0(sample.names, "_R1_filt.fastq.gz"))
filtRs <- file.path(path, "filtered", paste0(sample.names, "_R2_filt.fastq.gz"))
names(filtFs) <- sample.names
names(filtRs) <- sample.names
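A minimal sketch of that clean-up, assuming the filtered/ subdirectory only holds output from earlier runs:

# Delete stale filtered output from previous runs before re-running filterAndTrim
unlink(file.path(path, "filtered"), recursive = TRUE)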

With MiSeq V3-V4, in a number of cases we've had to resort to using only R1 due to issues with poor R2 quality and/or poor merging results, especially with 2x250 reads.
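A minimal forward-only sketch of that fallback (standard dada2 calls; the file naming and the primer/truncation lengths are placeholders to adapt):

filtFs <- file.path(path, "filtered", paste0(sample.names, "_R1_filt.fastq.gz"))
out <- filterAndTrim(fnFs, filtFs,              # forward reads only
                     trimLeft=17, truncLen=230, # placeholder primer length and truncation
                     maxN=0, maxEE=2, rm.phix=TRUE,
                     compress=TRUE, multithread=FALSE)
errF <- learnErrors(filtFs, multithread=TRUE)
dadaFs <- dada(filtFs, err=errF, multithread=TRUE)
seqtab <- makeSequenceTable(dadaFs)
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE)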

goodguynickpt commented 4 months ago

I apologize, gentlemen - I got this dataset without much context and I am learning as I go, which often leads to mistakes from sheer lack of knowledge. I am often required to make decisions and I am never sure whether or not they are viable. I will attempt your recommendations now and will update you as soon as it is done.

I can also add that I was instructed to follow this tutorial (at least until I got all of it to work) but I will now try to integrate everything from this one as well.

Thank you so much for your input and help, it is invaluable.

goodguynickpt commented 4 months ago

Hello again, gentlemen. After applying the changes you recommended, I believe the merging rate is still too low, which leads me to the conclusion that it is preferable to use only my forward reads, as @cjfields mentioned.

Here is the console output:

# Merging forward and reverse reads
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose=TRUE)
1219 paired-reads (in 14 unique pairings) successfully merged out of 606555 (in 984 pairings) input.
522 paired-reads (in 10 unique pairings) successfully merged out of 672822 (in 3882 pairings) input.
2989 paired-reads (in 20 unique pairings) successfully merged out of 517173 (in 2946 pairings) input.
530 paired-reads (in 13 unique pairings) successfully merged out of 700359 (in 3234 pairings) input.
17 paired-reads (in 2 unique pairings) successfully merged out of 662699 (in 3300 pairings) input.
12181 paired-reads (in 47 unique pairings) successfully merged out of 618513 (in 3381 pairings) input.
732 paired-reads (in 13 unique pairings) successfully merged out of 684288 (in 3316 pairings) input.
18 paired-reads (in 2 unique pairings) successfully merged out of 626960 (in 3273 pairings) input.
9477 paired-reads (in 44 unique pairings) successfully merged out of 560239 (in 1471 pairings) input.
9528 paired-reads (in 39 unique pairings) successfully merged out of 711136 (in 1451 pairings) input.
5491 paired-reads (in 16 unique pairings) successfully merged out of 599226 (in 1019 pairings) input.
1790 paired-reads (in 7 unique pairings) successfully merged out of 710098 (in 3356 pairings) input.

benjjneb commented 4 months ago

What exactly is the filterAndTrim command you are using? The very low merging rates suggest that the truncation lengths are too short.

> I can also add that I was instructed to follow this tutorial (at least until I got all of it to work) but I will now try to integrate everything from this one as well.

I'll note that the version 1.8 tutorial you linked is deprecated; the current version is here: https://benjjneb.github.io/dada2/tutorial.html

https://github.com/benjjneb/dada2/issues/1960

goodguynickpt commented 4 months ago

This is the command I ran this morning - attempting to use both the forward and reverse reads:

# Comment these lines in case there is no need to filter and trim again
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(240,234),
                     maxN=0, maxEE=c(2,2), trimLeft=c(17,21), rm.phix=TRUE,
                     compress=TRUE, multithread=FALSE) # On Windows set multithread=FALSE

I was unsure about the truncLen numbers, but I decided not to go above the initial value of 240 for the forward reads.

Thank you for the link, I have used this one as well. I believe the one I followed until the assign taxonomy part was 1.16.

benjjneb commented 4 months ago

Huh, I don't understand why your merging would be so low then, as 474 total nts should be enough to merge. Maybe try increasing both truncation lengths by 5? But if that doesn't work, I concur with @cjfields about considering using just R1.
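Concretely, a sketch based on the command you posted above:

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(245,239), # previous 240/234 plus 5 nt each
                     maxN=0, maxEE=c(2,2), trimLeft=c(17,21), rm.phix=TRUE,
                     compress=TRUE, multithread=FALSE)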

goodguynickpt commented 4 months ago

I am currently attempting to use the forward reads only but I will attempt increasing both truncation lengths by 5 once it is done! I will keep you posted.

The company report did not mention the low quality of the reverse reads, but when I used an automated microbiome script (which used both dada2 and qiime2) on my trip to France, I was told that the reverse reads seemed unusable.

Again, I will give it one last try tonight and I will let you know!

Thank you so much!

goodguynickpt commented 3 months ago

Hello again, gentlemen. I have managed to get my code to work - thank you so much for your input! I have used the forward reads and the results are rather promising.

cjfields commented 3 months ago

Glad to hear that it worked out @goodguynickpt !