benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Filter and trim error (dada2) #1890

Closed: goodguynickpt closed this issue 3 months ago

goodguynickpt commented 7 months ago

Hey, everyone. Hope everyone is doing ok. I am a master's student writing my thesis on the tick microbiome. While I am collecting ticks and doing some lab/field work, I am also using data sets from a PhD student to analyze a few things and add some extra content to my thesis.

Anyways, I've worked with R in the past, for around 6 months but I am a total noob nowadays.

I am basically self-taught, so I struggle with a myriad of simple things. I usually manage to solve most of them after a few hours, but my R script is now giving me an error message that has me stumped.

I am trying to apply the code from this tutorial (https://benjjneb.github.io/dada2/tutorial.html) to my dataset and get a few graphs out of it.

I am currently trying to run this code:

Step 0: Install Rtools (If you haven't already)

Download from https://cran.rstudio.com/bin/windows/Rtools/ and install

Step 1: Install the 'dada2' Package

install.packages("dada2")
library(dada2)

Step 2: File Paths (Adjusted according to your locations)

File Paths

pathF <- "C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/BURSA_tissues_16S_microbioma_infravec/ZIP Rbursa INFRAVEC/220620-Infravec2-8115/1/Raw_Data"
pathR <- pathF # Same as pathF since they are in the same directory
filtpath <- "C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/BURSA_tissues_16S_microbioma_infravec/ZIP Rbursa INFRAVEC/220620-Infravec2-8115/1/Raw_Data/filtered"

Step 3: Load Sample File Paths

fastqFs <- list.files(pathF, pattern="fastq.gz", full.names = TRUE)
fastqRs <- list.files(pathR, pattern="fastq.gz", full.names = TRUE)

Step 4: Inspect Read Quality

plotQualityProfile(fastqFs[1:2])
plotQualityProfile(fastqRs[1:2])

File Paths

pathF <- "C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/BURSA_tissues_16S_microbioma_infravec/ZIP Rbursa INFRAVEC/220620-Infravec2-8115/1/Raw_Data"
pathR <- pathF # Same as pathF since they are in the same directory
filtpath <- "C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/BURSA_tissues_16S_microbioma_infravec/ZIP Rbursa INFRAVEC/220620-Infravec2-8115/1/Raw_Data/filtered"

Adjusted filterAndTrim command

outF <- file.path(filtpath, "filtered_f.fastq.gz")
outR <- file.path(filtpath, "filtered_r.fastq.gz")
out <- filterAndTrim(fwd = fastqFs[1], filt = outF, rev = fastqRs[1], multithread = TRUE)

Step 6: Error Learning

errF <- learnErrors(out[[1]], multithread = TRUE)
errR <- learnErrors(out[[2]], multithread = TRUE)

Step 7: Sample Inference

dadaFs <- dada(out[[1]], err = errF, multithread = TRUE)
dadaRs <- dada(out[[2]], err = errR, multithread = TRUE)

Step 8: Merging

mergers <- mergePairs(dadaFs, filtpath, dadaRs, filtpath)

Step 9: Construct Sequence Table

seqtab <- makeSequenceTable(mergers)

Step 10: Remove Chimeras

seqtab.nochim <- removeBimeraDenovo(seqtab)

Unfortunately, I always get an error in the filter and trim command. Console output:

Step 0: Install Rtools (If you haven't already)

Download from https://cran.rstudio.com/bin/windows/Rtools/ and install

Step 1: Install the 'dada2' Package

install.packages("dada2")
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:

https://cran.rstudio.com/bin/windows/Rtools/
Warning in install.packages :
  package ‘dada2’ is in use and will not be installed

library(dada2)

Step 2: File Paths (Adjusted according to your locations)

File Paths

pathF <- "C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/BURSA_tissues_16S_microbioma_infravec/ZIP Rbursa INFRAVEC/220620-Infravec2-8115/1/Raw_Data"
pathR <- pathF # Same as pathF since they are in the same directory
filtpath <- "C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/BURSA_tissues_16S_microbioma_infravec/ZIP Rbursa INFRAVEC/220620-Infravec2-8115/1/Raw_Data/filtered"

Step 3: Load Sample File Paths

fastqFs <- list.files(pathF, pattern="fastq.gz", full.names = TRUE)
fastqRs <- list.files(pathR, pattern="fastq.gz", full.names = TRUE)

Step 4: Inspect Read Quality

plotQualityProfile(fastqFs[1:2])
Warning messages:
1: In .Internal(gc(verbose, reset, full)) :
  closing unused connection 4 (C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/BURSA_tissues_16S_microbioma_infravec/ZIP Rbursa INFRAVEC/220620-Infravec2-8115/1/Raw_Data/1_S1_R1_001.fastq.gz)
2: In .Internal(gc(verbose, reset, full)) :
  closing unused connection 3 (C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/BURSA_tissues_16S_microbioma_infravec/ZIP Rbursa INFRAVEC/220620-Infravec2-8115/1/Raw_Data/1_S1_R1_001.fastq.gz)
3: In serialize(data, node$con) :
  'package:stats' may not be available when loading
4: In serialize(data, node$con) :
  'package:stats' may not be available when loading
plotQualityProfile(fastqRs[1:2])
Warning messages:
1: In serialize(data, node$con) :
  'package:stats' may not be available when loading
2: In serialize(data, node$con) :
  'package:stats' may not be available when loading

File Paths

pathF <- "C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/BURSA_tissues_16S_microbioma_infravec/ZIP Rbursa INFRAVEC/220620-Infravec2-8115/1/Raw_Data"
pathR <- pathF # Same as pathF since they are in the same directory
filtpath <- "C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/BURSA_tissues_16S_microbioma_infravec/ZIP Rbursa INFRAVEC/220620-Infravec2-8115/1/Raw_Data/filtered"

Adjusted filterAndTrim command

outF <- file.path(filtpath, "filtered_f.fastq.gz")
outR <- file.path(filtpath, "filtered_r.fastq.gz")
out <- filterAndTrim(fwd = fastqFs[1], filt = outF, rev = fastqRs[1], multithread = TRUE)
Error in filterAndTrim(fwd = fastqFs[1], filt = outF, rev = fastqRs[1], :
  Output files for the reverse reads are required.

The filter and trim error changes randomly between the "output files for the reverse reads are required" above and

Proceed with filterAndTrim
> out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(240,160),
+     maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
+     compress=TRUE, multithread=TRUE)
Error in filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen = c(240, 160), :
  Every input file must have a corresponding output file.

I've tried pretty much everything I can think of and I've been stuck on this for weeks.

I was trying to run some code that was basically a copy-paste version of the tutorial I pasted above but it was not running either.

I would truly appreciate some help! Thank you.

[Image: Rplot bursa]

This is the only plot I can get from running the whole thing, before it comes crashing down.

benjjneb commented 7 months ago

It looks to me like your fastqFs and fastqRs are identical? What is the output of identical(fastqFs, fastqRs)? And what do your filtered filenames look like, i.e. what is the value of outF?

goodguynickpt commented 7 months ago

Hi! Sorry for the late reply, I was doing some field work! Here is the output!

identical(fastqFs, fastqRs)
[1] TRUE
outF
[1] "C:/Users/Lucas/OneDrive - Universidade de Lisboa/Desktop/BURSA_tissues_16S_microbioma_infravec/ZIP Rbursa INFRAVEC/220620-Infravec2-8115/1/Raw_Data/filtered/filtered_f.fastq.gz"

goodguynickpt commented 7 months ago

So, initially, I had these two folders: one with the raw data and one with FASTQC. [image]

These are in the FASTQC folder. [image]

The raw folder originally contained those zipped files and not the "filtered" folder you can see here (I created that in R while trying to process the raw data). [image]

Inside the filtered folder, you can see that not all files are present. They should be in pairs, from 1 to 12. Initially I only had 10, 11 and 12, and I managed, little by little, to get more filtered files. [image]

benjjneb commented 7 months ago

In your R code, when you define the fastq files you want to filter and trim, and the filtered filenames you want to give them, you aren't discriminating between the forward and reverse files.

fastqFs <- list.files(pathF, pattern="fastq.gz", full.names = TRUE)
fastqRs <- list.files(pathR, pattern="fastq.gz", full.names = TRUE)

These commands yield identical outputs, because pathF and pathR are the same directory and both calls use the same pattern, so each vector contains every R1 and R2 file.
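A minimal sketch of why this happens, using empty placeholder files in a temporary directory rather than the real data. The file names follow the `1_S1_R1_001.fastq.gz` pattern visible in the console output earlier in the thread; the `_R2_001` suffix for the reverse reads is an assumption based on the usual Illumina naming, and `demo_dir` is a hypothetical stand-in for `pathF`.

```r
# Placeholder directory standing in for pathF (hypothetical, not the real data).
demo_dir <- file.path(tempdir(), "raw_data_demo")
dir.create(demo_dir, showWarnings = FALSE)

# Empty files named like paired reads (assumed naming: _R1_ forward, _R2_ reverse).
demo_files <- c("1_S1_R1_001.fastq.gz", "1_S1_R2_001.fastq.gz",
                "2_S2_R1_001.fastq.gz", "2_S2_R2_001.fastq.gz")
invisible(file.create(file.path(demo_dir, demo_files)))

# Same directory and same pattern: both vectors list every file, R1 and R2 alike.
allFs <- list.files(demo_dir, pattern = "fastq.gz", full.names = TRUE)
allRs <- list.files(demo_dir, pattern = "fastq.gz", full.names = TRUE)
same_when_unfiltered <- identical(allFs, allRs)

# A pattern that includes the read direction separates the two sets.
fastqFs <- sort(list.files(demo_dir, pattern = "_R1_001\\.fastq\\.gz$", full.names = TRUE))
fastqRs <- sort(list.files(demo_dir, pattern = "_R2_001\\.fastq\\.gz$", full.names = TRUE))
same_when_split <- identical(fastqFs, fastqRs)

print(same_when_unfiltered)
print(same_when_split)
print(basename(fastqFs))
print(basename(fastqRs))
```

With the generic pattern the `identical()` check comes back TRUE, as in the output posted above; with direction-specific patterns it comes back FALSE and each vector holds only one read direction. If your reverse files are named differently, adjust the second pattern accordingly.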

Then when you define the filtpath you seem to be just creating a single filename for everything (I'm not even sure how you created multiple output files). Finally, when you run filterAndTrim you are filtering forward and reverse files independently, when that isn't how it works -- you need to filter them together.

I'd recommend going back to the dada2 tutorial with a couple additional bits about filterAndTrim: fastqFs needs to be a vector of the forward (R1) filenames. fastqRs needs to be a vector of the reverse (R2) filenames (in matched order). filtFs and filtRs need to be unique vectors of filenames where the filtered forward/reverse files will be stored. And filterAndTrim should be run just once, on both forward and reverse files together, not separately on each.
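The points above can be sketched roughly as follows. This is a sketch under stated assumptions, not a tested pipeline: the `_R1_001`/`_R2_001` suffixes are inferred from the file name in the console output, `raw_dir` and the `_F_filt`/`_R_filt` output names are illustrative, and the `truncLen`/`maxEE` values are simply the ones from the `filterAndTrim` call quoted earlier in this thread, so they should be chosen from your own quality profiles. To keep the block runnable without the real data it uses empty placeholder files, and the `filterAndTrim` call sits inside a hypothetical helper, `run_filter`, that is defined but not called here.

```r
# Placeholder input directory (hypothetical); point raw_dir at the real Raw_Data folder instead.
raw_dir <- file.path(tempdir(), "filter_demo")
dir.create(raw_dir, showWarnings = FALSE)
invisible(file.create(file.path(raw_dir, c(
  "1_S1_R1_001.fastq.gz", "1_S1_R2_001.fastq.gz",
  "2_S2_R1_001.fastq.gz", "2_S2_R2_001.fastq.gz"))))

# Forward (R1) and reverse (R2) inputs, sorted so the two vectors are in matched order.
fastqFs <- sort(list.files(raw_dir, pattern = "_R1_001\\.fastq\\.gz$", full.names = TRUE))
fastqRs <- sort(list.files(raw_dir, pattern = "_R2_001\\.fastq\\.gz$", full.names = TRUE))

# Sample names are whatever precedes the read-direction suffix, e.g. "1_S1".
sample_names   <- sub("_R1_001\\.fastq\\.gz$", "", basename(fastqFs))
sample_names_r <- sub("_R2_001\\.fastq\\.gz$", "", basename(fastqRs))

# One unique output file per input file, with forward and reverse kept apart.
filt_dir <- file.path(raw_dir, "filtered")
filtFs <- file.path(filt_dir, paste0(sample_names, "_F_filt.fastq.gz"))
filtRs <- file.path(filt_dir, paste0(sample_names, "_R_filt.fastq.gz"))

# Sanity checks before filtering: paired, matched, and no shared output names.
stopifnot(
  length(fastqFs) > 0,
  length(fastqFs) == length(fastqRs),
  identical(sample_names, sample_names_r),
  !identical(fastqFs, fastqRs),
  anyDuplicated(c(filtFs, filtRs)) == 0
)

# Hypothetical helper: a single paired call covering all samples at once.
# Parameter values are copied from the call quoted earlier in the thread.
run_filter <- function(fwd, filt, rev, filt.rev) {
  dada2::filterAndTrim(fwd, filt, rev, filt.rev,
                       truncLen = c(240, 160), maxN = 0, maxEE = c(2, 2),
                       truncQ = 2, rm.phix = TRUE, compress = TRUE,
                       multithread = FALSE)
}
# On the real files: out <- run_filter(fastqFs, filtFs, fastqRs, filtRs)
```

The fourth argument (`filt.rev`) is the one missing from `filterAndTrim(fwd = fastqFs[1], filt = outF, rev = fastqRs[1], ...)`, which is consistent with the "Output files for the reverse reads are required" error. `multithread = FALSE` is used here because the tutorial's own comment suggests that setting on Windows.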

goodguynickpt commented 6 months ago

Thank you! I will go back to the tutorial and try to fix it!