benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
470 stars 142 forks source link

preparation steps to analyse ITS data #671

Closed magdawutkowska closed 5 years ago

magdawutkowska commented 5 years ago

I am trying dada2 for the first time and I am aiming to analyse ITS2 fungal sequences. I have demultiplexed my dataset first and followed initial steps of dada2 (https://benjjneb.github.io/dada2/ITS_workflow.html). In the end of the identify primers part of the pipeline in the table that gathers info about Forward Complement Reverse RevComp primers' sequences my table looks differently that one in the example. The forward primer and RevComp are both found in all 4 FWD.ForwardReads, FWD.ReverseReads, REV.ForwardReads and REV.ReverseReads. Thus I am not sure if I should reorient them just after demultiplexing, and then go for dada2 or can I reorient them in dada2? Thanks in advance for any ideas!

benjjneb commented 5 years ago

Is it possible your sequencing libraries are in both orientations (i.e. 5' -> 3' and 3' -> 5').

afmbafmb commented 5 years ago

Hello, sorry if I post my question in the wrong thread. I am very new to bioinformatics. Right now I'm trying to analyse my data using DADA2 ITS Pipeline. My only reason for using this instead of DADA2 Pipeline is so that I can use cutadapt in R (I'm not familiar with running cutadapt in python). I will continue with the usual DADA2 Pipeline after removing the primers.

I have run the pipeline up to primer removal step and got the output for my first sample as

Forward Complement Reverse RevComp FWD.ForwardReads 0 0 0 0 FWD.ReverseReads 202 0 0 0 REV.ForwardReads 0 0 0 0 REV.ReverseReads 0 0 0 1

My initial output with filtN sample before primer removal for the same sample was

Forward Complement Reverse RevComp FWD.ForwardReads 128137 0 0 0 FWD.ReverseReads 202 0 0 413
REV.ForwardReads 296 0 0 146 REV.ReverseReads 109811 0 0 1

This is different from the one shown in the ITS Pipeline workflow. Should I amend the code so that the Fwd. ReverseReads and Rev.ReverseReads be 0 as well? I am not sure if I am missing something here. Hope you can suggest the workaround.

Here are the details of my data: Illumina Miseq 2X300 paired end sequencing Sequence V6-V8 region Initial R1 and R2 fastq file with Illumina barcode and adapter removed primer set 926F (AAACTYAAAKGAATTGRCGG) 1392R (ACGGGCGGTGWGTRC)

Thanks for your time!

magdawutkowska commented 5 years ago

Is it possible your sequencing libraries are in both orientations (i.e. 5' -> 3' and 3' -> 5').

Thank you for your response! I think you are right. Is it possible to reorder the reads within dada2 pipeline or should I do that before after demultiplexing?

benjjneb commented 5 years ago

Is it possible to reorder the reads within dada2 pipeline or should I do that before after demultiplexing?

There isn't a full-fledged re-orientation solution in the dada2 package at this point. So if you can do that before (or after) demultiplexing, but before you start the dada2 pipeline, it would probably be best.

benjjneb commented 5 years ago

@afmbafmb Can you open a new issue (thread) on your problem?

magdawutkowska commented 5 years ago

Thank you @benjjneb!

magdawutkowska commented 5 years ago

Hi again! I am still a bit confused. The easiest way to redirect the PE reads for me is to merge them first (i.e. by using PEAR) and then reorient them using fgrep or similar tools. I know though that the core thing with dada2 is to process R1 and R2 reads separately, and then when they are 'resolved' merge them in one of the last steps.

This is an initial primer table from my subsampled data:

Forward Complement Reverse RevComp FWD.ForwardReads 113064 0 0 24300 FWD.ReverseReads 106609 0 0 15413 REV.ForwardReads 111728 0 0 24512 REV.ReverseReads 108476 0 0 14061

Whereas in your example:

Forward Complement Reverse RevComp FWD.ForwardReads 4214 0 0 0 FWD.ReverseReads 0 0 0 3743 REV.ForwardReads 0 0 0 3590 REV.ReverseReads 4200 0 0 0

And yes, my data consist of a mixture of amplicons starting with both forward and reverse primer, each ~50% in R1 and R2 files.

I found one solution, which is 'orient' function in usearch (https://www.drive5.com/usearch/manual/cmd_orient.html), which would align all the sequences in 5'-> 3' according to an attached database. Do you maybe have another, more straightforward suggestion of how to reorient them (I just want to make sure that I start dada2 with matching data)?

benjjneb commented 5 years ago

The easiest solution, and which does the least to break DADA2 assumptions, is to just use the data that is in the expected orientation.

nitschkematthew commented 5 years ago

I think I encountered a similar problem earlier in the ITS2 tutorial. I noticed that my R1 and R2 files each have forward and reverse reads in them at the cutadapt, demultiplexing stage. See description of the problem and workaround here <-

hlfreund commented 4 years ago

Hi Dr. Callahan! I have a quick, sort of stupid, question for you and I figured I would ask it here. I am running R1 ITS1 reads, and was wondering if I need to follow the DADA2 ITS Pipeline Workflow (1.8) first before following the normal DADA2 tutorial (in order to strip the primers).

I was looking through the other issues to try and find the answer to this question, but I couldn't seem to find anything with a clear answer.

Thank you for making such a fantastic pipeline and thank you for handling every issue directly!!

benjjneb commented 4 years ago

I am running R1 ITS1 reads, and was wondering if I need to follow the DADA2 ITS Pipeline Workflow (1.8) first before following the normal DADA2 tutorial (in order to strip the primers).

I would recommend following the ITS tutorial through to checking for the presence of the FWD and REV primers on your R1 reads.

If the REV primer is not on your reads, you can just go back to the simpler normal DADA2 tutorial. If the REV primer is on your reads, you will need to use the ITS/cutadapt tutorial that can remove primers at both the start and later in reads.

hlfreund commented 4 years ago

I am running R1 ITS1 reads, and was wondering if I need to follow the DADA2 ITS Pipeline Workflow (1.8) first before following the normal DADA2 tutorial (in order to strip the primers).

I would recommend following the ITS tutorial through to checking for the presence of the FWD and REV primers on your R1 reads.

If the REV primer is not on your reads, you can just go back to the simpler normal DADA2 tutorial. If the REV primer is on your reads, you will need to use the ITS/cutadapt tutorial that can remove primers at both the start and later in reads.

Hi Dr. Callahan, I started running my reads through the ITS tutorial as you suggested, and here is my output: image

Clearly the F primers have been removed, but I am seeing the reverse compliment of my REV primer. I am wondering if this is a fluke due to the nature and randomness of the reads, or if it's actually identifying the reversed REV primer in my reads. What are your thoughts? Thank you again for answering these questions so quickly.

benjjneb commented 4 years ago

Clearly the F primers have been removed, but I am seeing the reverse compliment of my REV primer. I am wondering if this is a fluke due to the nature and randomness of the reads, or if it's actually identifying the reversed REV primer in my reads. What are your thoughts?

My thoughts are that your amplicons are in the rev-comp orientation of what you expected. That is fine, but it is important to know that.

To fully confirm, can you open up one of your R1 fastq files, and see if the reverse complement of your REV primer is at the start of the reads? library(dada2); rc(REV) will give you the reverse-complement of the REV primer.

hlfreund commented 4 years ago

Clearly the F primers have been removed, but I am seeing the reverse compliment of my REV primer. I am wondering if this is a fluke due to the nature and randomness of the reads, or if it's actually identifying the reversed REV primer in my reads. What are your thoughts?

My thoughts are that your amplicons are in the rev-comp orientation of what you expected. That is fine, but it is important to know that.

To fully confirm, can you open up one of your R1 fastq files, and see if the reverse complement of your REV primer is at the start of the reads? library(dada2); rc(REV) will give you the reverse-complement of the REV primer.

Thank you for this. For whatever reason, I am finding the rcRev towards the end of some of my reads. This is an example from one of my sample fastq files (I just manually checked because I am continuing further with the ITS tutorial to see if I can remove them). What are your thoughts? Thank you again for your help with this. image

benjjneb commented 4 years ago

OK, so that is normal then. That means your reads were longer than the ITS1 region in some amplicons (ITS1 varies a lot in length) and read into the reverse primer.

That means you will need to use cutadapt to remove the rev-comp of the REV primer from those reads though. Unfortunately the dada2 package only has basic primer removal functionality, i.e. removes primers that appear at the start of the reads. But cutadapt will have you covered.

hlfreund commented 4 years ago

OK, so that is normal then. That means your reads were longer than the ITS1 region in some amplicons (ITS1 varies a lot in length) and read into the reverse primer.

That means you will need to use cutadapt to remove the rev-comp of the REV primer from those reads though. Unfortunately the dada2 package only has basic primer removal functionality, i.e. removes primers that appear at the start of the reads. But cutadapt will have you covered.

Thank you so much for your help with this. I am sorry to keep asking questions, but I am struggling with the cutadapt step.

When I get to the for loop for running cutadapt, I get an error stating: cutadapt: error: You did not provide any input file names. Please give me something to do!

I looked at my fastq files that fall within the "filtN" directory, and they no longer look like fastq files. They must have been corrupted somehow in the filtering step? Thank you again for all of your help on this.

benjjneb commented 4 years ago

I looked at my fastq files that fall within the "filtN" directory, and they no longer look like fastq files. They must have been corrupted somehow in the filtering step? Thank you again for all of your help on this.

By default, dada2 commands compress all files using gzip. cutadapt isn't too smart about detecting that, it just uses file extensions, and is probably confusing itself at that point.

You could either turn off compression with filterAndTrim(..., compress=FALSE), or change the names of your filtered fastq files to have a .gz at the end. If you aren't R comfortable, the first option is slightly easier.

hlfreund commented 4 years ago

I looked at my fastq files that fall within the "filtN" directory, and they no longer look like fastq files. They must have been corrupted somehow in the filtering step? Thank you again for all of your help on this.

By default, dada2 commands compress all files using gzip. cutadapt isn't too smart about detecting that, it just uses file extensions, and is probably confusing itself at that point.

You could either turn off compression with filterAndTrim(..., compress=FALSE), or change the names of your filtered fastq files to have a .gz at the end. If you aren't R comfortable, the first option is slightly easier.

I used the filterAndTrim command with compress=FALSE, but am still getting issues with cutadapt. I have a feeling it is my file names: we have samples from the same site at various depths, and here is an example file name: CTNA_30-40_S147_L001_R1_001.fastq

I think the "-" is causing problems, so I am going to rename the files and try again. Thank you so much for all of your help!!

hlfreund commented 4 years ago

So I have changed the file names, compress=FALSE...and I am still getting an error.

I have tried to change the loop to fit the fact that I am looking for the rc of the REV primer in only my R1 reads: for(i in seq_along(fnFs)) { system2(cutadapt, args = c("-g", FWD, "-b", REV.RC, "-o", fnFs.cut[i], # output files fnFs.filtN[i])) # input files }

I am following the structure based on cutadapt, but am getting an error for an invalid argument.

I am working with demulitplexed F reads of ITS1 sequences (Illumina MiSeq), and the primers used were the (forward) fITS7 and (reverse) ITS2 primers.

I am inserting my code below for your consideration: `setwd("/Volumes/HLF_SSD/Aronson_Lab_Data/XCZO_ITS1/xczo_fastq") getwd()

install.packages("devtools")

library("devtools")

devtools::install_github("benjjneb/dada2", ref="taxonomy-segfault-fix")

packageVersion("dada2") # Should be 1.15.1 library(dada2)

BiocManager::install("ShortRead")

library("ShortRead")

BiocManager::install("Biostrings")

library("Biostrings") install.packages("beepr") library("beepr")

path <- "/Volumes/HLF_SSD/Aronson_Lab_Data/XCZO_ITS1/xczo_fastq" # CHANGE ME to the directory containing the fastq files after unzipping.

path <- "/bigdata/aronsonlab/shared/XCZO/ITS1/sequences/unzipped_sequences"

list.files(path)

Now we read in the names of the fastq files, and perform some string manipulation to get matched lists of the forward and reverse fastq files.

Forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq and SAMPLENAME_R2_001.fastq

fnFs <- sort(list.files(path, pattern="_R1_001.fastq", full.names = TRUE))

fnRs <- sort(list.files(path, pattern="_R2_001.fastq", full.names = TRUE))

fnFs FWD <- "CTTGGTCATTTAGAGGAAGTAA" # ITS1f REV <- "GCTGCGTTCTTCATCGATGC" # ITS2 (its1 reverse primer)

Check orientation of reads

allOrients <- function(primer) {

Create all orientations of the input sequence

require(Biostrings) dna <- DNAString(primer) # The Biostrings works w/ DNAString objects rather than character vectors orients <- c(Forward = dna, Complement = complement(dna), Reverse = reverse(dna), RevComp = reverseComplement(dna)) return(sapply(orients, toString)) # Convert back to character vector } FWD.orients <- allOrients(FWD) REV.orients <- allOrients(REV) FWD.orients

fnFs.filtN <- file.path(path, "filtN", basename(fnFs)) # Put N-filterd files in filtN/ subdirectory filterAndTrim(fnFs, fnFs.filtN, maxN = 0, compress=FALSE) # Pre-filter seqs to remove Ns fnFs.filtN <- fnFs.filtN[file.exists(fnFs.filtN)] # Drops files that failed filter entirely and weren't written

Count # of times that primers appear in reads

primerHits <- function(primer, fn) {

Counts number of reads in which the primer is found

nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE) return(sum(nhits > 0)) } rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.filtN[[1]]), REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.filtN[[1]]))

FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.filtN[[1]]),

 # , 
  #REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.filtN[[1]]))

REMOVE PRIMERS

cutadapt <- "/Volumes/HLF_SSD/Aronson_Lab_Data/XCZO_ITS1/cutadapt" # CHANGE ME to the cutadapt path on your machine system2(cutadapt, args = "--version") # Run shell commands from R

path.cut <- file.path(path, "cutadapt") if(!dir.exists(path.cut)) dir.create(path.cut)

fnFs.cut <- file.path(path.cut, basename(fnFs.filtN))

fnRs.cut <- file.path(path.cut, basename(fnRs))

FWD.RC <- dada2:::rc(FWD) REV.RC <- dada2:::rc(REV)

Trim FWD and the reverse-complement of REV off of R1 (forward reads)

R1.flags <- paste("-g", FWD, "-b", REV.RC)

Trim REV and the reverse-complement of FWD off of R2 (reverse reads)

R2.flags <- paste("-G", REV, "-B", FWD.RC)

Run Cutadapt

for(i in seq_along(fnFs)) { system2(cutadapt, args = c("-g", FWD, "-b", REV.RC, "-o", fnFs.cut[i], # output files fnFs.filtN[i])) # input files }

Sanity Check

rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[1]]), FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.cut[[1]]))`

benjjneb commented 4 years ago

system2(cutadapt, args = c("-g", FWD, "-b", REV.RC

I think the "-b" you put here should be a "-a"? And you can drop the "-g", FWD part since the FWD primers don't need to be removed.

hlfreund commented 4 years ago

On the cutadapt user guide (https://cutadapt.readthedocs.io/en/stable/guide.html), there is a little table that explains what arguments can be used depending on where your adapter/primer sequence is within the reads. I used the "-b" argument so that it could find the REV.RC anywhere in the sequence rather than strictly in one position over the other.

Screen Shot 2020-04-19 at 2 28 23 PM

I was actually able to run the cutadapt command via the terminal through all of my files, then improted them into R to follow the 1.12 tutorial - and I had success! I am still not sure why I couldn't get it to work in R, but I will try your suggestions (removing the -g, FWD) in the future - our lab is going to work together to follow your tutorial with some other reads we have. Thank you so much for your feedback!