benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Fungal ITS pre-processing suggestion #327

Closed dleopold closed 6 years ago

dleopold commented 7 years ago

I would like to suggest an approach that I have found to dramatically improve the application of dada2 to fungal ITS amplicons. Some adapter-removal programs use an alignment of the paired reads to trim the "read-through" into the opposite-direction adapter/primer. This approach trims most shorter sequence reads to the actual length of the sequence of interest and generally removes low-quality tails, without needing a fixed-length truncation or ending up with lots of length variation introduced by a sliding-window quality trimmer. The program SeqPurge seems to work well for this, though there are others. I have also been following this trimming with a fixed-length crop of the forward and reverse reads (after looking to see where the quality drops off), but without throwing out shorter reads, using the CROP function in Trimmomatic (though you have to apply it to the fwd and rev reads separately if you want different crop lengths). A new fork of cutadapt called atropos looks like it will be able to do both of these trimming steps, can crop the fwd and rev reads to different lengths after adapter trimming, and generally seems like a promising tool for pre-processing paired-end ITS reads. However, I imagine cropping without throwing out shorter reads could also be implemented in the filterAndTrim function in dada2 fairly easily.
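To make the read-through idea concrete, here is a hedged sketch (not the exact commands I ran) using plain cutadapt driven from R via system2; primer sequences and file names are placeholders. The key point is that read-through shows up as the reverse complement of the opposite primer at the 3' end of each read:

  library(dada2)

  FWD <- "TCGATGAAGAACGCAGCG"    # placeholder forward primer
  REV <- "TCCTCCGCTTATTGATATGC"  # placeholder reverse primer

  # Read-through trimming: R1 can run into rc(REV) at its 3' end,
  # and R2 into rc(FWD); -n 2 lets cutadapt remove both a 5' and a 3' match
  system2("cutadapt", args = c(
    "-g", FWD, "-a", dada2:::rc(REV),
    "-G", REV, "-A", dada2:::rc(FWD),
    "-n", "2", "--minimum-length", "50",
    "-o", "trimmed_R1.fastq.gz", "-p", "trimmed_R2.fastq.gz",
    "raw_R1.fastq.gz", "raw_R2.fastq.gz"))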

Before adding these two steps I was recovering much lower numbers of fungal ITS sequence variants than I expected with DADA2. Now the numbers are more in line with my expectation and are more similar to what I see with other denoising approaches (i.e., unoise3). So far I think this approach is working well for me, but I would be interested to hear what others think.

benjjneb commented 7 years ago

Thanks @dleopold this is very useful feedback. Currently we have avoided implementing any explicit primer-trimming or pre-merging algorithms, as we don't want to reinvent what others have done well elsewhere. However, we will be looking more closely at ITS-specific workflows, as that is outside our area of expertise and the variable-length functionality is pretty new in dada2. @nagasridhar will be helping with this.

We are very interested in useful ITS datasets to test on, if anyone has suggestions!

giriarteS commented 7 years ago

Hi @benjjneb, below I describe my preprocessing steps for ITS (MiSeq 2x300):

  1. I use a cutadapt wrapper to trim primers (attached to the 5' end) and Nextera adapters (attached to the 3' end). The wrapper also removes sequences shorter than 50 bp after trimming.

#!/bin/bash

# Define variables
READ1="CTGTCTCTTATACACATCTCCGAGCCCACGAGAC"   # Nextera adapter in FWD
READ2="CTGTCTCTTATACACATCTGACGCTGCCGACGA"    # Nextera adapter in REV
PRIMER_F="TCGATGAAGAACGCAGCG"                # 5.8SR FWD primer
PRIMER_R="TCCTCCGCTTATTGATATGC"              # ITS4 REV primer
MIN_LENGTH=50          # Discard trimmed reads that are shorter than MIN_LENGTH
OVERLAP_MIN_LENGTH=10
MIN_QUALITY=10
CUTADAPT="$(which cutadapt) -q ${MIN_QUALITY} --minimum-length ${MIN_LENGTH} -O ${OVERLAP_MIN_LENGTH}"
NAMES=(`gcat namelist`)

mkdir trimmed

for i in "${NAMES[@]}"; do
    FORWARD_READ="${i}_R1.fastq"
    REVERSE_READ="${i}_R2.fastq"
    TRIMMED_R1="${i}_R1.fq"
    TRIMMED_R2="${i}_R2.fq"
    ${CUTADAPT} -a ${READ1} -A ${READ2} -g ${PRIMER_F} -G ${PRIMER_R} \
        -o ../trimmed/${TRIMMED_R1} -p ../trimmed/${TRIMMED_R2} \
        ${FORWARD_READ} ${REVERSE_READ}
done

2. All the R1 reads and all the R2 reads are concatenated, respectively, to run usearch -fastq_eestats2. This helps to pick maxEE (see the expected-errors sketch after step 3).

3. Then I use filterAndTrim, leaving truncLen out, but I have to use a really high maxEE (around 20):

filterAndTrim(fnFs, filtFs, rev=fnRs, filt.rev=filtRs, minLen=50, truncQ=2, maxEE=20, rm.phix=TRUE, multithread=TRUE)
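The expected-errors statistic that eestats2 reports can also be computed directly in R; a minimal sketch (assuming the ShortRead package is installed; the file name is a placeholder) to see where the EE distribution of your reads sits:

  library(ShortRead)

  # Expected errors per read: EE = sum over bases of 10^(-Q/10)
  fq <- readFastq("trimmed_R1.fastq")            # placeholder file name
  qmat <- as(quality(fq), "matrix")              # Phred scores, NA-padded for shorter reads
  ee <- rowSums(10^(-qmat / 10), na.rm = TRUE)
  quantile(ee, c(0.5, 0.9, 0.99))                # how much of the data would a given maxEE keep?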

Quoting the dada2 guidance: "For common ITS amplicon strategies, it is undesirable to truncate reads to a fixed length due to the large amount of length variation at that locus. That is OK, just leave out truncLen. Make sure you removed the forward and reverse primers from both the forward and reverse reads though!"

Should I trim the reads based on the eestats output?

eestats.xlsx

benjjneb commented 7 years ago

maxEE=20 is really high; reads with quality that bad are probably hurting more than they are helping to obtain an accurate picture of the community.

Can you get an acceptable number of reads through the filter at a lower maxEE? For example, losing half the reads in filtering would probably be worth it if it lets you bring maxEE down to more like 2-4, but losing 90+% of the reads would not.
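To see the trade-off concretely, here is a hypothetical single-sample scan (file names are placeholders) over a few maxEE cutoffs, reporting the fraction of reads retained at each:

  library(dada2)

  for (ee in c(2, 4, 8, 20)) {
    out <- filterAndTrim("trimmed_R1.fastq.gz", "tmpF.fastq.gz",
                         "trimmed_R2.fastq.gz", "tmpR.fastq.gz",
                         maxEE = c(ee, ee), truncQ = 2, rm.phix = TRUE)
    cat("maxEE =", ee, "->",
        round(100 * out[1, "reads.out"] / out[1, "reads.in"], 1), "% retained\n")
  }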

dleopold commented 7 years ago

@giriarteS - I think I would avoid using the -q 10 parameter in cutadapt as preprocessing for dada2. Quality-trimming tails like this will result in lots of length variation in your reads, particularly in the reverse reads. As a result, the dereplication step will not be as effective. It is not yet clear to me how dada2 deals with length variation of this type (as opposed to true length variation among RSVs). Maybe Ben can fill us in.

giriarteS commented 7 years ago

I tried maxEE=c(3,6); with these parameters I was able to keep 30% of the reads per sample. Now I am merging my reads with PEAR before running DADA2. I will post my results soon.

giriarteS commented 7 years ago

I found a workflow for preprocessing ITS here. My samples were sequenced at the University of Minnesota Genomics Center (UMGC) using their dual-index protocol, like the example in this workflow. I followed steps 2 to 4 (2. trim adapters; 3. merge with PEAR; 4. remove the 5.8S head and LSU tail), and the quality-filter step was done with dada2:

  filterAndTrim(fns, filts, minLen=50, maxEE=1, maxN=0, truncQ=2, rm.phix=TRUE,
                multithread=TRUE, verbose=TRUE)

After that, ~90% of the reads per sample remain, with ~400 bp average length.
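For reference, the merge step (3) with PEAR might look something like this (a sketch; the sample prefix is a placeholder, and -f/-r/-o are PEAR's standard flags):

  # Merge trimmed paired reads with PEAR; output goes to sample1.assembled.fastq,
  # which matches the pattern="assembled.fastq" used when loading files below
  system2("pear", args = c("-f", "sample1_R1.fq", "-r", "sample1_R2.fq",
                           "-o", "sample1"))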

werdnus commented 7 years ago

@giriarteS, that pretty much mirrors what I'm doing with ITS right now. Pre-merging with PEAR doesn't seem to affect the sample inference, but it allows for the retention of many more reads than merging after filtering, because the alignment "rescues" low-quality reverse reads that would (in the DADA2 pipeline) trigger the high-quality forward read to be discarded. I did need to add a second filtering step after merging to remove sequences with Ns, which DADA2 can't deal with (a sketch of that filter is below).
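A minimal sketch of that post-merge N filter (file names are placeholders):

  library(dada2)

  # Drop merged reads containing Ns before dada(); other filters left permissive
  filterAndTrim("sample1.assembled.fastq", "sample1.noN.fastq.gz",
                maxN = 0, truncQ = 0, maxEE = Inf)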

Assigning taxonomy using the UNITE reference database seems to be robust to the inclusion of some ribosomal regions; I seem to get better inference and taxonomy without trimming them. Digging into the literature around ITSx, it seems like one of its great strengths early on was that it efficiently separated real ITS variants from PhiX contamination, but DADA2 handles that very well. I've not seen any significant benefit to the trimming, just loss of information. I'm open to being convinced, though! Anyone have stories about how ITSx or similar has saved them from some misinterpretation?

giriarteS commented 7 years ago

I usually run ITSx on the final sequences (extracted from the seqtab.nochim table) to add a column (origin) to my tax_table, which I use to prune my dataset later. It would be useful if assignTaxonomy could output an alignment score, to allow filtering based on the percentage of the query length aligned and the bootstrap values.

benjjneb commented 7 years ago

Issue #322 is also relevant to this issue.

rachaelantwis commented 6 years ago

Thanks all for this - it's really helpful. @giriarteS or @werdnus - would you be willing to provide code for me to replicate with my data please? I can't quite figure out what you are assigning to "fns, filts" in the filterAndTrim step. @werdnus - are you suggesting just running the dada2 pipeline without any pre-trimming or removal of the reverse-sequenced primers? Many thanks.

giriarteS commented 6 years ago

This is what I did with my samples. First trim adapters, then merge, and remove the 5.8S head and LSU tail from the merged reads. Now the reads are ready for dada2:

  # Load
  path <- "path to merged clean reads"
  list.files(path)
  fns <- sort(list.files(path, pattern="assembled.fastq")) # these are your merged clean reads
  sample.names <- sapply(strsplit(fns, "_"), `[`, 1)
  fns <- file.path(path, fns)

  # Filtered reads path
  filt_path <- file.path(path, "filtered") # Place filtered files in filtered/ subdirectory
  filts <- file.path(filt_path, paste0(sample.names, "_filt.fq.gz"))

  # Filter
  out <- filterAndTrim(fns, filts, minLen=50, maxEE=1, maxN=0, truncQ=2,
                       rm.phix=TRUE, multithread=TRUE, verbose=TRUE)
  head(out)

rachaelantwis commented 6 years ago

Thanks very much for this - I'll give it a go :)

rachaelantwis commented 6 years ago

Hi, thanks again for this. I'm getting the following error:

  path <- "/Users/rachaelantwis/Desktop/fungi_test"
  list.files(path)
  [1] "Manc_ITS_merged_and_cut.fastq"
  fns <- sort(list.files(path, pattern="assembled.fastq"))
  sample.names <- sapply(strsplit(fns, ""), [, 1)
  Error: unexpected '[' in "sample.names <- sapply(strsplit(fns, ""), ["

Do you know what might be the problem? Many thanks

giriarteS commented 6 years ago

Use:

  fns <- sort(list.files(path, pattern="_merged_and_cut.fastq"))

instead of:

  fns <- sort(list.files(path, pattern="assembled.fastq"))

nagasridhar commented 6 years ago

Also, changing this line:

  sample.names <- sapply(strsplit(fns, ""), [, 1)

to:

  sample.names <- sapply(strsplit(fns, "_"), `[`, 1)

should capture your sample name as "Manc".

rachaelantwis commented 6 years ago

Thank you, but still the same problem:

  path <- "/Users/rachaelantwis/Desktop/fungi_test"
  list.files(path)
  [1] "Manc_merged_and_cut.fastq"
  fns <- sort(list.files(path, pattern="_merged_andcut.fastq"))
  sample.names <- sapply(strsplit(fns, ""), [, 1)
  Error: unexpected '[' in "sample.names <- sapply(strsplit(fns, "_"), ["

rachaelantwis commented 6 years ago

(I used fns <- sort(list.files(path, pattern="_merged_and_cut.fastq")) but it's being changed to italics for some reason)

giriarteS commented 6 years ago

Try: sample.names <- sapply(strsplit(fns, "_"), `[`, 1)

rachaelantwis commented 6 years ago

Hmm, that also doesn't work :/ Could I perhaps send you the file and see if you can figure it out? Many thanks for your help!

giriarteS commented 6 years ago

Yes, it's not showing: you need a ` before and after the [

giriarteS commented 6 years ago

Send me the file - do you want to share a Dropbox folder?

rachaelantwis commented 6 years ago

Ah thanks, that worked now! So how do the next steps work if the reads are already merged?

I imagine something like this (but I'm guessing so please correct me if I'm wrong!):

#######LEARNING ERROR RATES

err <- learnErrors(filts, multithread=TRUE)

#Plot Error Rates
  plotErrors(err, nominalQ=TRUE)

#######DEREPLICATION / IDENTIFYING UNIQUE SEQUENCES

dereps <- derepFastq(filts, verbose=TRUE)

# Name the derep-class objects by the sample names

names(dereps) <- sample.names

############INFERENCE OF SEQUENCE VARIANTS IN EACH SAMPLE

dadas <- dada(dereps, err=err, multithread=TRUE)

#inspect
  dadas[[1]]

Then this is where I get stuck: how do we get "mergers" to make the sequence variant table, remove chimeras, and then track reads through the pipeline?

###########MERGE PAIRED READS

  mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)
  # Inspect the merger data.frame from the first sample
  head(mergers[[1]])

########### MAKE SEQUENCE VARIANT TABLE

  seqtab <- makeSequenceTable(mergers)    

########### REMOVE CHIMERA

  seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
  dim(seqtab.nochim)

  #% Sequences removed
  1-sum(seqtab.nochim)/sum(seqtab2)

########### TRACK READS THROUGH PIPELINE

  getN <- function(x) sum(getUniques(x))
  track <- cbind(out, sapply(dadaFs, getN), sapply(mergers, getN), rowSums(seqtab2), rowSums(seqtab.nochim))
  colnames(track) <- c("input", "filtered", "denoised", "merged", "tabled", "nonchim")
  rownames(track) <- sample.names
  head(track)

Thank you!

rachaelantwis commented 6 years ago

Sorry, that formatting is all over the shop

giriarteS commented 6 years ago

Use your dada output:

  seqtab <- makeSequenceTable(dadas)
  seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE)
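And for tracking reads without a mergePairs step, a reduced table could look like this (a sketch, assuming the out, dadas, and sample.names objects defined above):

  getN <- function(x) sum(getUniques(x))
  # If you have just one sample, use getN(dadas) instead of sapply(dadas, getN)
  track <- cbind(out, sapply(dadas, getN), rowSums(seqtab.nochim))
  colnames(track) <- c("input", "filtered", "denoised", "nonchim")
  rownames(track) <- sample.names
  head(track)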

rachaelantwis commented 6 years ago

PS @benjjneb, I'd be happy to provide some ITS data for you to test this on?

nagasridhar commented 6 years ago

Hi @rachaelantwis, I am a post-doc in Ben's lab, working on ITS analysis. It would be great if you could share the test data with us; it will surely help me.

Thanks!

rachaelantwis commented 6 years ago

Could you send me an email to Rachael.antwis@gmail.com and we'll figure out the best way to share files? Thanks

gcuster1991 commented 6 years ago

@giriarteS Would you be able to provide some insight as to how you extract the sequences from your chimera-free seq_table and pull out the ITS region via ITSx? Thanks!

giriarteS commented 6 years ago

I used to assign taxonomy with dada2 using the UNITE database (minBoot=80), but later, in the statistical analysis, the majority of the highly abundant unassigned sequences belonged to my host plant or to nematodes. I still assign taxonomy with dada2 (minBoot=80), but I use a specific ITS2 database (5.8S and LSU removed), and I keep only the sequences that ITSx identified as fungal ITS2.

Extract the sequences from the chimera-free SV table:

  uniquesToFasta(seqtab.nochim, '/path_to_fasta_file/uniques.fa',
                 ids = paste0("SV", seq(length(getSequences(seqtab.nochim)))))

Run ITSx on them; I select only the fungal ITS (-t F), and my final ITS2 sequences have the 5.8S and LSU removed (--truncate T):

  ITSx -i uniques.fa -o output_directory -p path_to_HMM-profiles --reset T -t F \
       --allow_single_domain --allow_reorder T --cpu 8 --preserve T \
       --save_regions 5.8S,ITS2,LSU --table T --detailed_results T --truncate T

Assign taxonomy and relabel the sequence and tax tables:

  seqtab_final <- seqtab.nochim
  colnames(seqtab_final) <- paste0("SV", 1:ncol(seqtab_final))
  tax_final <- tax
  rownames(tax_final) <- paste0("SV", 1:nrow(tax_final))

Combine the data into a phyloseq object and remove non-fungal ITS from it.

gcuster1991 commented 6 years ago

@giriarteS Thank you for the quick reply. You refer to a specific ITS2 database. Is this readily available or did you custom format it?

giriarteS commented 6 years ago

I formatted the database myself. If you want a trained ITS1 or ITS2 database, check UTAX.

gcuster1991 commented 6 years ago

Hi @giriarteS, I was able to run the first two blocks of code you provided. However, I'm unsure how I reimport my ITSx'ed data (ITSx_outs.ITS2.fasta). It appears the third block replaces the full sequence names with SV#. Am I missing something obvious? Thanks for your help!

gcuster1991 commented 6 years ago

As a follow-up, I have managed to play with the sequence file and put together a janky workaround to get it imported and formatted correctly. However, now I am questioning the use of ITSx with the DADA2 pipeline.

After extracting the ITS2 region via ITSx, some sequences are flagged as duplicates that were distinct when the 5.8S and LSU regions were included. If there are base pairs in the 5.8S or LSU regions that differentiate the sequences from each other, why would we want to remove those regions?

@nagasridhar any thoughts on this as you are working with DADA2 and ITS data?

giriarteS commented 6 years ago

I use the ITSx results as a guide to safely remove unidentified sequences. Just follow the normal pipeline in DADA2. Assign taxonomy using the UNITE database:

  tax <- assignTaxonomy(seqtab.nochim, ref_UNITE, minBoot = 80, tryRC = TRUE,
                        multithread = TRUE, verbose = TRUE)

Add IDs to each sequence variant:

  seqtab_final <- seqtab.nochim
  colnames(seqtab_final) <- paste0("SV", 1:ncol(seqtab_final))
  tax_final <- tax
  rownames(tax_final) <- paste0("SV", 1:nrow(tax_final))

Combine the data into a phyloseq object:

  metadata_path <- "path_to_sample_data"
  samdf <- read.csv(metadata_path, header = TRUE)
  all(rownames(seqtab.nochim) %in% samdf$sampleid)
  rownames(samdf) <- samdf$sampleid

  # Use the SV-relabeled table so the taxa names match tax_final
  ps <- phyloseq(tax_table(tax_final), sample_data(samdf),
                 otu_table(seqtab_final, taxa_are_rows = FALSE))

Check the ITSx output, make a list of non-fungal sequences, and compare it with the sequences assigned only to kingdom or phylum. Then get your final list of sequences to remove:

  nonFungal = c("SV1", "SV10", .........)

And filter the otu table:

  allTaxa = taxa_names(ps)
  goodTaxa <- allTaxa[!(allTaxa %in% nonFungal)]
  ps.1 = prune_taxa(goodTaxa, ps)
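Rather than typing the nonFungal list by hand, it could in principle be derived from the ITSx output. A hedged sketch (the file name follows ITSx's output naming, and the header parsing assumes each header begins with the SV id written by uniquesToFasta above; check your ITSx version's exact format):

  library(Biostrings)

  # SV ids that ITSx identified as fungal ITS2; everything else is a removal candidate
  its2 <- readDNAStringSet("output_directory.ITS2.fasta")
  fungalSV <- sub("[| ].*", "", names(its2))   # keep the leading "SV..." token
  nonFungal <- setdiff(taxa_names(ps), fungalSV)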

PrashINRA commented 6 years ago

Hi @benjjneb, I have a good-quality ITS dataset (4 GB), which I analysed last year using DADA2 1.5; I am willing to share it if you are still interested.

benjjneb commented 6 years ago

I think we are OK with data right now, but thanks for the offer @PrashINRA

chassenr commented 6 years ago

Hi, may I bother you with a not-fungal question about merging PE reads before running dada()? I am working with a V4V5 16S PE Illumina dataset, which I analyzed both with the default dada2 pipeline and with a customized workflow consisting of quality trimming (trimmomatic), merging (PEAR), and only then running dada. I was very surprised that the number of sequence variants from the default pipeline was about 8 times higher than the number generated from the already-merged dataset. After chimera removal and exclusion of any singletons that may have been generated by the merging process in the default dada2 pipeline, the difference was still 4-fold.

The rarefaction curves generated from the two analysis outputs looked vastly different: the default output produces rarefaction curves that are very far from saturation, while merging before dada resulted in rarefaction curves that were already saturated. Comparing the 2 outputs based on the inverse Simpson index (as an index that is less sensitive to rare types) also showed higher values for the default output - however, not for all samples.

Is it possible that an input consisting of variable sequence lengths reduces the sensitivity of dada to detect sequence variants? Or is it rather that the merging in dada2 inflates the number of sequence variants? Given the large proportion of chimeras detected by the default pipeline, is it possible that the merging step introduces chimeras?

Thanks a lot!

Cheers, Christiane

benjjneb commented 6 years ago

Hi @chassenr, we do not recommend merging prior to the dada2 workflow. The quality-score-aware error rates are learned by dada2 under the assumption that they are the same across the entire read, and that is not the case for pre-merged reads: the overlapping part of the reads can have a different relationship between Q and error rate than the non-overlapping parts.
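A quick way to check how well an error model fits a given input is the standard diagnostic plot; a sketch using the filts object from earlier in this thread - for pre-merged reads, the fitted rates may sit visibly off the nominal-Q expectation:

  library(dada2)

  # Learn error rates from the (pre-merged) filtered reads and compare the fit
  # against the error rates implied by the nominal quality scores (red line)
  err <- learnErrors(filts, multithread = TRUE)
  plotErrors(err, nominalQ = TRUE)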

chassenr commented 6 years ago

Thanks @benjjneb for the clarification. I am curious, though, because in one of your previous posts (#322) you mentioned that you may consider implementing a 'merge-first' workflow. Are you still pursuing this approach?

benjjneb commented 6 years ago

@chassenr We have looked into it, and continue to recommend against pre-merging - particularly if using PEAR, actually. The issue is that many of the read-merging programs do not report realistic quality scores for the merged region, which interferes with DADA2, as it uses those quality scores. If you must pre-merge, usearch's read merging performed the best for this purpose in our tests (see the sketch below).
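For what it's worth, that usearch merge step might look like this (a sketch; file names are placeholders, flags per the usearch manual):

  # Merge with usearch, whose merged quality scores worked best with dada2 per the tests above
  system2("usearch", args = c("-fastq_mergepairs", "sample1_R1.fastq",
                              "-reverse", "sample1_R2.fastq",
                              "-fastqout", "sample1.merged.fastq"))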

We continue to have interest in developing this in the future, as it simplifies and speeds up the workflow a bit (especially for ITS). However, our development philosophy prioritizes accuracy above all else, and until we can show that a pre-merging method provides accuracy equivalent to (or better than) post-merging, we will continue to recommend post-merging.

cjfields commented 6 years ago

I believe vsearch implements read merging using PEAR's algorithm, so it's not quite a drop-in replacement for usearch in this respect.

benjjneb commented 6 years ago

We have now posted our "official" ITS workflow.

Not that it's the only right way to do things, but it's at least one right way.