Closed: dleopold closed this issue 6 years ago
Thanks @dleopold, this is very useful feedback. Currently we have avoided implementing any explicit primer-trimming or pre-merging algorithms, as we don't want to reinvent what others have done well elsewhere. However, we will be looking more closely at ITS-specific workflows, as that is outside our area of expertise and the variable-length functionality is pretty new in dada2. @nagasridhar will be helping with this.
We are very interested in useful ITS datasets to test on, if anyone has suggestions!
Hi @benjjneb, Below I describe my preprocessing steps for ITS (MiSeq 2x300):
#!/bin/bash
# Define variables
READ1="CTGTCTCTTATACACATCTCCGAGCCCACGAGAC" # Nextera adapter in FWD
READ2="CTGTCTCTTATACACATCTGACGCTGCCGACGA" # Nextera adapter in REV
PRIMER_F="TCGATGAAGAACGCAGCG" # 5.8SR FWD primer
PRIMER_R="TCCTCCGCTTATTGATATGC" # ITS4 REV primer
MIN_LENGTH=50 # Discard trimmed reads that are shorter than MIN_LENGTH
OVERLAP_MIN_LENGTH=10
MIN_QUALITY=10
CUTADAPT="$(which cutadapt) -q ${MIN_QUALITY} --minimum-length ${MIN_LENGTH} -O ${OVERLAP_MIN_LENGTH}"
NAMES=( $(cat namelist) ) # one sample basename per line
mkdir trimmed
for i in "${NAMES[@]}"
do
FORWARD_READ="${i}_R1.fastq"
REVERSE_READ="${i}_R2.fastq"
TRIMMED_R1="${i}_R1.fq"
TRIMMED_R2="${i}_R2.fq"
${CUTADAPT} -a ${READ1} -A ${READ2} -g ${PRIMER_F} -G ${PRIMER_R} -o trimmed/${TRIMMED_R1} -p trimmed/${TRIMMED_R2} ${FORWARD_READ} ${REVERSE_READ}
done
2) All the R1 and R2 reads are concatenated respectively to run usearch -fastq_eestats2. This helps to pick the maxEE.
3) Then I used filterAndTrim, leaving truncLen out, but I have to use a really high maxEE (around 20):
filterAndTrim(fnFs, filtFs, rev=fnRs, filt.rev=filtRs, minLen=50, truncQ=2, maxEE=20, rm.phix=TRUE, multithread=TRUE)
For common ITS amplicon strategies, it is undesirable to truncate reads to a fixed length due to the large amount of length variation at that locus. That is OK, just leave out truncLen. Make sure you removed the forward and reverse primers from both the forward and reverse reads though!
Should I trim the reads based on the eestats output?
maxEE=20 is really high; reads with quality that bad are probably hurting more than they are helping to obtain an accurate picture of the community.
Can you get an acceptable number of reads through the filter at a lower maxEE? For example, you would probably be better off losing half the reads in filtering if you can lower maxEE down to more like 2-4, but not 90+% of the reads.
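For reference, maxEE thresholds the sum of expected errors over a read, where each base contributes its error probability 10^(-Q/10). A small plain-Python sketch (not dada2 code, just the arithmetic) shows why needing maxEE=20 signals very poor reads:

```python
def phred33(qual_string):
    """Decode a Phred+33 quality string into integer Q scores."""
    return [ord(c) - 33 for c in qual_string]

def expected_errors(quals):
    """Sum of per-base error probabilities: EE = sum(10^(-Q/10)).
    This is the quantity that maxEE thresholds in filterAndTrim."""
    return sum(10 ** (-q / 10) for q in quals)

# A 250 bp read at a uniform Q30 (0.1% error per base) has EE = 0.25
# and sails through maxEE=2; the same read at Q10 has EE = 25 and
# would only survive a maxEE of 25 or more.
print(expected_errors([30] * 250))
print(expected_errors([10] * 250))
```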
@giriarteS - I think I would avoid using the -q 10 parameter in cutadapt as preprocessing for dada. Quality trimming tails like this will result in lots of length variation in your reads, particularly in the reverse. As a result, the dereplication step will not be as effective. It is not yet clear to me how dada deals with length variation of this type (as opposed to true length variation among RSVs). Maybe Ben can fill us in.
I tried maxEE=c(3,6); with these parameters I was able to keep 30% of the reads per sample. Now I am merging my reads with PEAR before running DADA2. I will post my results soon.
I found a workflow for preprocessing ITS here. My samples were sequenced at the University of Minnesota Genomic Center (UMGC), using their dual-index protocol like the example in this workflow. I followed steps 2 to 4 (2. trim adapters, 3. merge with PEAR, 4. remove 5.8S head and LSU tail) and the quality filter step was done with dada2:
filterAndTrim(fns, filts, minLen=50, maxEE=1, maxN=0, truncQ=2, rm.phix=TRUE,multithread=TRUE, verbose = TRUE)
After that, ~90% of the reads/sample remain, with ~400 bp average length.
@giriarteS, that pretty much mirrors what I'm doing with ITS right now. Pre-merging with PEAR doesn't seem to affect the sample inference, but it allows for the retention of many more reads than merging after filtering, because the alignment "rescues" low-quality reverse reads that would otherwise (in the DADA2 pipeline) trigger the high-quality forward read to be discarded. I did need to add a second filtering step after merging to remove sequences with Ns, which DADA2 can't deal with.
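For what it's worth, that post-merge N-removal can be as simple as discarding any merged record whose sequence contains an N (dada2's filterAndTrim with maxN=0 does the same job). A minimal plain-Python sketch, with made-up records:

```python
def drop_reads_with_ns(records):
    """Keep only (id, sequence, quality) records with no ambiguous N
    bases, since DADA2's error model cannot handle Ns."""
    return [rec for rec in records if "N" not in rec[1].upper()]

records = [
    ("merged_1", "ACGTACGTACGT", "IIIIIIIIIIII"),
    ("merged_2", "ACGTNCGTACGT", "IIIIIIIIIIII"),  # contains an N: dropped
]
kept = drop_reads_with_ns(records)
print([rec[0] for rec in kept])  # only merged_1 survives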
Assigning taxonomy using the UNITE reference database seems to be robust to the inclusion of some ribosomal regions; I seem to get better inference and taxonomy without trimming them. Digging into the literature around ITSx, it seems like one of its great strengths early on was that it efficiently separated real ITS variants from PhiX contamination, but DADA2 handles that very well. I've not seen any significant benefit to the trimming, just loss of information. I'm open to being convinced though! Anyone have stories about how ITSx or similar has saved them from some misinterpretation?
I usually run ITSx on the final sequences (extracted from the seqtab.nochim table) to add a column (origin) to my tax_table, so I can prune my dataset later. It would be useful if assignTaxonomy could output an alignment score, to filter based on the percentage of alignment over the query length and on bootstrap values.
Issue #322 is also relevant to this issue.
Thanks all for this - it's really helpful. @giriarteS or @werdnus - would you be willing to provide code for me to replicate with my data, please? I can't quite figure out what you are assigning to "fns, filts" in the filterAndTrim step. @werdnus - are you suggesting just running the dada2 pipeline without any pre-trimming or removal of the reverse-sequenced primers? Many thanks.
This is what I did with my samples: first trim adapters, then merge and remove the 5.8S head and LSU tail from the merged reads. The merged, cleaned reads are then ready for dada2:
Load
path <- "path to merged clean reads"
list.files(path)
fns <- sort(list.files(path, pattern="assembled.fastq")) # these are your merged clean reads
sample.names <- sapply(strsplit(fns, "_"), `[`, 1)
fns <- file.path(path, fns)
Filter reads path
filt_path <- file.path(path, "filtered") # Place filtered files in filtered/ subdirectory
filts <- file.path(filt_path, paste0(sample.names, "_filt.fq.gz"))
Filter
out <- filterAndTrim(fns, filts, minLen=50, maxEE=1, maxN=0, truncQ=2, rm.phix=TRUE,multithread=TRUE, verbose = TRUE)
head(out)
Thanks very much for this - I'll give it a go :)
Hi, thanks again for this. I'm getting the following error:
path <- "/Users/rachaelantwis/Desktop/fungi_test"
list.files(path)
[1] "Manc_ITS_merged_and_cut.fastq"
fns <- sort(list.files(path, pattern="assembled.fastq"))
sample.names <- sapply(strsplit(fns, ""), [, 1)
Error: unexpected '[' in "sample.names <- sapply(strsplit(fns, ""), ["
Do you know what might be the problem? Many thanks
Use: fns <- sort(list.files(path, pattern="_merged_and_cut.fastq")) instead of fns <- sort(list.files(path, pattern="assembled.fastq"))
Also, changing this line:
sample.names <- sapply(strsplit(fns, ""), [, 1)
to,
sample.names <- sapply(strsplit(fns, "_"), `[`, 1)
should capture your sample name as Manc.
Thank you, but still the same problem:
path <- "/Users/rachaelantwis/Desktop/fungi_test"
list.files(path)
[1] "Manc_merged_and_cut.fastq"
fns <- sort(list.files(path, pattern="_merged_and_cut.fastq"))
sample.names <- sapply(strsplit(fns, "_"), [, 1)
Error: unexpected '[' in "sample.names <- sapply(strsplit(fns, "_"), ["
(I used fns <- sort(list.files(path, pattern="_merged_and_cut.fastq")) but it's being changed to italics for some reason)
Try:
sample.names <- sapply(strsplit(fns, "_"), `[`, 1)
Hmm, also doesn't work :/ could I send you the file perhaps please and see if you can figure it out? Many thanks for your help!
Yes, it's not showing: you need a ` before and after the [.
Send me the file. Do you want to share a Dropbox folder?
Ah thanks, that worked now! So how do the next steps work if the reads are already merged?
I imagine something like this (but I'm guessing so please correct me if I'm wrong!):
#######LEARNING ERROR RATES
err <- learnErrors(filts, multithread=TRUE)
#Plot Error Rates
plotErrors(err, nominalQ=TRUE)
#######DEREPLICATION / IDENTIFYING UNIQUE SEQUENCES
dereps <- derepFastq(filts, verbose=TRUE)
names(dereps) <- sample.names
############INFERENCE OF SEQUENCE VARIANTS IN EACH SAMPLE
dadas <- dada(dereps, err=err, multithread=TRUE)
#inspect
dadas[[1]]
Then this is where I get stuck: how do we get "mergers" to make the sequence variant table, remove chimeras and then track through the pipeline?
###########MERGE PAIRED READS
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)
# Inspect the merger data.frame from the first sample
head(mergers[[1]])
########### MAKE SEQUENCE VARIANT TABLE
seqtab <- makeSequenceTable(mergers)
########### REMOVE CHIMERA
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
dim(seqtab.nochim)
#% Sequences Removed
1-sum(seqtab.nochim)/sum(seqtab2)
########### TRACK READS THROUGH PIPELINE
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(mergers, getN), rowSums(seqtab2), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoised", "merged", "tabled", "nonchim")
rownames(track) <- sample.names
head(track)
Thank you!
Sorry, that formatting is all over the shop
Use your dada output:
seqtab <- makeSequenceTable(dadas)
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE)
PS @benjjneb, I'd be happy to provide some ITS data for you to test this on?
Hi @rachaelantwis, I am a post-doc in Ben's lab working on ITS analysis. It would be great if you could share the test data with us; it will surely help me.
Thanks!
Could you send me an email to Rachael.antwis@gmail.com and we'll figure out the best way to share files? Thanks
@giriarteS Would you be able to provide some insight as to how you extract the sequences from your chimera free seq_table and extract the ITS region via ITSx? Thanks!
I used to assign taxonomy with dada2 using the UNITE database (minBoot=80), but later in the statistical analysis the majority of the highly abundant unassigned sequences belonged to my host plant or to nematodes. I still assign taxonomy with dada2 (minBoot=80), but I use a specific ITS2 database (5.8S and LSU removed) and I keep only the sequences that were identified as fungal ITS2 by ITSx.
Extract sequences from chimera free SV table:
uniquesToFasta(seqtab.nochim, '/path_to_fasta_file/uniques.fa', ids = paste0("SV", seq(length(getSequences(seqtab.nochim)))))
Run ITSx on them; I select only the fungal ITS (-t F), and my final ITS2 sequences have the 5.8S and LSU removed (--truncate T):
ITSx -i uniques.fa -o output_directory -p path_to_HMM-profiles --reset T -t F --allow_single_domain --allow_reorder T --cpu 8 --preserve T --save_regions 5.8S,ITS2,LSU --table T --detailed_results T --truncate T
Assign taxonomy and relabel sequence and tax table:
seqtab_final <- seqtab.nochim
colnames(seqtab_final) <- paste0("SV", 1:ncol(seqtab_final))
tax_final <- tax
rownames(tax_final) <- paste0("SV", 1:nrow(tax_final))
Combine the data into a phyloseq object and remove non-fungal ITS from your phyloseq object.
@giriarteS Thank you for the quick reply. You refer to a specific ITS2 database. Is this readily available or did you custom format it?
I formatted the database. If you want a trained ITS1 or ITS2 database check UTAX
Hi @giriarteS, I was able to run the first two blocks of code you provided. However, I'm unsure how I re-import my ITSx'ed data (ITSx_outs.ITS2.fasta). It appears the third block replaces the full sequence names with SV#. Am I missing something obvious? Thanks for your help!
As a follow-up, I have managed to play with the sequence file and make a janky workaround to get it imported and formatted correctly. However, now I am questioning the use of ITSx with the DADA2 pipeline.
After extracting the ITS2 region via ITSx, some sequences are flagged as duplicates that were separate when the 5.8S and LSU regions were included. If there are base pairs in the 5.8S or LSU region which differentiate the sequences from each other, why would we want to remove those regions?
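To make the concern concrete: if two variants differ only in the flanking 5.8S/LSU bases, trimming to the ITS2 makes them identical, and their read counts then have to be summed into a single variant, losing the flanking signal. A plain-Python sketch with made-up sequences and a hypothetical fixed-width trim (ITSx finds the boundaries per sequence; fixed widths are used here only for illustration):

```python
from collections import defaultdict

def collapse_after_trim(counts, trim_left, trim_right):
    """Sum the counts of sequences that become identical after removing
    trim_left bases from the 5' end and trim_right bases from the 3' end."""
    collapsed = defaultdict(int)
    for seq, n in counts.items():
        collapsed[seq[trim_left:len(seq) - trim_right]] += n
    return dict(collapsed)

# Two variants that differ only in the last (LSU-side) base.
counts = {"AAACGTACGTTT": 10, "AAACGTACGTTA": 5}
merged = collapse_after_trim(counts, trim_left=3, trim_right=1)
print(merged)  # both collapse to "CGTACGTT", with count 15
```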
@nagasridhar any thoughts on this as you are working with DADA2 and ITS data?
I use the ITSx results as a guide to safely remove unidentified sequences.
Just follow the normal pipeline in DADA2. Assign taxonomy using the UNITE database:
tax <- assignTaxonomy(seqtab.nochim, ref_UNITE, minBoot = 80, tryRC = TRUE, multithread=TRUE, verbose = TRUE)
Add Ids to each sequence variant:
seqtab_final <- seqtab.nochim
colnames(seqtab_final) <- paste0("SV", 1:ncol(seqtab_final))
tax_final <- tax
rownames(tax_final) <- paste0("SV", 1:nrow(tax_final))
Combine data into a phyloseq object:
metadata_path <- "path_to_sample_data"
samdf <- read.csv(metadata_path, header=TRUE)
all(rownames(seqtab.nochim) %in% samdf$sampleid)
rownames(samdf) <- samdf$sampleid
ps <- phyloseq(tax_table(tax_final), sample_data(samdf),otu_table(seqtab.nochim, taxa_are_rows = FALSE))
Check the ITSx output, make a list of non-fungal sequences, and compare it with the sequences assigned only to kingdom or phylum. Then get your final list of sequences to remove:
nonFungal = c("SV1", "SV10", ...)
And filter the otu table:
allTaxa = taxa_names(ps)
goodTaxa <- allTaxa[!(allTaxa %in% nonFungal)]
ps.1 = prune_taxa(goodTaxa, ps)
Hi @benjjneb. I have a good-quality ITS dataset (4 GB, which I analysed last year using DADA2 1.5); I am willing to share it if you are still interested.
I think we are OK with data right now, but thanks for the offer @PrashINRA
Hi, may I bother you with a non-fungal question about merging PE reads before running dada()? I am working with a V4V5 16S PE Illumina data set, which I analyzed with the default dada2 pipeline, and with a customized workflow consisting of quality trimming (trimmomatic), merging (PEAR), and only then running dada. I was very surprised that the number of sequence variants with the default pipeline was about 8 times higher than the number of sequence variants generated from the already-merged data set. After chimera removal and excluding any singletons which may have been generated by the merging process in the default dada2 pipeline, the difference was still 4-fold. The rarefaction curves generated from the two analysis outputs looked vastly different: the default output produces rarefaction curves which are very far from saturation, while merging before dada resulted in rarefaction curves that were already saturated. Comparing the two outputs based on the inverse Simpson index (as an index that is less sensitive to rare types) also showed higher values for the default output - however, not for all samples. Is it possible that using an input consisting of variable sequence lengths can reduce the sensitivity of dada to detect sequence variants? Or is it rather that the merging in dada2 will inflate the number of sequence variants? Given the large proportion of chimeras detected by the default pipeline, is it possible that the merging step will introduce chimeras?
Thanks a lot!
Cheers, Christiane
Hi @chassenr. We do not recommend merging prior to the dada2 workflow. The quality-score-aware error rates are learned by dada2 under the assumption that they are the same across the entire read, and this is not the case for pre-merged reads: the overlapping part of the reads can have a different relationship between Q and error rate than the non-overlapping parts.
Thanks @benjjneb for the clarification. I am curious, though, because in one of your previous posts (#322) you mentioned that you may consider implementing a 'merge-first' workflow. Are you still pursuing this approach?
@chassenr We have looked into it, and continue to recommend against pre-merging. Particularly if using PEAR actually. The issue is that many of the read-merging programs do not report realistic quality scores for the merged region, which interferes w/ DADA2 as it uses those quality scores. If you must pre-merge, usearch's read-merging performed the best for this purpose in our tests.
We continue to have interest in developing this in the future, as it simplifies and speeds up the workflow a bit (especially for ITS). However, our development philosophy prioritizes accuracy above all else, and until we can show that a pre-merging method provides equivalent (or better) accuracy as post-merging, we will continue to recommend post-merging.
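For background on why the merged-region quality scores matter: when two overlapping base calls agree, the posterior error probability is far lower than either read's alone, and when they disagree it is higher. The formulas below are the published posterior-error expressions from Edgar & Flyvbjerg (2015), which, to my understanding, is what usearch's merger implements; treat this as an illustrative sketch rather than usearch's exact code:

```python
def posterior_error_match(p1, p2):
    """Posterior error probability when two overlapping base calls AGREE,
    given the per-read error probabilities p1 and p2
    (Edgar & Flyvbjerg 2015)."""
    return (p1 * p2 / 3) / (1 - p1 - p2 + 4 * p1 * p2 / 3)

def posterior_error_mismatch(p1, p2):
    """Posterior error probability of keeping the base from read 1 when
    the two calls DISAGREE (p1 <= p2, i.e. read 1 is the more reliable)."""
    return p1 * (1 - p2 / 3) / (p1 + p2 - 4 * p1 * p2 / 3)

# Two agreeing Q20 bases (p = 0.01 each) support each other, so the
# merged base is far more reliable than either read alone (~Q45),
# which is exactly the Q/error-rate relationship dada2's error model
# does not expect within a read.
print(posterior_error_match(0.01, 0.01))
```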
I believe vsearch implements merging reads using PEAR's algorithm, so it's not quite a drop-in replacement for usearch in this respect.
We have now posted our "official" ITS workflow.
Not that it's the only right way to do things, but it's at least one right way.
I would like to suggest an approach that I have found to dramatically improve the application of dada2 to fungal ITS amplicons. Some adapter removal programs use an alignment of the paired reads to trim the "read through" into the opposite direction adapter/primer. This approach allows trimming most shorter sequence reads to the actual length of the sequence of interest and generally removes low quality tails without needing a fixed length truncation or ending up with lots of length variation introduced by a sliding window quality trimmer. The program SeqPurge seems to work well for this, though there are others. I have also been following this trimming with a fixed-length crop of the forward and reverse reads (after looking to see where the quality drops off) but without throwing out shorter reads using the CROP function in trimmomatic (though you have to apply it to the fwd and rev reads separately if you want different crop lengths). A new fork of cutadapt called atropos looks like it will be able to do both of these trimming steps and can crop the fwd and rev reads to different lengths after adapter trimming, and generally seems like a promising tool for pre-processing paired-end ITS reads. However, I imagine cropping without throwing out shorter reads could also probably be implemented in the filterAndTrim function in dada2 fairly easily.
Before adding these two steps I was recovering much lower numbers of fungal ITS sequence variants than I expected with DADA2. Now the numbers are more in line with my expectation and are more similar to what I see with other denoising approaches (i.e., unoise3). So far I think this approach is working well for me, but I would be interested to hear what others think.
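For anyone curious what the overlap-based read-through trimming looks like in principle: align the reverse complement of one read against the other and, when the insert is shorter than the read length, trim both reads back to the insert. A greatly simplified, exact-match-only sketch with made-up sequences (real tools like SeqPurge and atropos score alignments and tolerate mismatches):

```python
def revcomp(seq):
    """Reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))

def trim_readthrough(fwd, rev, min_len=10):
    """If the insert is shorter than the reads, both reads run through
    into adapter. Find the insert length L at which the first L bases of
    the forward read equal the reverse complement of the first L bases
    of the reverse read, and trim both reads to L."""
    for L in range(min(len(fwd), len(rev)), min_len - 1, -1):
        if fwd[:L] == revcomp(rev[:L]):
            return fwd[:L], rev[:L]
    return fwd, rev  # no read-through detected; leave the reads alone

# Hypothetical 20 bp insert with 30 bp reads: the last 10 bases of each
# read are Nextera adapter and are trimmed away.
insert = "ACGTACGTTAGCAGCTTGCA"
fwd = insert + "CTGTCTCTTA"
rev = revcomp(insert) + "CTGTCTCTTA"
trimmed_fwd, trimmed_rev = trim_readthrough(fwd, rev)
print(len(trimmed_fwd), len(trimmed_rev))  # 20 20
```

The payoff is the one described above: short amplicons come out trimmed to their true length without a fixed-length truncation, and long amplicons (no overlap found) are left untouched.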