Closed shreyaskumbhare closed 6 years ago
Yes, you can do what you are suggesting.
My suggestion would be that you process the V3V4 and V4 datasets all the way to the ASV table first, then trim the V3V4 sequences to the V4 region. At that point you should be able to merge the datasets together, as the ASVs will cover the same gene region.
Thank you for the reply. I would like to know whether this trimming can be done in DADA2 by first aligning the V3-V4 assembled reads to a reference sequence (well annotated for the 16S rRNA gene regions) and then trimming them down to only the V3 region. If yes, I would really appreciate it if you could share the details. Thanks again!
The dada2 R package doesn't support aligning and trimming sequences based on external reference sequences. I would recommend you look at the DECIPHER R package from @digitalwright and the Biostrings package from Bioconductor.
Hi, I am back with a query. I haven't started with the analysis, however have gone through the tutorial and pipeline of DADA2. From what I understand trimming the V3 region from the V3-V4 data after alignment will need a lot of computational power and time. So I have now decided to go with your suggestion of trimming down the V3V4 to V3 after constructing the ASV table. Can you please elaborate a bit on this? Thanks!
From what I understand trimming the V3 region from the V3-V4 data after alignment will need a lot of computational power and time.
I don't think it would. You would just need to match the V3 reverse primer to your denoised sequences, and then cut at that position. Since denoising reduces the number of sequences so much, that's a pretty "easy" problem computationally.
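As a rough illustration of how small that problem is, here is a minimal base-R sketch of cutting denoised sequences at a primer match. The helper names (primer_to_regex, cut_before_primer), the toy sequences, and the toy primer are all made up for illustration. A real primer has to be supplied in the same orientation as the ASVs (for a reverse primer, that usually means its reverse complement), and this sketch only finds exact matches after expanding IUPAC ambiguity codes.

```r
# IUPAC nucleotide codes mapped to regular-expression character classes
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           R = "[AG]", Y = "[CT]", S = "[CG]", W = "[AT]",
           K = "[GT]", M = "[AC]", B = "[CGT]", D = "[AGT]",
           H = "[ACT]", V = "[ACG]", N = "[ACGT]")

# Hypothetical helper: turn a (possibly degenerate) primer into a regex
primer_to_regex <- function(primer) {
  codes <- iupac[strsplit(toupper(primer), "")[[1]]]
  stopifnot(!anyNA(codes))  # fail loudly on non-IUPAC characters
  paste(codes, collapse = "")
}

# Hypothetical helper: keep everything upstream of the first primer match;
# sequences without a match come back as NA so they can be inspected
cut_before_primer <- function(seqs, primer) {
  hit <- as.integer(regexpr(primer_to_regex(primer), seqs))
  ifelse(hit > 0, substr(seqs, 1, hit - 1), NA_character_)
}

# Toy data, not real 16S sequences or primers
asvs <- c("TTGACCAGTACGGAT", "TTGACGACTACGGA", "TTGACCGG")
trimmed <- cut_before_primer(asvs, "ASTACGG")
print(trimmed)
```

Returning NA for unmatched sequences is a deliberate choice here, so that ASVs lacking the primer site are not silently kept at full length. Dedicated tools such as cutadapt or DECIPHER's TrimDNA() additionally tolerate mismatches, which this sketch does not.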
So I have now decided to go with your suggestion of trimming down the V3V4 to V3 after constructing the ASV table. Can you please elaborate a bit on this?
If going that route, you are probably going to want to use an external software program, such as cutadapt or trimmomatic, and truncate your reads (pre or post-merging) at the V3 reverse primer.
Note that the DECIPHER package (TrimDNA function) can also trim reads based on primer sequences and/or quality scores.
Thanks!
Erik, It would be great to provide an example with a few lines of code for trimming primers as this is the most faq on the github issues forum!! Thanks for providing such a great tool! Susan
-- Susan Holmes John Henry Samter Fellow in Undergraduate Education Professor, Statistics 2017-2018 CASBS Fellow, Sequoia Hall, 390 Serra Mall Stanford, CA 94305 http://www-stat.stanford.edu/~susan/
@spholmes @benjjneb Hi, I am a bit confused with this trimming part. I would really appreciate it if someone could clear up my doubts:
Now here is my question: do the demultiplexed sequences obtained at the end still contain the first adapter (overhang green colored in the above image)? Because tools like FastQC show no adapter content in these demultiplexed sequence files.
I would really appreciate if someone can clarify these doubts, thanks!
Responding to @shreyaskumbhare :
Typically your Illumina sequences start with your amplicon primer, although it depends on your experimental design. Per the DADA2 documentation, you can trim the 5'-end of reads by specifying the trimLeft argument in filterAndTrim().
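A minimal sketch of that fixed-length approach, assuming every read begins with the complete primer and nothing else (no heterogeneity spacers or variable-length padding). The primers are the two used as examples in this thread; the file names are hypothetical placeholders, and the dada2 call is only attempted if the package and the files are actually present.

```r
# Primer sequences used as examples in this thread
fwd_primer <- "CCTACGGGNGGCWGCAG"
rev_primer <- "GACTACHVGGGTATCTAATCC"

# Number of bases to remove from the 5' end of forward and reverse reads
trim_left <- c(nchar(fwd_primer), nchar(rev_primer))
print(trim_left)

# Hypothetical file names; replace with your own paths
fnF   <- "sample_R1.fastq.gz"
fnR   <- "sample_R2.fastq.gz"
filtF <- file.path("filtered", "sample_R1_filt.fastq.gz")
filtR <- file.path("filtered", "sample_R2_filt.fastq.gz")

# Only runs when dada2 is installed and the input files exist
if (requireNamespace("dada2", quietly = TRUE) && all(file.exists(fnF, fnR))) {
  out <- dada2::filterAndTrim(fwd = fnF, filt = filtF,
                              rev = fnR, filt.rev = filtR,
                              trimLeft = trim_left)
  print(out)
}
```

Because trimLeft removes a fixed number of bases regardless of their content, it is worth confirming that the reads really do start with the primer before relying on it.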
Responding to @spholmes :
The DECIPHER package has a function, TrimDNA(), for filtering reads based on quality scores and trimming primers off the ends of sequences. It takes a DNAStringSet as input, along with any expected leftPatterns (5'-end) or rightPatterns (3'-end). It will find any patterns (e.g., primer sequences) that overlap with the ends of the sequences and remove them.
For example, suppose a set of forward reads used the primer CCTACGGGNGGCWGCAG (note that ambiguities, such as "N", are supported). This could be removed from the sequences in R with:
> library(DECIPHER, quiet=TRUE)
> dna <- readDNAStringSet("path/to/filename_R1.fastq.gz", format="fastq")
> trimmed <- TrimDNA(dna, type="sequences", "CCTACGGGNGGCWGCAG", "")
Finding left pattern: 81.8% internal, 0.1% flanking
Time difference of 0.53 secs
Similarly, we could remove GACTACHVGGGTATCTAATCC from the reverse reads with:
> dna <- readDNAStringSet("path/to/filename_R2.fastq.gz", format="fastq")
> trimmed <- TrimDNA(dna, type="sequences", "GACTACHVGGGTATCTAATCC", "")
Finding left pattern: 97.6% internal, 0.2% flanking
Time difference of 0.52 secs
The TrimDNA() function can also filter low quality regions from the start and ends of reads:
> dna <- readDNAStringSet("path/to/filename_R1.fastq.gz", format="fastq", with.qualities=TRUE)
> trimmed <- TrimDNA(dna, "", "", type="sequences", quality=PhredQuality(mcols(dna)$qualities))
Trimming by quality score: 20% left, 100% right
Time difference of 0.43 secs
I recommend reading ?TrimDNA for more information.
Super. This will be a very useful reference.
Hello all, I would like to ask you for some help. My aim is to compare 16S amplicon datasets obtained in two different runs using two different primer sets. Some data were obtained by amplifying the V3V4 region with the 341F/785R primers (group 1); the other datasets were obtained by amplifying the V4 region with the 515F/806R primers (group 2). As Dr. Callahan suggested, I have processed both groups of data separately and obtained a seqtab for each group (seqtab 1 and seqtab 2). Next, I used cutadapt with the 515F primer sequence to trim the V3V4 datasets (group 1), so both groups now start at the same position. A fasta file with the trimmed sequences was generated. Now I should modify seqtab 1 and implement these trimmed data to seqtab 1. I would ask you for some advice: how to do that. The analyzed fragment should have a fixed length, in my case it should be 785-515=270 nucl. So my second question is: how to trim the group 2 to that length? And further, how to modify seqtab2 taking into account the trimmed data? How should I deal with that? Is the next step to merge the datasets using mergeSequenceTables and remove chimeras? Thank you in advance.
@JoBrz
The analyzed fragment should have a fixed length, in my case it should be 785-515=270 nucl.
Your sequences will not all have that fixed length, because there is biological variation in the lengths of different segments of the 16S gene. Those coordinate values are taken from the E. coli gene and are not the same for all other taxa.
Now I should modify seqtab 1 and implement these trimmed data to seqtab 1. I would ask you for some advice: how to do that.
Read in the cutadapted sequences, and replace the sequence names in seqtab 1 with those new sequences. Assuming it is a fasta file, this is straightforward:
sq.cut <- getSequences("myseqs_cutadapt.fa")
colnames(seqtab1) <- sq.cut
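One thing to watch for with this renaming: it assumes the trimmed fasta holds exactly one sequence per original ASV, in the same order, and after trimming some previously distinct ASVs may become identical, which leaves duplicate column names. A minimal base-R sketch of checking the first and fixing the second is below. The toy table and sequences are made up, and sq.cut is typed in by hand here rather than read from a file.

```r
# Toy sequence table: 2 samples x 3 ASVs (made-up sequences and counts)
seqtab1 <- matrix(c(5L, 0L, 3L, 2L, 1L, 4L), nrow = 2,
                  dimnames = list(c("sampleA", "sampleB"),
                                  c("AAACCCGGG", "AAACCCGGT", "TTTGGGCCC")))

# Stand-in for the trimmed sequences; the first two are now identical
sq.cut <- c("AAACCC", "AAACCC", "TTTGGG")

# The renaming is only valid if nothing was dropped or reordered
stopifnot(length(sq.cut) == ncol(seqtab1))
colnames(seqtab1) <- sq.cut

# Sum the counts of columns that now share the same sequence,
# keeping columns in order of first appearance
seqtab1.collapsed <- t(rowsum(t(seqtab1), group = colnames(seqtab1),
                              reorder = FALSE))
print(seqtab1.collapsed)
```

If cutadapt was run with an option that discards untrimmed sequences, the length check will fail, and the trimmed sequences would need to be matched back to the original columns by their fasta identifiers instead.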
So my second question is: how to trim the group 2 to that length? And further, how to modify seqtab2 taking into account the trimmed data?
You can perform the same cutadapt strategy, this time to trim the seqtab2 sequences based on the 785R primer. This would result in sequences from both datasets spanning the region from 515F-785R, which would then be directly comparable and appropriate for mergeSequenceTables.
Hi All, Thanks for the very helpful thread when dealing with this challenge. Like others I have sequences from 341F-785R and 515F-785R, and I have processed them through the tutorial pipeline to get a sequence table for each dataset. Where I am stuck is how I would go about trimming the 341F-785R sequences with a tool like cutadapt. When I use the getSequences command I get a character vector, but I think cutadapt needs a fasta/fastq file to work?
One general-purpose approach that we used in our recent meta-analysis of the vaginal microbiome and preterm birth is:
To obtain comparable ASVs among datasets, we divided the datasets into two groups based on the region of the 16S gene that was sequenced (V1-V2 and V4) with five datasets in the V1-V2 group and seven datasets in the V4 group. Then we truncated the original ASVs separately for each group to a common V1-V2 or V4 region in three steps: (1) align the original ASVs to the SILVA reference database using the mothur software (Schloss et al.,2009); (2) identify the overlapping sequencing region common to all ASVs in the group using an alignment visualization tool (MSAviewer); (3) truncate the original ASVs and remove alignment gaps using the extractalign and degapseq commands.
That may be a bit more heavy-weight than is needed when you are putting together just two datasets (we were dealing with 5-8 datasets with varying primers sets), but will do the job.
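For the lighter cutadapt route asked about above, the missing step is getting the ASVs out of R and into a fasta file. A minimal base-R sketch follows, using a made-up toy table in place of a real sequence table. The ASV1, ASV2, ... identifiers and the output location are arbitrary choices for illustration.

```r
# Toy sequence table standing in for a real one (made-up sequences)
seqtab <- matrix(c(10L, 0L, 4L, 7L), nrow = 2,
                 dimnames = list(c("sampleA", "sampleB"),
                                 c("ACGTACGTAC", "TTGCAATTGC")))

# The ASV sequences are the column names of the table
asvs <- colnames(seqtab)
ids  <- paste0("ASV", seq_along(asvs))

# Write one header line and one sequence line per ASV
fasta_path <- file.path(tempdir(), "asvs.fa")  # arbitrary location
writeLines(paste0(">", ids, "\n", asvs), fasta_path)
print(readLines(fasta_path))

# Outside R, cutadapt could then be pointed at this file, e.g. with a
# 5' primer given via -g and an output file via -o; keep the records in
# the same order so they can be matched back to the table columns.
```

Keeping the identifiers tied to column position makes it straightforward to map the trimmed sequences back onto the sequence table afterwards.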
Dear Dr. Benjamin, I have been reading your DADA2 article published in Nature Methods. I am working on a microbiome project in which I need to compare microbiome (16S amplicon) data from two cohorts. However, the sequence data of the first cohort is of the V3-V4 region sequenced with 2 x 300 bp chemistry on Illumina, while the sequence data of the other cohort is of the V4 region sequenced with 2 x 250 bp chemistry on the Illumina platform. In order to compare these cohorts, I was wondering if the following things can be done using DADA2, and I also seek your suggestions on them:
Assemble the paired-end reads and then trim out the V3 region from the assembled reads of cohort 1 (V3-V4 region).
Use these trimmed reads and then compare with the other cohort (V4 region) data.
I have been using QIIME and mothur previously, however I am beginner in using DADA2, so would really appreciate if you could suggest and help me in resolving the above mentioned issue. Thanks!