Comparing data from two Illumina chemistries (16S amplicon sequencing)

shreyaskumbhare commented 6 years ago

Dear Dr. Benjamin, I have been reading the DADA2 article of yours published in Nature methods. I am working on a microbiome project in which I need to compare microbiome (16S amplicon) data from two cohorts. However the sequence data of the first cohort is of V3-V4 region sequenced with 2 x 300 bp chemistry on Illumina, while the other cohort sequence data is of V4 region sequenced with a 2 x 250 bp chemistry on Illumina platform. In order to compare these cohorts I was wondering if following things can be done using DADA2 and I also seek your suggestions on the following:

Assemble the paired end reads and than trim out the V3 region from the assembled reads of cohort 1 (V3-V4 region).
Use these trimmed reads and than compare with the other cohort (V4 region) data.

I have been using QIIME and mothur previously, however I am beginner in using DADA2, so would really appreciate if you could suggest and help me in resolving the above mentioned issue. Thanks!

benjjneb commented 6 years ago

Yes, you can do what you are suggesting.

My suggestion would be that you process the V3V4 and V4 datasets all the way to the ASV table first, then trim the V3V4 sequences to the V4 region. At that point you should be able to merge the datasets together, as the ASVs will cover the same gene region.

shreyaskumbhare commented 6 years ago

Thank you for the reply. I wish to know whether this trimming part can be done firstly by aligning the V3-V4 assembled reads with a reference sequence (well annotated for the 16S rRNA gene regions) and then trim them down only to V3 region, in DADA2? If yes, I would really appreciate if you can share the details? Thanks again!

benjjneb commented 6 years ago

The dada2 R package doesn't support aligning and trimming sequences based on external reference sequences. Inwould recommend you look at the DECIPHER R package from @digitalwright and the Biostrings package from Bioconductor.

shreyaskumbhare commented 6 years ago

Hi, I am back with a query. I haven't started with the analysis, however have gone through the tutorial and pipeline of DADA2. From what I understand trimming the V3 region from the V3-V4 data after alignment will need a lot of computational power and time. So I have now decided to go with your suggestion of trimming down the V3V4 to V3 after constructing the ASV table. Can you please elaborate a bit on this? Thanks!

benjjneb commented 6 years ago

From what I understand trimming the V3 region from the V3-V4 data after alignment will need a lot of computational power and time.

I don't think it would. You would just need to match the V3 reverse primer to your denoised sequences, and then cut at that position. Since denoising reduces the number of sequences so much, that's a pretty "easy" problem computationally.

So I have now decided to go with your suggestion of trimming down the V3V4 to V3 after constructing the ASV table. Can you please elaborate a bit on this?

If going that route, you are probably going to want to use an external software program, such as cutadapt or trimmomatic, and truncate your reads (pre or post-merging) at the V3 reverse primer.

digitalwright commented 6 years ago

Note that the DECIPHER package (TrimDNA function) can also trim reads based on primer sequences and/or quality scores.

shreyaskumbhare commented 6 years ago

Thanks!

spholmes commented 6 years ago

Erik, It would be great to provide an example with a few lines of code for trimming primers as this is the most faq on the github issues forum!! Thanks for providing such a great tool! Susan

On Wed, Jul 18, 2018 at 10:52 AM, digitalwright notifications@github.com wrote:

Note that the DECIPHER package (TrimDNA function) can also trim reads based on primer sequences and/or quality scores.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/benjjneb/dada2/issues/509#issuecomment-405959653, or mute the thread https://github.com/notifications/unsubscribe-auth/ABJcvRgif80EIh5X-m5zGe4tUU2ynJPjks5uH0vFgaJpZM4VA9ya .

-- Susan Holmes John Henry Samter Fellow in Undergraduate Education Professor, Statistics 2017-2018 CASBS Fellow, Sequoia Hall, 390 Serra Mall Stanford, CA 94305 http://www-stat.stanford.edu/~susan/

shreyaskumbhare commented 6 years ago

@spholmes @benjjneb Hi, I am bit confused with this trimming part. I would really appreciate if someone can clear my doubts for once:

I have been reading on this Illumina chemistry and found that there are introduction of adapters and primers twice during the workflow: once during the initial amplification from the genomic DNA and then the adapters with the indexes, will be much clearer from the image below:

Now here is my question: do the demultiplexed sequences obtained at the end still contain the first adapter (overhang green colored in the above image)? Because tools like FastQC show no adapter content in these demultiplexed sequence files.

How exactly the primers affect the variant detection in DADA2?

I would really appreciate if someone can clarify these doubts, thanks!

digitalwright commented 6 years ago

Responding to @shreyaskumbhare :

Typically your Illumina sequences start with your amplicon primer, although it depends on your experimental design. Per the DADA2 documentation, you can trim the 5'-end of reads by specifying the trimLeft argument in filterAndTrim().

Responding to @spholmes :

The DECIPHER package has a function, TrimDNA(), for filtering reads based on quality scores and trimming primers off the ends of sequences. It takes a DNAStringSet as input, along with any expected leftPatterns (5'-end) or rightPatterns (3'-end). It will find any patterns (e.g., primer sequences) that overlap with the ends of the sequences and remove them.

For example, if I have a set of forward reads that used the primer: CCTACGGGNGGCWGCAG (note that ambiguities, such as "N", are supported) This could be removed from the sequences in R with:

> library(DECIPHER, quiet=TRUE)
> dna <- readDNAStringSet("path/to/filename_R1.fastq.gz", format="fastq")
> trimmed <- TrimDNA(dna, type="sequences", "CCTACGGGNGGCWGCAG", "")
Finding left pattern: 81.8% internal, 0.1% flanking

Time difference of 0.53 secs

Similarly, we could remove GACTACHVGGGTATCTAATCC from the reverse reads with:

dna <- readDNAStringSet("path/to/filename_R2.fastq.gz", format="fastq")
> trimmed <- TrimDNA(dna, type="sequences", "GACTACHVGGGTATCTAATCC", "")
Finding left pattern: 97.6% internal, 0.2% flanking

Time difference of 0.52 secs

The TrimDNA() function can also filter low quality regions from the start and ends of reads:

> dna <- readDNAStringSet("path/to/filename_R1.fastq.gz", format="fastq", with.qualities=TRUE)
> trimmed <- TrimDNA(dna, "", "", type="sequences", quality=PhredQuality(mcols(dna)$qualities))
Trimming by quality score: 20% left, 100% right

Time difference of 0.43 secs

I recommend reading ?TrimDNA for more information.

spholmes commented 6 years ago

Super. This will be a very useful reference.

On Fri, Aug 3, 2018, 18:13 digitalwright notifications@github.com wrote:

Responding to @shreyaskumbhare https://github.com/shreyaskumbhare :

Typically your Illumina sequences start with your amplicon primer, although it depends on your experimental design. Per the DADA2 documentation, you can trim the 5'-end of reads by specifying the trimLeft argument in filterAndTrim().

Responding to @spholmes https://github.com/spholmes :

The DECIPHER package has a function, TrimDNA(), for filtering reads based on quality scores and trimming primers off the ends of sequences. It takes a DNAStringSet as input, along with any expected leftPatterns (5'-end) or rightPatterns (3'-end). It will find any patterns (e.g., primer sequences) that overlap with the ends of the sequences and remove them.

For example, if I have a set of forward reads that used the primer: CCTACGGGNGGCWGCAG (note that ambiguities, such as "N", are supported) This could be removed from the sequences in R with:

library(DECIPHER, quiet=TRUE) dna <- readDNAStringSet("path/to/filename_R1.fastq.gz", format="fastq") trimmed <- TrimDNA(dna, type="sequences", "CCTACGGGNGGCWGCAG", "") Finding left pattern: 81.8% internal, 0.1% flanking

Time difference of 0.53 secs

Similarly, we could remove GACTACHVGGGTATCTAATCC from the reverse reads with:

dna <- readDNAStringSet("path/to/filename_R2.fastq.gz", format="fastq")

trimmed <- TrimDNA(dna, type="sequences", "GACTACHVGGGTATCTAATCC", "") Finding left pattern: 97.6% internal, 0.2% flanking

Time difference of 0.52 secs

The TrimDNA() function can also filter low quality regions from the start and ends of reads:

dna <- readDNAStringSet("path/to/filename_R1.fastq.gz", format="fastq", with.qualities=TRUE) trimmed <- TrimDNA(dna, "", "", type="sequences", quality=PhredQuality(mcols(dna)$qualities)) Trimming by quality score: 20% left, 100% right

Time difference of 0.43 secs

I recommend reading ?TrimDNA for more information.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/benjjneb/dada2/issues/509#issuecomment-410389865, or mute the thread https://github.com/notifications/unsubscribe-auth/ABJcvehtXBjMn6eBcYBm7-mhLhPHYvXQks5uNMsOgaJpZM4VA9ya .

JoBrz commented 4 years ago

Hello all, I would like to ask you for some help. My aim is to compare 16S amplicon datasets obtained in two different runs and using two different primers sets. Some data was obtained by amplified V3V4 region using 341F/785R primers (group 1), the other datasets were obtained by amplified V4 region using 515F/806R primers (group 2). As Dr. Callahan suggested, I have processed both groups of data separately and obtained the seqtab for each group (seqtab 1 and seqtab 2). Next, I have used cutadapt and 515F primer sequence to trim the V3V4 datasets (group 1). Thus, both groups start on the same position. The fasta file with trimmed sequences was generated. Now I should modify seqtab 1 and implement these trimmed data to seqtab 1. I would ask you for some advice: how to do that. The analyzed fragment should have a fixed length, in my case it should be 785-515=270 nucl. So my second question is: how to trim the group 2 to that length? And further, how to modify seqtab2 taking into account the trimmed data? How should I deal with that? Is the next step to merge datasets using MergeSequenceTables and remove chimeras ? Thank you in advance.

benjjneb commented 4 years ago

@JoBrz

The analyzed fragment should have a fixed length, in my case it should be 785-515=270 nucl.

Your sequences will not all have that fixed length, there is biological length variation in the lengths of different segments of the 16S gene. Those coordinates values are taken from the E. coli gene and are not the same for all other taxa.

Now I should modify seqtab 1 and implement these trimmed data to seqtab 1. I would ask you for some advice: how to do that.

Read in the cutadapted sequences, and replace the sequence names in seqtab 1 with those new sequences. Assuming it is a fasta file, this is straightforward:

sq.cut <- getSequences("myseqs_cutadapt.fa")
colnames(seqtab1) <- sq.cut

So my second question is: how to trim the group 2 to that length? And further, how to modify seqtab2 taking into account the trimmed data?

You can perform the same cutadapt strategy, this time to trim the seqtab2 sequences based on the 785R primer. This would result in sequences from both datasets spanning the region from 515F-785R, which would then be direclty comparable and appropriate for mergeSequenceTables.

dbro970 commented 1 year ago

Hi All, Thanks for the very helpful thread when dealing with this challenge. Like others I have sequences from 341F-785R and 515F-785R, I have processed them through the tutorial pipeline to get a sequence table for each dataset. Where I am stuck is how would I go about trimming the 341F-785R through a tool like cutadapt, when I use the getSequence commands I get a list of characters but I think cutadapt needs a fasta/fastq file to work?

benjjneb commented 1 year ago

One general purpose appraoch that we used in our recent meta-analysis of the vaginal microbiome and preterm birth is:

To obtain comparable ASVs among datasets, we divided the datasets into two groups based on the region of the 16S gene that was sequenced (V1-V2 and V4) with five datasets in the V1-V2 group and seven datasets in the V4 group. Then we truncated the original ASVs separately for each group to a common V1-V2 or V4 region in three steps: (1) align the original ASVs to the SILVA reference database using the mothur software (Schloss et al.,2009); (2) identify the overlapping sequencing region common to all ASVs in the group using an alignment visualization tool (MSAviewer); (3) truncate the original ASVs and remove alignment gaps using the extractalign and degapseq commands.

That may be a bit more heavy-weight than is needed when you are putting together just two datasets (we were dealing with 5-8 datasets with varying primers sets), but will do the job.

benjjneb / dada2

Comparing data from two Illumina chemistries (16S amplicon sequencing) #509