benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
470 stars 142 forks source link

ITS workflow in dada2 #395

Closed anmwinter closed 6 years ago

anmwinter commented 6 years ago

Morning all,

I am using dada2 1.5.2 in RStudio. I have have a large dataset (~250 samples) of ITS amplicons from one of the early 2x300 bp machines and chemistry. The data set was hanging around until a bioinformatics person came along (me!).

I have a mock community (8 memebers) that was run along with the samples. I've dug through the dada2 issues here and took a stab at running the mock communities.

The quality is pretty ragged as to be expected from the early machine and chemistry. quality_v_cycle_mock_r1

quality_v_cycle_mock_r2

I've tried a few things in dada2:

`out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(150,125), maxN=0, maxEE=c(6,6), truncQ=3, rm.phix=TRUE, compress=TRUE, multithread=TRUE)

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, minLen=50, maxN=0, maxEE=c(6,6), truncQ=3, rm.phix=TRUE, compress=TRUE, multithread=TRUE)`

mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, minOverlap = 20, maxMismatch = 3, ,returnRejects = TRUE, verbose=TRUE)

I've held the mergedPairs constant while fiddling with the filterAndTrim. The second filter and trim allows many more sequences through. However in both cases we only recover 4 OTUs at most. We know from some other software such as usearch and vsearch we can recover 6-8 OTUs. Two of the OTUs are in super low abundance.

I can share the raw mock community data run on the older 2x300 bp machine and chemistry.

We also ran the mock community on several different machines and as recently as 6 months ago. This is much higher quality data but not useful for the project I am working on. I can share this as well once I get a copy.

Library construction details of the samples. The mock community used the same primers Soil cores were collected, pooled and kept on liquid N until lyophilized. DNA was extracted using the MoBio PowerMax kit then normalized to 4ng/uL. Each soil sample was amplified using 2 barcoded primers with 4 replicates each. The primers included Illumina adaptors and golay barcodes. The base PCR primers were 5.8S_LT1 (AACTTTYRRCAAYGGATCWCT) and ITS4_mod_long (AGCCTCCGCTTATTGATATGCTTAART).

benjjneb commented 6 years ago

@nagasridhar and I are currently working on improving our workflow specifically for ITS data, so we would definitely be interested in taking a look at that mock community data if you are willing to share!

anmwinter commented 6 years ago

@benjjneb here is the link: https://drive.google.com/open?id=1adZW0oY0fIZU5yS1CgRGbzWDh1WIzBRI

I'll need to dig up the ancillary data like DNA concentrations and such. There is Mock A and Mock B. These are the same 8 OTUs and primers but different barcodes.

nagasridhar commented 6 years ago

@bioinfonm Thank you for sharing the data. Based on the naming of the files, I would assume the primers are trimmed. Is that correct.

Thanks!

anmwinter commented 6 years ago

@nagasridhar Sorry! Yes, I used cutadapt on them to remove barcodes, primers, etc.

ChrisTrivedi commented 5 years ago

Hi, I know this is a closed issue, but I figured this might be the best place to post since I think my problem is more with my data rather than how DADA2 is handling it.

I am also working with PE 2x300 bp ITS data. In this case, using a primer set for ITS2.

I removed primers via the ITS workflow instructions: Before -

FWD.ForwardReads  276748          0       0       0
FWD.ReverseReads      63          0       0      19
REV.ForwardReads      87          0       0      72
REV.ReverseReads  216949          0       0       0

After -

FWD.ForwardReads       0          0       0       0
FWD.ReverseReads      63          0       0       0
REV.ForwardReads      87          0       0       0
REV.ReverseReads       0          0       0       0

So far so good. Inspecting quality profiles - F looked okay, R looks pretty poor (expected on v3 2x300):

Screen Shot 2019-03-23 at 9 19 58 PM

Filter and trimming is where I'm getting hung up - with default options my reads tank:

                                       reads.in reads.out
IS18-2-2-E03_S21_L001_R1_001.fastq.gz   288435        35

It's definitely a maxEE error after some testing. If I bump it to maxEE=c(3, 15) things recover nicely:

                                       reads.in reads.out
IS18-2-2-E03_S21_L001_R1_001.fastq.gz   288435    215747
IS18-5-2-F03_S22_L001_R1_001.fastq.gz   360163    261926
IS18-7-2-G03_S23_L001_R1_001.fastq.gz   251727    193627

I know there have been many discussions about maxEE here but is this too high? It stands to reason that since ITS is more variable than 16S/18S it could probably be set higher in this case, but I don't know. Trudging along I generated error profiles and my woes with the reverse reads continue:

Screen Shot 2019-03-23 at 9 15 07 PM

At what point do I stop because I can't be confident of my downstream results? I guess this is more of a "what would you do in this situation?" post. Any help or info is greatly appreciated. Thanks.

benjjneb commented 5 years ago

At what point do I stop because I can't be confident of my downstream results? I guess this is more of a "what would you do in this situation?" post. Any help or info is greatly appreciated. Thanks.

First off just wanted to say that you are doing all the right sanity checking here. I agree that you should be concerned about these R read results. Even for Illumina 2x300, that is poor reverse read quality, and the low-Q bases start early in the reads. The plotErrors plots show that dada2 is having a difficult time fitting its error model, and is probably significantly overestimating the error rates at high quality scores.

The first thing I would try is to see if "contaminating" artefactual sequences might be driving this, by which I mean off-target sequences (e.g amplified from some eukaryotic source) or low complexity sequences. To check the first, I would dereplicate one reverse file, and then just BLAST some of the sequences against nt, e.g dada2:::pfasta(getSequences(derepFastq("sampleX_rev.fastq"))[1:20]) and then BLAST against nt, to see if most of them are hitting ITS sequences or not. Second, I would check the complexity profile of a sample to see if there is a large low-complexity mode: hist(seqComplexity(getSequences("sampleX_rev.fastq")), 100).

Most likely this is just a quality issue though. In that case, I would drop the reverse reads and just work with the forward reads, or I would pre-merge the reads. Pre-merging is not recommended for the dada2 pipeline, but can be better than independent processing then merging in cases like this. I would probably just work with the forward-reads myself though.

ChrisTrivedi commented 5 years ago

Hi Ben,

Thank you for the speedy response and well though-out reply. Also, thank you for a really awesome piece of R software!

First off just wanted to say that you are doing all the right sanity checking here. I agree that you should be concerned about these R read results. Even for Illumina 2x300, that is poor reverse read quality, and the low-Q bases start early in the reads. The plotErrors plots show that dada2 is having a difficult time fitting its error model, and is probably significantly overestimating the error rates at high quality scores.

This helps to know that I'm at least taking the right steps. I'm not sure why the R reads look so poor, but it may have to do with the ITS2 region we are trying to sequence. For reference, we also had 16S and 18S on this run, and while the R reads leave something to be desired, they aren't nearly as bad as what we see here for the ITS.

The first thing I would try is to see if "contaminating" artefactual sequences might be driving this, by which I mean off-target sequences (e.g amplified from some eukaryotic source) or low complexity sequences. To check the first, I would dereplicate one reverse file, and then just BLAST some of the sequences against nt, e.g dada2:::pfasta(getSequences(derepFastq("sampleX_rev.fastq"))[1:20]) and then BLAST against nt, to see if most of them are hitting ITS sequences or not. Second, I would check the complexity profile of a sample to see if there is a large low-complexity mode: hist(seqComplexity(getSequences("sampleX_rev.fastq")), 100).

I will certainly do this and let you know what I discover. Thanks for the tips.

Most likely this is just a quality issue though. In that case, I would drop the reverse reads and just work with the forward reads, or I would pre-merge the reads. Pre-merging is not recommended for the dada2 pipeline, but can be better than independent processing then merging in cases like this. I would probably just work with the forward-reads myself though.

This is what I was considering as well. It's hard to make that decision to drop the reverse reads - I keep telling myself that there's got to be some way to salvage them. Thanks for your help. Much appreciated!