benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

dada2 - filtering parameters and batch effects #876

Closed yakshiUPR closed 4 years ago

yakshiUPR commented 4 years ago

Dear all,

I am working with two ITS datasets (each with 96 samples, sequenced in two MiSeq runs) from rhizosphere and bulk soil samples. These samples were PCR-amplified using the 5.8S-Fun and ITS4-Fun primers. Although I processed the datasets through qiime2, I followed the dada2 R workflow suggested for ITS (https://benjjneb.github.io/dada2/ITS_workflow.html) and filtered and trimmed each dataset (ds1 and ds2) using maxEE = 2 and truncQ = 2. Afterwards, I generated an ASV table and assigned taxonomy. During the filtering and trimming steps I lost 80% of the reads in ds1 and 63% of the reads in ds2. This prompted me to evaluate the effects of different maxEE and truncQ combinations on the number of sequences and ASVs (Fig. 1) using ds1.
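For reference, a minimal sketch of the kind of filterAndTrim call I used, following the ITS workflow (the path vectors cutFs/cutRs/filtFs/filtRs are placeholder names from that tutorial, not my actual variables):

```r
# Sketch of the ITS-workflow filtering step: no truncLen (ITS lengths vary),
# quality filtering done via maxEE and truncQ instead.
library(dada2)
out <- filterAndTrim(cutFs, filtFs, cutRs, filtRs,
                     maxEE = c(2, 2), truncQ = 2, minLen = 50,
                     rm.phix = TRUE, compress = TRUE, multithread = TRUE)
head(out)  # reads.in vs. reads.out shows the fraction of reads surviving per sample
```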

[Figure image: dada2-truncq] Figure 1. The effect of different maxEE and truncQ values on the number of sequences and features (in ds1). The solid line corresponds to the number of features, while the dotted line refers to the total number of sequences.

I noticed that low truncQ values removed too many reads in the filtering step, while with high truncQ values many reads failed to merge. Using maxEE and truncQ values of 1 and 8, respectively, helped me retain more reads and ASVs, yet I started to observe a significant batch effect in the combined-dataset analyses (Fig. 2).

[Figure image: pcoa-dada2] Figure 2. Unweighted UniFrac PCoA. The color differentiates the MiSeq runs. The bigger points are two technical replicates that were included in both MiSeq runs.

This batch effect disappears if I select the other parameter combinations listed in Table 1, but those combinations don't convince me, for the reasons stated above.

Table 1. Total number of reads and maximum number of reads per sample in the two datasets, before and after filtering with different dada2 parameter combinations.

|     |                             | Pre-dada2  | maxEE 0.5, truncQ 2 | maxEE 2, truncQ 2 | maxEE 1, truncQ 8 | maxEE 2, truncQ 20 |
|-----|-----------------------------|------------|---------------------|-------------------|-------------------|--------------------|
| ds1 | Total number of reads       | 15,583,074 | 333,735             | 3,157,865         | 4,955,296         | 169                |
| ds1 | Max num of reads per sample | 2,073,362  | 55,164              | 461,819           | 678,860           | 44,266             |
| ds2 | Total number of reads       | 11,958,867 | 1,658,972           | 4,417,740         | 2,879,610         | 952,968            |
| ds2 | Max num of reads per sample | 432,423    | 74,593              | 191,075           | 122,394           | 148,757            |

Do you think this batch effect might be related to dada2 or do you think it is related to the sequencing?

It might be important to point out that there is a sample in ds1 that has 2M reads, while the rest of the samples have ~300,000 reads. Is this something that you think might be causing some issues with the sample inference step in dada2?

I noticed that the parameters that I prefer to use (maxEE =1 & truncQ = 8) might not be optimal for both datasets, thus that leaves me with two additional questions:

Is there an easier way of finding the optimal parameters to filter each dataset?

Is it valid to filter different datasets using different parameters and then combine those datasets to do the rest of the analyses?

Thanks for an awesome tool and for any help or insights that you can bring,

Yakshi

benjjneb commented 4 years ago

To clarify: Did you remove the primers, both forward and reverse, as in the DADA2 ITS tutorial? Also, can you post the results of plotQualityProfile from an example sample (forward and reverse) from each of the batches?

Typically, I do not change truncQ and rely on maxEE to do the quality filtering. This is partly because truncQ based trimming makes the denoising step somewhat less efficient later, since it introduces unnecessary length variation into the data.
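If it helps, plotQualityProfile can be run directly on the fastq files; a sketch, with placeholder file names:

```r
library(dada2)
# Quality profiles for one sample (forward and reverse) from each batch;
# file names are placeholders for the primer-trimmed fastqs.
plotQualityProfile(c("ds1_sample1_R1.fastq.gz", "ds1_sample1_R2.fastq.gz"))
plotQualityProfile(c("ds2_sample1_R1.fastq.gz", "ds2_sample1_R2.fastq.gz"))
```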

Do you think this batch effect might be related to dada2 or do you think it is related to the sequencing?

It's almost certainly related to the sequencing in some way, but we would like to understand if dada2 is exacerbating or ameliorating that batch-to-batch variability. If operating properly, it should actually reduce the variability somewhat.

It might be important to point out that there is a sample in ds1 that has 2M reads, while the rest of the samples have ~300,000 reads. Is this something that you think might be causing some issues with the sample inference step in dada2?

I don't think so. It can cause effects in the subsequent ordination, though, as dramatic depth differences are known to interact with common community dissimilarity metrics, especially unweighted UniFrac. For example, it is possible the batch effect you are seeing is entirely driven by different average library depths. I'd try another ordination with a metric less affected by library depth (e.g. Bray-Curtis) and see if the ordination batching is reduced.
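A sketch of such a comparison using phyloseq (assuming a phyloseq object `ps` with a sample-data column `Run`; both names are placeholders):

```r
library(phyloseq)
library(ggplot2)
# Bray-Curtis is abundance-weighted, so it is less driven by the
# presence/absence of rare taxa than unweighted UniFrac.
ord.bray <- ordinate(ps, method = "PCoA", distance = "bray")
plot_ordination(ps, ord.bray, color = "Run") + ggtitle("PCoA (Bray-Curtis)")
```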

Is it valid to filter different datasets using different parameters and then combine those datasets to do the rest of the analyses?

Yes, as long as the final amplicon covers the same gene region.

One other thing to consider -- some folks (including myself) have found that using just forward reads in ITS data is as or more effective than using merged ITS reads (e.g. this recent publication). One important reason for this is that merging typically results in the loss of certain taxa with ITS regions so long that the paired reads don't overlap at all.

yakshiUPR commented 4 years ago

benj,

Thanks for replying.

To clarify: Did you remove the primers, both forward and reverse, as in the DADA2 ITS tutorial?

Yes, I did remove the primers. I am trying to re-do the quality plots from dada2 in R but I am getting an error. So, while I deal with that, here are the quality plots from qiime2 after removing the primers.

[Quality plot image: ds1 (QualityPlot-Plate1)]

[Quality plot image: ds2 (QualityPlot-Plate2)]

Typically, I do not change truncQ and rely on maxEE to do the quality filtering. This is partly because truncQ based trimming makes the denoising step somewhat less efficient later

Does this mean that by adding length variation to the reads, I might end up with what is supposed to be one ASV split into multiple ASVs?

...it is possible the batch effect you are seeing could be entirely driven by different average library depths.

The ordination above was done after rarefying all the samples to the same depth. Thus, I don't think that the batch effect that I am seeing is due to a depth issue, right?

forward reads in ITS data is as or more effective than using merged ITS reads

I will definitely look into this. Thanks!

Yakshi

benjjneb commented 4 years ago

Does this mean that by adding length variation to the reads, I might end up with what is supposed to be one ASV split into multiple ASVs?

This can happen, but usually they will still be collapsed together. The bigger issue is that you will lose sensitivity: when a single real ASV is split over many different truncated lengths, it is harder to detect it as real.

The ordination above was done after rarefying all the samples to the same depth. Thus, I don't think that the batch effect that I am seeing is due to a depth issue, right?

Agree.

I will definitely look into this. Thanks!

I would try forward reads alone, and (although we don't generally recommend this) pre-merged reads as well, to see if those approaches reduce the apparent batch effects. The challenge with this data is that it's ITS, so we don't want to use a truncation length, but the quality also really drops off towards the ends of the reads, which makes everything noisier and more difficult.

yakshiUPR commented 4 years ago

What do you mean by a "pre-merged reads approach"? To merge the sequences prior to the truncQ?

The challenge with this data is that it's ITS, so we don't want to use a truncation length, but the quality also really drops off towards the ends of the reads, which makes everything noisier and more difficult.

I know! This has been the real struggle. So, should I try the forward reads only, using truncQ = 8? Or do you not recommend this at all?

Yakshi

benjjneb commented 4 years ago

What do you mean by a "pre-merged reads approach"? To merge the sequences prior to the truncQ?

Merged before denoising using some external program like PEAR or usearch.

So, should I try the forward reads only, using truncQ = 8? Or do you not recommend this at all?

I'd probably try it first just with truncQ=2, but you could experiment with truncQ from there.
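A forward-read-only run might look something like this sketch (path vectors are placeholders; truncQ left at its default of 2):

```r
library(dada2)
# Filter forward reads only, then denoise; no merging step is needed.
out <- filterAndTrim(cutFs, filtFs, maxEE = 2, truncQ = 2,
                     minLen = 50, multithread = TRUE)
errF <- learnErrors(filtFs, multithread = TRUE)
dadaFs <- dada(filtFs, err = errF, multithread = TRUE)
seqtab <- makeSequenceTable(dadaFs)
```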

yakshiUPR commented 4 years ago

Hello again,

I have some updates! As mentioned before, I am using the 5.8S-Fun (forward) and ITS4-Fun (reverse) primers (Taylor et al. 2016 - https://aem.asm.org/content/82/24/7217). These switch the Illumina adapters, so the forward and reverse reads are inverted.

You suggested that I use only the forward read. So, I tried using only the forward and only the reverse reads to run dada2. I did both approaches, since the forward reads are actually the reverse and vice versa.

I found some contrasting results. When I use the forward reads (reverse primer), the batch effect disappears. However, when I use the reverse reads (forward primer) the batch effect is still present. When I do the PCoA for this one, the separation of the datasets is along the second axis instead of the first one (as in the original post).

Thus, I have two additional questions: 1) I did not reverse-complement the reverse reads to put them in the forward direction. Is this something that needs to be done before running dada2 or assigning taxonomy? Or do you think that this batch effect in the reverse reads is more related to the sequencing quality?

2) Can I use the forward reads (reverse primers) for the analyses, although these reads do not cover the initial part of the amplicon?

Thanks a lot for all your help!

Yakshi

benjjneb commented 4 years ago
  1. I did not reverse the reverse reads to make them forward direction. Is this something that needs to be done before running dada2 or assign taxonomy?

Not before running dada2. And for assignTaxonomy you can just give it the flag tryRC=TRUE and it will also try the reverse-complement orientation for taxonomy assignment.
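For example (the sequence-table name and the reference fasta path are placeholders):

```r
library(dada2)
# tryRC=TRUE also tests the reverse-complement of each query sequence
# against the reference database during classification.
taxa <- assignTaxonomy(seqtab.nochim, "path/to/unite_reference.fasta.gz",
                       tryRC = TRUE, multithread = TRUE)
```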

Or do you think that this batch effect in the reverse reads is more related to the sequencing quality?

The fact that the batch effect seems to be specific to the reverse reads, and not the forward reads, certainly suggests it is related to sequence quality issues. It could just be the lower qualities of one run, or some unremoved artefact.

  1. Can I use the forward reads (reverse primers) for the analyses, although these reads do not cover the initial part of the amplicon?

Yes, they are perfectly valid sequences. Given what you have seen (batch effect specific to using the reverse reads) this sounds like it might be the right approach.

yakshiUPR commented 4 years ago

Hi Benj, Thanks for all your help with the dada2 parameters with ITS samples. These were very helpful.

I hope you don't mind that I seize this opportunity to ask about a similar issue that I am having with some 16S samples.

In this case, I am working with 3 independent MiSeq runs (let's call these datasets ds1, ds2, and ds3). ds1 and ds2 have 96 samples each and were sequenced in the same facility a few months apart. ds3, on the other hand, was sequenced around two years later in another facility. This last dataset contained technical replicates of ds1 and ds2, as well as other new samples. I am getting a strong batch effect with ds3 (Figure 1).

[Figure image: emperor-16S] Figure 1. Unweighted UniFrac PCoA. Colors correspond to the dataset (ds1 - red; ds2 - blue; ds3 - yellow).

Here are a few things that I noticed while studying the data and testing different dada2 parameters:

  1. The total number of sequences varied a lot between ds1/2 and ds3: ds1 - ~9.7 million reads; ds2 - ~7.1 million reads; ds3 - ~1.7 million reads. This is not supposed to affect the PCoA (which is rarefied), but I don't know if the sequencing depth of ds3 was so shallow that the species detected on this plate differ from those in ds1 and ds2.

  2. Along with the total number of reads, the total number of features is also reduced in ds3.

  3. I tried different approaches for dada2, including changing the parameters (trunc-len and maxEE) and using paired ends, only forward reads, and only reverse reads. In all the different approaches that I used, I always got the batch effect.

While using the paired ends, I noticed that the merged read lengths of ds1 and ds2 varied (210-350 bp), while the ds3 length was fixed at 250 bp (stdev 4 bp).

Do you have any suggestions? Should I move forward without resolving this batch effect?

Thanks!

Yakshi

benjjneb commented 4 years ago

While using the paired ends, I noticed that the merged read lengths of ds1 and ds2 varied (210-350 bp), while the ds3 length was fixed at 250 bp (stdev 4 bp).

How are you merging the datasets together? The ASVs will be different between ds3 and the other two if the merged reads are not the same length.
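For what it's worth, within dada2 itself per-run tables can be combined, and length/shift variants of the same sequence collapsed; a sketch (the per-run seqtab objects are placeholders):

```r
library(dada2)
# Combine per-run sequence tables into one table with shared ASV columns
st.all <- mergeSequenceTables(seqtab.ds1, seqtab.ds2, seqtab.ds3)
# collapseNoMismatch merges ASVs that are identical except for
# length or shift differences (no internal mismatches)
st.all <- collapseNoMismatch(st.all)
```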

yakshiUPR commented 4 years ago

I will try to clarify.

  1. I used dada2 for each ds separately.
  2. I checked the lengths of the merged (F and R) rep-seqs of each dataset. The distribution of lengths in ds1 and ds2 varied, while the ds3 rep-seq lengths remained constant.
  3. I merged all three datasets using qiime2, and assigned taxonomy.

The outcome of these steps was the batch effect. The thing is that although there is length variation among the datasets (and within ds1 and ds2) when using the paired-end reads, there might be something else causing the differences in ASVs in ds3. When I tried dada2 with only the forward reads, I checked and made sure that all rep-seqs were the same length, and I still saw the batch effect.

benjjneb commented 4 years ago

The fact that you saw a different length distribution of ASVs in ds3 than you did in ds1/2 strongly suggests to me that there is some important technical choice, such as the primer set used, the sequencing instrument, or the PCR protocol, that is different as well. Could that be the case? If so, that is the most likely explanation for the batch effect, which could be caused by the different sensitivities of the methodologies to each taxon: https://elifesciences.org/articles/46923

Also, how are these datasets being merged in qiime2 when the identified ASVs are not the same length in ds1/2 as they are in ds3? On the basis of taxonomy?

yakshiUPR commented 4 years ago

Sorry for the delayed response. And thanks for all your help.

The fact that you saw a different length distribution of ASVs in ds3 then you did in ds1/2 strongly suggests to me that there is some important technical choice, such as the primer set used or the sequencing instrument or PCR protocol, that is different as well. Could that be the case?

The only thing, to my knowledge, that changed between ds1/2 and ds3 was the sequencing facility. The instrument for all datasets was a MiSeq, and the primers were the same in all datasets. About the PCR protocol, I am not so sure. I did the first two datasets, but the third was done by the sequencing facility. I am awaiting their response on this matter, but I guess that the sequencing protocol is the same.

Also, how are these datasets being merged in qiime2 when the identified ASVs are not the same length in ds1/2 as they are in ds3? On the basis of taxonomy?

Just by adding the rep-seqs of all datasets together. I assume that rep-seqs that potentially belong to the same ASV but have different sequence lengths remain as different ASVs. But I checked using only forward or only reverse reads, and the length difference disappeared, yet the batch effect is still present.

I checked your article, Consistent and correctable bias in metagenomic sequencing experiments, and also the metacal R package. If I understood right, at this moment you can only correct for bias if you sequenced mock communities, right? I did not use mock communities, but I did include technical replicates. I saw in the Author's Response section of the article that using technical replicates to correct biases is an approach that might be developed in the future, but that it is not yet possible with metacal. Am I correct?

If this is true, then should I just keep going on without correcting for the bias? Or do you have any other suggestion?

Thanks again.

benjjneb commented 4 years ago

I checked your article, Consistent and correctable bias in metagenomic sequencing experiments, and also the metacal R package. If I understood right, at this moment you can only correct for bias if you sequenced mock communities, right? I did not use mock communities, but I did include technical replicates. I saw in the Author's Response section of the article that using technical replicates to correct biases is an approach that might be developed in the future, but that it is not yet possible with metacal. Am I correct?

Correct, we do not yet have solutions for correcting bias in natural datasets. (But we are working on it!)

If this is true, then should I just keep going on without correcting for the bias? Or do you have any other suggestion?

I would dig in a little more to the difference in the ASV length distributions between ds3 and ds1/2. That is the single thing that is most suggestive to me of a fixable technical bias. For me, that would include getting a full explanation from the sequencing center that did ds3, but perhaps also looking at the most abundant ASVs in ds1/2/3 -- are they the same ASVs but different lengths perhaps?
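One way to do both checks in R; a sketch assuming per-run sequence tables named seqtab.ds1/seqtab.ds3 (placeholders):

```r
library(dada2)
# ASV length distribution in one run's sequence table
table(nchar(getSequences(seqtab.ds3)))

# Top 10 most abundant ASVs per run (column names of a seqtab are the sequences)
top.ds3 <- names(sort(colSums(seqtab.ds3), decreasing = TRUE))[1:10]
top.ds1 <- names(sort(colSums(seqtab.ds1), decreasing = TRUE))[1:10]

# Is each top ds3 ASV contained within (a longer version of) a top ds1 ASV?
sapply(top.ds3, function(s) any(grepl(s, top.ds1, fixed = TRUE)))
```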

yakshiUPR commented 4 years ago

I would dig in a little more to the difference in the ASV length distributions between ds3 and ds1/2. That is the single thing that is most suggestive to me of a fixable technical bias. For me, that would include getting a full explanation from the sequencing center that did ds3, but perhaps also looking at the most abundant ASVs in ds1/2/3 -- are they the same ASVs but different lengths perhaps?

I talked to the sequencing facility, and the protocol that they followed appears to be similar to the one we used. The only thing that I can think of is that the sequencing facility I used for ds1 and ds2 only sequenced 96 samples per run (only my samples), while the sequencing facility for ds3 sequenced more than 96 samples per run, and not all of them were mine. There is nothing else I can think of that might be causing these differences.

I would dig in a little more to the difference in the ASV length distributions between ds3 and ds1/2

I agree, there is something weird with ds3 when I try to merge the forward with the reverse reads. It does not make sense that the read length distributions are so different. This is an example of the rep-seqs length distribution of all 3 datasets after using dada2 with maxEE = 0.5 and trimming F - 210, R - 155.

| ds  | 2%  | Mean | StDev | 98% |
|-----|-----|------|-------|-----|
| ds1 | 221 | 255  | 16    | 322 |
| ds2 | 219 | 256  | 18    | 323 |
| ds3 | 252 | 253  | 4     | 254 |

I obtain results similar to this one using different length parameters, which does not make sense.

However, I am not entirely sure that this is the main issue. I tried dada2 using only forward or only reverse reads and cut all reads (of all datasets) to the same length, but I still see the batch effect. I also tried collapsing the reads by taxonomy, and the batch effect is still present at all taxonomic levels, although it is less intense.

The top 6 most abundant ASVs are shared among all datasets. However, the relative abundances are somewhat different. For example, one of these ASVs belongs to the genus Rhodoplanes. The relative abundance of that ASV differs between ds1/2 and ds3 (it is higher in ds3). However, when I take into consideration other ASVs that were also identified to this genus, all three datasets have more or less the same relative abundance.

benjjneb commented 4 years ago

TBH at this point I think it is best to continue the conversation at the metacal package site, as this doesn't really appear to be a dada2 issue, but rather an issue with batch effects between sequencing runs, probably(?) driven by differences in the metagenomics bias between those runs.

yakshiUPR commented 4 years ago

@benjjneb,

Yes, I agree. Thank you very much for all your help. This whole conversation was very enlightening.

Yakshi