benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
471 stars 142 forks source link

Combining dada2 runs #716

Closed claireonthecouch closed 5 years ago

claireonthecouch commented 5 years ago

I have output from multiple sequencing runs that I want to analyze together, and some data from the same sequencing runs that I want to analyze separately. So my questions are: 1) should I combine datasets from different sequencing runs before or after running dada2? 2) should I split up data that I want to analyze separately (I.e. from different studies) before or after running dada2?

benjjneb commented 5 years ago

should I combine datasets from different sequencing runs before or after running dada2?

After. Just make sure to trim each run to the same gene region (i.e. same trimLeft for merged paired end data, and same trimLeft and truncLen for single-end data) to allow merging later.

should I split up data that I want to analyze separately (I.e. from different studies) before or after running dada2?

Probably after. You would only split before if you think the error process is different between the studies (perhaps because of different PCR protocols) as that is what matters to dada2.

claireonthecouch commented 5 years ago

Thank you! What are the reasons/justification for processing the data in this order? Is the error rate estimation dependent on sequencing run?

benjjneb commented 5 years ago

Is the error rate estimation dependent on sequencing run?

Yes it can be. If you are running your own samples on your own machine, probably the same run-to-run, but different machines can have different empirical relationships between the assigned quality scores and the error rates, hence it is recommended to learn the error rates on a run-by-run basis.

taramclancy commented 5 years ago

I have a related question, in my case I'm looking to merge reads from three sequencing runs targeting the V4-V5 region. run 1 was 2x250 bp and had poor quality reverse reads. The other two runs were 2x150 bp. My approach is to use only the forward read of run 1 with truncLen = 240 bp. Then for the runs 2 and 3, I use trimLeft to get the merged reads to be right around 240 bp (most are at 240 bp, but there are a few (~100 sequence variants out of 11,000 in total) that are +/- 2 bp from 240). This leaves me with two questions:

1) Is there a problem with having slightly variable read lengths (mostly +/- 2 bp from 240) from two of the runs, while for the first run, variants are at maximum 240 bp?

2) Is there a way to truncate the length after reads are merged?

The case that I've been thinking about is there may be two sequence variants that have exactly the same sequence for the 240 bp, but since one is 1 bp longer then it would be identified as a unique variant?

benjjneb commented 5 years ago

I have a related question, in my case I'm looking to merge reads from three sequencing runs targeting the V4-V5 region. run 1 was 2x250 bp and had poor quality reverse reads. The other two runs were 2x150 bp.

Are you perhaps using the V4 region? V4V5 is longer than 300 nts, you won't get reads to merge at all.

My approach is to use only the forward read of run 1 with truncLen = 240 bp. Then for the runs 2 and 3, I use trimLeft to get the merged reads to be right around 240 bp (most are at 240 bp, but there are a few (~100 sequence variants out of 11,000 in total) that are +/- 2 bp from 240).

I don't think you want to use trimLeft at all for your paired reads. Most importantly, you want the reads from both runs to be starting in the exact same place. If you use trimLeft on one of the runs, that won't be happening.

Is there a way to truncate the length after reads are merged?

Sure, just truncate the reads in your sequence table: colnames(st) <- substr(colnames(st), 1, 240)

The case that I've been thinking about is there may be two sequence variants that have exactly the same sequence for the 240 bp, but since one is 1 bp longer then it would be identified as a unique variant?

That is true, but the collapseNoMismatch function will fix this for you. The purpose of that function is for cases like this. It will collapse together variants with no mismatches or internal indels but that could differ by terminal gaps, i.e. variants that differ by their lengths and nothing else.

taramclancy commented 5 years ago

yes, sorry this is the V4

I don't think you want to use trimLeft at all for your paired reads. Most importantly, you want the reads from both runs to be starting in the exact same place. If you use trimLeft on one of the runs, that won't be happening.

I think I should clarify, I used trimLeft on the reverse reads to trim 13 bp (without trimming my merged lengths were 253, so by adding this trimming I get 240 bp merged reads, and if I'm following correctly I am trimming off the start of the reverse read so this would result in keeping the 240 bp that would correspond to the 240 bp lengths of the forward only reads from my run 1 and they would all start at the same place?

However, maybe the better approach is to just merge the reads without trimming and truncate with substr as you mention above? Then perform collapseNoMismatch?

benjjneb commented 5 years ago

I think I should clarify, I used trimLeft on the reverse reads to trim 13 bp (without trimming my merged lengths were 253, so by adding this trimming I get 240 bp merged reads, and if I'm following correctly I am trimming off the start of the reverse read so this would result in keeping the 240 bp that would correspond to the 240 bp lengths of the forward only reads from my run 1 and they would all start at the same place?

I see, that is probably OK then. It doesn't fix everything though because there is length variation in the V4 locus (i.e. some biological V4 segments are e.g. 252 or 255 nts long).

However, maybe the better approach is to just merge the reads without trimming and truncate with substr as you mention above? Then perform collapseNoMismatch?

That's probably what I would do. But you can probably just merge the data you already have and run collapseNoMismatch, I think you'll get nearly identical output either way.

shreyaskumbhare commented 5 years ago

Hi Ben @benjjneb ,

Although I have read your comments from different threads on combining the data from two different sequencing runs, I will really appreciate if you can clarify my following doubt:

I have data from two different MiSeq runs and I understand that the pre-processing of these two runs should be carried out separately and merged later before the chimera removal step. I am concerned about the TruncLen function. As the quality profile for both these runs is different, I have to use a different TruncLen based on the quality for both these runs. My question is, is this difference still fine to proceed with the downstream analysis? (I have made sure that I use the same TrimLeft value for both, as the primers and targeted region is the same).

Can you please let me know what would be a good option with some explanation. Thank you!

benjjneb commented 5 years ago

It's fine to use different truncLen parameters as long as the read pairs still overlap and are mergable at the end. The truncLen setting doesn't affect the merged amplicon region, it just affects the amount of overlap between the two reads.

shreyaskumbhare commented 5 years ago

Thank you! I will proceed with truncating them as per the quality profiles for the specific run and will check if there is still a good overlap and that I receive a good amount of merged reads.

ctekellogg commented 4 years ago

Hi there, this thread has been quite helpful; but I have 2 follow up questions.

(1) I work with microbial time series data so data is always streaming in and thus working with ASV's provides many advantages to having to re-cluster with OTU-based methods. so, thanks! but, with this, run quality varies among runs. so, i just want to confirm one quick thing. i gather from above that it is okay to use different trunLen settings. Is it okay for trimLeft and maxEE settings to be different among runs as well. This would be lovely, given some of my runs are a bit noisier than others and require a bit more of trim-left of read2 than others likely do. Or maybe I need to continue to do this and always use settings that work for the noisiest run and apply them across all runs (this is what i have been doing, but reading this thread makes me question if i actually do need to do this)?

(2). I have been using the QIIME2 implementation of dada2. I don't think the collapseNoMismatch function is available in that implementation, but i think it will come in handy if I don't have to trim and truncate every run exactly the same over the next several years. Could i just export the output tables, import them into R, and then do this step? Those tables look like this:

Screen Shot 2020-01-29 at 12 40 38 PM

Would I need to change the OTUID column to something else?

Thank you so much. I have been puzzling over this for a while and am happy to have found this thread!

Colleen

ctekellogg commented 4 years ago

Regarding this question - i think i figured out that i need to covert the MD5 Hashes to actual sequences for the R implementation of dada2 and am testing some ways to do this in a streamlined manner in QIIME2.

(2). I have been using the QIIME2 implementation of dada2. I don't think the collapseNoMismatch function is available in that implementation, but i think it will come in handy if I don't have to trim and truncate every run exactly the same over the next several years. Could i just export the output tables, import them into R, and then do this step? Those tables look like this:

Screen Shot 2020-01-29 at 12 40 38 PM

Would I need to change the OTUID column to something else?

Which I guess just leaves the first part of my question above remaining to be answered (to the degree that it can actually be answered).

Thank you!

colleen

benjjneb commented 4 years ago

(1) I work with microbial time series data so data is always streaming in and thus working with ASV's provides many advantages to having to re-cluster with OTU-based methods. so, thanks! but, with this, run quality varies among runs. so, i just want to confirm one quick thing. i gather from above that it is okay to use different trunLen settings. Is it okay for trimLeft and maxEE settings to be different among runs as well. This would be lovely, given some of my runs are a bit noisier than others and require a bit more of trim-left of read2 than others likely do. Or maybe I need to continue to do this and always use settings that work for the noisiest run and apply them across all runs (this is what i have been doing, but reading this thread makes me question if i actually do need to do this)?

What I would strongly recommend is that you keep the same trimLeft parameters throughout. Typically, trimLeft should only be used to remove primer sequence (or other technical bases) present at the starts of reads, it should not generally be used as a form of quality control. This is most important, because as long as trimLeft stays the same (and you are using paired end data that you are merging) the final sequences you end up with will cover the same genetic region, i.e. they will span from the forward primer to the reverse primer, and thus can be simply merged between studies and datasets.

maxEE and truncLen can be modified from dataset to dataset as important, because (as long as truncLen remains long enough to merge) they do not affect the genetic region of the final merged sequence -- they just affect how the data is filtered. Hence, variation in those doesn't preclude simple merging between datasets.

ctekellogg commented 4 years ago

Hi! Thank you for the quick response and explanation.

I removed my primers via cutadapt. The reason I was planning to use a trimLeft(0,25) setting is due to the following: (forgive that these are from qiime2. :)) Screen Shot 2020-01-28 at 12 03 09 PM

2 of my 6 MiSeq runs have this Quality drop at the beginning of R2, so I thought it would be good to trim that off (and thus, trim for all 6 runs). Would you tend to agree or would these runs still merge without this trimming (guess i didn't even give it a shot since without the trimming...perhaps i should)?

Thank you so much for your advice! Colleen

benjjneb commented 4 years ago

That is an ugly dropoff in quality early in R2.

Yes, it might be appropriate to enforce trimLeft=c(0,25) in this case. You'll just want to make sure you do the same in every dataset you want to simply merge together with these, even if those future datasets don't have this same dropoff.

ctekellogg commented 4 years ago

Okay. Thanks! And yes, frustrating and nastry dropoff indeed!

But alas, since these samples are part of a time series, I don't want to drop the 600 samples from the dataset so I guess trimming is the only option short of resequencing....

Thanks again. I very much appreciate your thoughts (and validation of my intuition!).

Colleen

blancaverag commented 4 years ago

I still have some questions regarding merging different runs from dada2.

Particularly, I am interested in understanding exactly why the same trimLeft parameter should be used in all runs. From what I read in the description of collapseNoMismatch, I thought that the function would deal with different lengths in either end ("This function can be thought of as implementing greedy 100% OTU clustering, with end-gapping is ignored"). I guess this is not the case and that is why the trimLeft parameter should be common to all runs, is that it?

But then, since

trimLeft should only be used to remove primer sequence (or other technical bases) present at the starts of reads, it should not generally be used as a form of quality control.

If one would need to remove primers from run A but reads from run B would come with primers already removed, for example, using the same trimLeft on the merged pairs of both runs would generate different starts and would lead to problems to obtain the expected behavior of collapseNoMismatch, am I wrong? So in that case, there would be no need to use trimLeft in both runs, right? Is there any way to effectively check that the start of the merged reads one feeds collapseNoMismatch with is the same? Would the program warn of that? Would it deal with 1-2 nt differences in the left end or it requires the start to be exactly the same?

Thank you in advance!!

benjjneb commented 4 years ago

@blancaverag I think you understand everything pretty well actually.

For simple merging (i.e. using mergeSequenceTables) you want the ASVs in each sequence table to start and end at the same point. Thus, usually you want to trim your sequences so that in the end you get sequences that span from primer to primer (w/o including the primers themselves).

So in that case, there would be no need to use trimLeft in both runs, right?

Yep, you would just use it in the first case. When we say that trimLeft should be the same in all cases, we are implicitly assuming all the data was from the same library preparation. But if there is a mix of w/ primer and w/o primer libraries, what you suggest doing is exactly right.

Is there any way to effectively check that the start of the merged reads one feeds collapseNoMismatch with is the same? Would the program warn of that? Would it deal with 1-2 nt differences in the left end or it requires the start to be exactly the same?

These kinds of details are why it's ideal to have ASVs already trimmed to span the same sequence region. I don't think that collapseNoMismatch will consistently collapse onto the ASV starting at a fixed nucleotide position. I think it might just pick the most abundant ASV to collapse length variation into. Although I'd have to check the code to make sure! Almost certainly won't be consistent in the sense of always collapsing onto the variant starting at exactly 16S position XX though.

ghost commented 4 years ago

Dear @benjjneb, this is indeed a very useful thread, though I still have a follow-up question: I too have samples from two Illumina MiSeq runs (~300 samples each), all samples use the same primer pair (515F, 926R) and I would like to merge them into one phyloseq object. Originally i had run both sequencing batches independently through dada2 pipeline (1.14), using the same trimLeft (=24) for both , informed by cutadapt. I then wrapped them both independently to phyloseq and tried merging them with merge_phyloseq before further processing and abundance filtering and such. However, I was surprised to find out that these two runs only shared roughly 40% of each other ASVs, even though the samples are expected to be very similar. This made me wonder whether I should have merged the seqtables already before chimera removal and taxonomy assignment, right after dada-algorithm and before any phyloseq business? I'm a bit lost in what would be the reasonable order of doing all this.

Thanks so much for any help on this!

benjjneb commented 4 years ago

I should have merged the seqtables already before chimera removal and taxonomy assignment, right after dada-algorithm and before any phyloseq business?

Yes. Particularly before chimera removal, so as to make consistent chimera identifications across the batches.

However, I was surprised to find out that these two runs only shared roughly 40% of each other ASVs

More important is the fraction of reads in the shared ASVs. This should be much higher if the samples are indeed supposed to be very similar.

ptpell77 commented 4 years ago

Hi, This thread has been enormously helpful. I am working with an array of 454 and Illumina sequences for ITS data, using Forward reads only with the same F primers. My question requires a clarification of a sliding trimLeft cutoff based on highly variable sequence profiles among runs (Illumina vs. Illumina) & (Illumina vs. 454).

However for unmerged ITS reads, the amplicon length is biologically variable, and even if the ASV's all start at the same position, "having ASV trimmed to span the same sequence region" seems to negate the directives of trimming to the same length of ITS reads. The ultimate goal is to use the collapseNoMismatch command to combine the ASV files. @claraqin

benjjneb commented 4 years ago

@ptpell77

What you want is for, after truncation, the same biological sequence to results in the same sequencing reads. Two ways to do this -- truncate at a fixed position after the end of the forward primer, or truncate at the start of the reverse primer.

For ITS data, especially forward read only data, you may need to do both because of the length variability in that locus, which will cause some reads to extend to the reverse primer while others do not reach it.

One way to do this would be to truncate your 454/Illumina reads to a common length (maybe 250nts for everything, or less depending on where quality crashes), and then to do a second step in which you use an external program to remove the reverse primers from the ends of the reads (when they exist). Our ITS tutorial shows how to do that part with cutadapt: https://benjjneb.github.io/dada2/ITS_workflow.html

ptpell77 commented 4 years ago

Hi Ben, Thanks greatly for this explanation. Sorry to belabor the point, but could you explain why collapseNoMismatch dosen't perform as well if the amplicons are of different lengths, and/or the ASV are derived from a slightly different coding region?

benjjneb commented 4 years ago

why collapseNoMismatch dosen't perform as well if the amplicons are of different lengths, and/or the ASV are derived from a slightly different coding region?

collapseNoMismatch will still perform as expected, but the underlying denoising will perform with different sensitivities if you are trimming to different gene regions. So, you are making a batch effect that doesn't need to be there by that approach. That's why we recommend against that... much better to trim to the same gene region and work that way instead.

luigallucci commented 3 days ago

Hi @benjjneb,

I'm adding an additional comment to this thread because is not totally clear for me. collapseNoMismatch is performing a kind of dereplication after the merging of two seqtab? In case this is not correct, how this affect the seqtab?

benjjneb commented 2 days ago
?collapseNoMismatch

collapseNoMismatch {dada2} R Documentation

Combine together sequences that are identical up to shifts and/or length.

Description

This function takes as input a sequence table and returns a sequence table in which any sequences that are identical up to shifts or length variation, i.e. that have no mismatches or internal indels when aligned, are collapsed together. The most abundant sequence is chosen as the representative of the collapsed sequences. This function can be thought of as implementing greedy 100% OTU clustering with end-gapping ignored.

luigallucci commented 2 days ago

Thank you for the reply.

So, is suggested to use after merging different run to avoid redundancy?

benjjneb commented 2 days ago

So, is suggested to use after merging different run to avoid redundancy?

It is only suggested if you are merging runs together that, due to primer choice of trimming choices, have ASVs that start/stop at different positions than each other.

If the ASVs from each run start/stop at the same place (e.g. at the positions defined by a common primer set) then collapseNoMismatch is not needed and not recommended.