benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Subsampling a fraction of reads across all samples for error rate estimation #100

Closed elsherbini closed 7 years ago

elsherbini commented 8 years ago

Mentioned in another issue here, it seems the best practice way to estimate the error rates on a subset of the data is to pick some number of samples that should add up to some fraction of the total reads.

I'm wondering if there is a convenient way to get a random sample of reads across all samples. @joey711 mentioned here that it might be important to be careful in doing this, so as to get a representative sample.

My naive idea would be to simply cat all the sample fastqs together and then pick at random, maybe using GNU utils shuf command. If you want to make sure you sample matched pairs of reads, there might be slightly more code involved, but it'd be the same principle.

elsherbini commented 8 years ago

For example, to get a random sample of paired reads using the command line, following this Biostars answer as a guide:

paste <(cat *_R1_*) <(cat *_R2_*) |   # merge all the samples and zip the forward/reverse lines together
awk '{printf("%s", $0); n++; if (n%4==0) {printf("\n")} else {printf("\t\t")}}' |   # put each read pair on one line
shuf | head -n 5000000 |   # shuffle and take the first 5 million pairs
sed 's/\t\t/\n/g' |   # put records back into four-line fastq format
awk -F '\t' '{print $1 > "forward_random_sample.fastq"; print $2 > "reverse_random_sample.fastq"}'   # split into forward and reverse again

It also looks like the ShortRead library has a function for doing this called FastqSampler, so it could be done from within R without too much coding.
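
A rough sketch of that route (file names are placeholders; keeping the pairs matched by calling set.seed with the same seed before each yield is a commonly suggested trick and assumes both files list the reads in the same order):

library(ShortRead)
# Placeholder file names -- substitute your own concatenated forward/reverse fastqs
fnF <- "forward_all.fastq.gz"
fnR <- "reverse_all.fastq.gz"
n <- 5e6
sampF <- FastqSampler(fnF, n = n)
sampR <- FastqSampler(fnR, n = n)
set.seed(100); readsF <- yield(sampF)  # same seed before each yield so the same records are drawn
set.seed(100); readsR <- yield(sampR)
close(sampF); close(sampR)
writeFastq(readsF, "forward_random_sample.fastq.gz")
writeFastq(readsR, "reverse_random_sample.fastq.gz")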

joey711 commented 8 years ago

Thanks for posting this as a new issue, and referencing the previous.

I want to emphasize that my concern is not the mechanics of doing this: awk, command-line, ShortRead::FastqSampler (personally I use the latter, but I'm a BioC/R fan)... these are all pretty trivial to set up if all you need is a uniformly distributed downsampling of the complete set of reads you have amongst all the files you're about to process. Unfortunately, there are obvious constraints beyond read-pairing. For example, the reads need to be from the same sequencing run. We've seen in enough data that the error rates are different for different sequencing runs, even when from the same sample. The fact that the DADA2 algorithm can estimate these rates from the data is one of its major advantages, after all. Knowing that this changes from run to run is critical.

There is another, more tedious constraint that must be addressed, which has to do with the structure of the data itself. Different samples from vastly different ecosystems -- e.g. feces and ocean -- will have very different real sequences in them. If you randomly sample only a shallow number of sequences, say 10 each from 1000 samples on the same run, where the samples come from many different environments, this total number of sequences might satisfy a useful rule of thumb for the total reads required to estimate error rates based on the diversity of any individual environment in the set (e.g. 10,000 reads works for V4 for most microbiomes that are less than or as complex as stool) -- but the effective diversity of this shallow-sampled set is likely much higher than that of any individual sample. This (perhaps drastically) increases the number of total reads you'd likely need to make the same stable and reliable error-rate estimation. Which is to say, this restructuring of the downsampling procedure hasn't won you anything, it has probably lost you something, and it isn't clear how it will behave for microbiomes whose diversity is not yet known.

Since this is the case, and it would take more time/effort to evaluate the appropriateness of the number of reads from each sample for your experiment / sequencing run, it raises the question I asked previously: "Why do you want to do it this way?" What is motivating the question? What problem are you trying to solve? Are you concerned that a subset of your samples have different error behavior than others even though they are on the same run? Are you concerned that picking only a few samples to represent the larger group is not representative?

I recommend answering these latter questions before delving any further into how the procedure might be accomplished.

Finally, the procedure being discussed here -- using the same error matrix across multiple samples -- is the same as assuming that each of those samples was generated with that same underlying error model (but different "real" sequences and abundances). You can check this assumption in your data. For example, let dada2 run in selfConsist=TRUE mode, one sample at a time for some subset of samples, or all of them if you have the time, provided they are all from the same sequencing run. Then compare these separate estimates of each error rate from the same run. They should be very consistent within some expected variance. Hopefully this is the case for your data. The fact that these converge to stable rates with fewer samples and fewer reads than you might have in your full dataset is most of the point, allowing you to avoid unnecessary convergence computation. The other big reason to use pan-sample error-rates is if some of your samples are too small on their own to achieve reliable estimation. In those cases, the dada2 result based on the global error-rates will be more accurate.
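
For example, a minimal sketch of that check using the interface of the time (derepFs here is assumed to be a named list of derep-class objects, all from the same sequencing run):

library(dada2)
# Sample-wise self-consistent error estimation on a few samples
dd.list <- lapply(derepFs[1:4], function(drp) dada(drp, err = NULL, selfConsist = TRUE, multithread = TRUE))
# Pull out each per-sample estimate of the error rates; they should agree closely
errs <- lapply(dd.list, function(dd) dd$err_out)
head(t(sapply(errs, function(e) e["A2C", ])))  # e.g. the A->C rates across quality scores, one row per sample
plotErrors(dd.list[[1]], nominalQ = TRUE)      # or inspect each fit visually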

Hope that helps.

joey

benjjneb commented 8 years ago

> There is another, more tedious constraint that must be addressed, which has to do with the structure of the data itself. Different samples from vastly different ecosystems -- e.g. feces and ocean -- will have very different real sequences in them. If you randomly sample only a shallow number of sequences, say 10 each from 1000 samples on the same run, where the samples come from many different environments, this total number of sequences might satisfy a useful rule of thumb for the total reads required to estimate error rates based on the diversity of any individual environment in the set (e.g. 10,000 reads works for V4 for most microbiomes that are less than or as complex as stool) -- but the effective diversity of this shallow-sampled set is likely much higher than that of any individual sample. This (perhaps drastically) increases the number of total reads you'd likely need to make the same stable and reliable error-rate estimation. Which is to say, this restructuring of the downsampling procedure hasn't won you anything, it has probably lost you something, and it isn't clear how it will behave for microbiomes whose diversity is not yet known.

Joey raises a good point here. If random reads rather than random samples are used to estimate the error rates, there is a risk of significant over-estimation of the error rates, because of the inability to statistically identify the variants spread across many shallowly-sequenced samples.

In practice it would probably work better than that worst-case scenario, but in practice using whole samples to estimate the error rates probably just works better. This is especially true because the reads within a sample are themselves mixed uniformly across the sequencing run.

Leaning towards closing this as not an improvement, but will leave open for another day or two to potentially be convinced otherwise.

elsherbini commented 8 years ago

Two reasons I think this feature would be good:

  1. To help prevent "jackpotting" with a bad sample that may have had contamination or a high prevalence of chimeras due to template concentration / PCR cycle miscalculation, etc. If you have 500 samples in your sequencing run, you estimate the error on 50 of them, and you happen to get a bum sample in there, that sample is over-represented in the error estimate.
  2. Sample size can vary by orders of magnitude, from 100 reads in a sample to 1,000,000. Trying to reason about how many samples to add is hard when you get a distribution of read counts rather than a fixed number. It'd be great to say `estimateError(fns, 0.1)` and just know that you are using 10% of the reads.

My intuition could be off, but it seems to me that subsampling reads from the sequencing run carries with it fewer assumptions than subsampling samples. Of course, if you have samples from vastly different experiments, it could make sense to partition the samples up front.

joey711 commented 8 years ago

Your intuition is definitely off. Not that this is very intuitive; it's not. So thanks for being honest. I'm guessing this post will help other users as well. Perhaps reread my explanation, or better yet, read Ben's concise description of the problem more carefully.

Point (1). Your point about jackpotting is a good one, though. Ben and I are not arguing for using only a pointlessly small number of samples that leaves you worried about effects like outliers. There's no point. You should be able to convince yourself (by comparing with sample-wise error rates) that the rates are converging around representative values for that sequencing run. If you have reason to believe a sample has a problem, or you see later in the data analysis that it exhibited unreliable, artifact-suggestive results, you probably want to make sure that you didn't use it as one of a small number of samples for the error-rate estimation. Provided you didn't get unlucky and pick terrible samples from an otherwise good sequencing run, it shouldn't matter which samples you pick. That's the part that might not be immediately intuitive. The error rates should be consistent across samples regardless of their origin, provided they are in the same sequencing run and there weren't other major confounders that you'd expect to generate different types of errors in an amplicon-sequencing experiment.

Point (2). Sample size variation is not an issue here on its own. In some cases a very small sample size is indicative of a failed sample/prep/seq, and its data should be removed from the dataset. I can see I wasn't clear enough in my description, partly because I was emphasizing that you should stick with the sample-wise subsetting. However, in the example you're describing, the 100-read sample would not be useful for estimating error rates on its own (for most microbiomes), while the 1-million-read sample would probably do a very good job by itself of returning a reliable error-rate estimate that works for the rest of the reads in that run. In practice, a 1-million-read sample might be a sample that was run by itself. But to finish the example anyway, the 1-million-read sample might have its error rates well estimated by a uniform downsampling of 20,000 reads. The actual minimum read depth for convergence depends on the underlying data, but 20,000 is not a bad default value for MiSeq-length V4 data from fecal samples, for example. You can always try a few different downsampled depths from that large file and check at which level the error rates converge. For most microbiomes they would have converged by that size. This is also a way of controlling the computation time (and peak memory? @benjjneb ?) required even when you have very large read depths per sample.
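
For example, a rough sketch of that convergence check on a single deep sample (the file name and depths are placeholders to adjust for your data):

library(dada2); library(ShortRead)
fn <- "big_sample_R1.fastq.gz"   # placeholder: one deeply sequenced sample from the run
depths <- c(5000, 20000, 80000)
errs <- lapply(depths, function(n) {
  fs <- FastqSampler(fn, n = n)
  sub <- yield(fs); close(fs)
  tmp <- tempfile(fileext = ".fastq.gz")
  writeFastq(sub, tmp)
  dd <- dada(derepFastq(tmp), err = NULL, selfConsist = TRUE, multithread = TRUE)
  dd$err_out
})
# If the estimates at 20,000 and 80,000 reads are essentially the same, 20,000 reads is deep enough here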

Re-summary, hopefully more clear:

The point is that too-shallow a sampling from each file (sample/specimen) is clearly a problem, and one that can be avoided by using enough of the reads from each sample included in the error-rate estimation step. How many reads will depend on the data, though, and the typical depth per sample used in current experimental designs is already enough in most cases. Hence the utility of doing it sample-wise. It allows you to avoid (most of) the worrying about choosing/evaluating this per-sample-size parameter.

elsherbini commented 8 years ago

Thank you for taking the time for such a thorough explanation. At the risk of sounding obtuse, I'm still confused. If you could explain the intuition using this extreme example, it might help me see it better: why is 1 read from a million samples not the same as a million reads from one sample for the estimation?

I'm talking about creating one giant "meta-sample" to do the error rate estimation, with reads uniformly subsampled from the original dataset. Again, I must be missing the intuition on why this risks overestimating the true error rates.


benjjneb commented 8 years ago

@elsherbini

Let me elaborate slightly on your extreme example.

Let there be a million samples, each with completely non-overlapping communities, and 1 read from each of them pooled together.

Error rate estimation on this 1-million-read, all-unique pool would discover one real biological sequence and assign the other 1M-1 reads as errors, because without any repeated observations there isn't enough statistical evidence to infer additional real variants. As a result, the error rates would be highly overestimated.

Obviously that is extreme, and in practice 99% of the time randomly sampling reads would be fine. However, randomly choosing samples is likely to be marginally better because (1) there is better statistical power from keeping samples together and (2) the reads from each sample are themselves uniformly distributed within a sequencing run.

I do like your idea of a simplified interface for estimating error rates, something like

estimateErrorRates(derepFs, nreads=1000000)

This function would take care of randomly picking samples up to the number of requested reads, and return the matrix of error rates, which I think is more intuitive than what we do now.
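
A rough sketch of what such a wrapper could look like, built from the existing pieces (derepFastq and dada with selfConsist=TRUE); the function name and arguments here are just the proposal above, not an existing dada2 API at the time of this thread:

# Hypothetical helper -- not (yet) part of dada2
estimateErrorRates <- function(fns, nreads = 1e6) {
  fns <- sample(fns)                           # visit sample files in random order
  dereps <- list(); total <- 0
  for (f in fns) {
    drp <- dada2::derepFastq(f)
    dereps[[f]] <- drp
    total <- total + sum(drp$uniques)          # reads contributed by this sample
    if (total >= nreads) break
  }
  # Joint self-consistent estimation across the selected samples; the error rates are shared
  dd <- dada2::dada(dereps, err = NULL, selfConsist = TRUE, multithread = TRUE)
  if (inherits(dd, "dada")) dd$err_out else dd[[1]]$err_out
}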

elsherbini commented 8 years ago

Yep, the extreme example helped me understand the potential problem. Thanks for your patience! I'm planning to use dada2 for a lot of fairly large datasets, so I want to make sure I am using it correctly and efficiently.

joey711 commented 8 years ago

Thanks for your questions! I'd rather you keep at it and eventually figure it out than just give up because we didn't explain it well enough.

I'm not sure where Ben got "99% of the time". I'm guessing he's assuming that most experiments are going to include samples that are relatively similar and have a lot of the same sequences, thus giving some chance for a true sequence and its errors to be observed enough times to estimate the error rates.

I do like the extreme "one sequence per sample" example. It drives home the point that you need the repeated observations of the same sequence, including the ones with errors, for this to work. If you structurally exclude (or, short of that extreme, reduce) repeated observations of reads derived from the same real sequence, then you're going to do a worse job estimating the error rates.

The good news is that it seems to have inspired Ben to add a feature to help with the details of doing this "error-rate pooling". I'm also excited about that feature, and I was going to propose it until I noticed that Ben was already on it. Nice work, @benjjneb !

Cheers

benjjneb commented 7 years ago

learnErrorRates added in 85b3498ff112395488f3de25fe439c1f8f8fb2e6

Implements "picking samples up to the number of requested reads, and return the matrix of error rates". plotErrors was extended somewhat to continue to work with the output of learnErrorRates.