ccagc / QDNAseq

QDNAseq package for Bioconductor

Reproducible results? #96

Closed: sxv closed this 3 years ago

sxv commented 3 years ago

Hi,

I'm running some low-pass WGS cancer samples with QDNAseq_1.26.0. I noticed that after running the exact same sample through the pipeline three times, I am getting significantly different .seg/.vcf output results: one run with 20 calls, one with 28, one with 64! I think this has something to do with the segmentBins (or smoothing) functions, which rely on a random seed for parallelization? But I do not know how to confidently move to downstream analyses when the results can vary so widely. Any suggestions for how to handle this? Is there a way to run without relying on a seed?

Thanks, Sujay

HenrikBengtsson commented 3 years ago

Good questions. I think this is something that should be documented, especially which steps rely on randomization.

I don't know the answer right now. However, if you have the spare time, you could help narrow this down by adding an "RNG tracker" to your R session and then running through the QDNAseq steps again. See https://www.jottr.org/2020/09/21/detect-when-the-random-number-generator-was-used/ for how to set up such a tracker. When activated, you'll get something like:

> x <- c(1, 2, 2, 4)
> rank(x)
[1] 1.0 2.5 2.5 4.0
> rank(x, ties.method = "random")
[1] 1 3 2 4
TRACKER: .Random.seed changed
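For reference, a minimal version of such a tracker, along the lines of the one in the linked post (the exact implementation there may differ slightly), could look like this:

rng_tracker <- local({
  last <- .GlobalEnv$.Random.seed
  function(...) {
    curr <- .GlobalEnv$.Random.seed
    if (!identical(curr, last)) {
      message("TRACKER: .Random.seed changed")
      last <<- curr
    }
    TRUE  # keep the callback registered after each top-level call
  }
})
addTaskCallback(rng_tracker, name = "RNG tracker")

Once registered, the callback runs after every top-level call and reports whenever .Random.seed has changed, which reveals which pipeline steps consume random numbers.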
ilarischeinin commented 3 years ago

If my memory serves me right, it’s DNAcopy::segment().

HenrikBengtsson commented 3 years ago

I've enabled the RNG tracker and gone through the QDNAseq steps, and can confirm that the only step that relies on random-number generation (RNG) is the segmentBins() method. (This confirms what @ilarischeinin says above; segmentBins() calls DNAcopy::segment(), which relies on RNG.)

As a first step, I've added a help section 'Numerical reproducibility' to ?QDNAseq::segmentBins (https://github.com/ccagc/QDNAseq/blob/898c11471ec1102f69eeb53a83f16850f55e63d7/R/segmentBins.R#L53-L59). I've pushed this to Bioconductor devel as v1.29.4.

> ... But I do not know how to confidently move to downstream analyses when the results can vary so widely.

That's the world of statistics we live in. The authors of CBS decided that an estimator that relies on randomization was the best choice, which is not uncommon for estimators. The results are still statistically reproducible. To understand the details, I think the CBS papers and the code of the DNAcopy package are the best sources.

> Any suggestions for how to handle this? Is there a way to run without relying on a seed?

If you want numeric reproducibility, then, yes, fixing the seed upfront (only once, at the top of your script) is the way to go.
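For example, a typical QDNAseq script with a fixed seed might look like this (a sketch following the steps in the package vignette; the bin size and BAM file name are placeholders):

library(QDNAseq)

set.seed(42)  # fix the seed once, at the top of the script

bins <- getBinAnnotations(binSize = 15)                     # placeholder bin size
readCounts <- binReadCounts(bins, bamfiles = "sample.bam")  # placeholder file
readCountsFiltered <- applyFilters(readCounts)
readCountsFiltered <- estimateCorrection(readCountsFiltered)
copyNumbers <- correctBins(readCountsFiltered)
copyNumbersNormalized <- normalizeBins(copyNumbers)
copyNumbersSmooth <- smoothOutlierBins(copyNumbersNormalized)
copyNumbersSegmented <- segmentBins(copyNumbersSmooth)      # the only RNG-dependent step
copyNumbersSegmented <- normalizeSegmentedBins(copyNumbersSegmented)
copyNumbersCalled <- callBins(copyNumbersSegmented)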

If you think the variation in segmentation results is too large, you can bootstrap, say, 100 segmentations and take the "average". This will lower the variation, but there will still be randomness. How to "take the average" of many segments/calls is not obvious and probably a research project by itself. Also, note that even though the number of calls varies between runs, the differences might come from tiny segments or borderline calls, so the results might not differ that much in practice.
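As a rough way to gauge how much the randomization matters for a given sample, one could repeat the segmentation step under different seeds and compare the outcomes. A sketch (copyNumbersSmooth is the smoothed object from the script above; that segmentBins() stores its per-bin segment means in the "segmented" assayData element is an assumption based on the package's conventions):

# Sketch: repeat the RNG-dependent segmentation under different seeds
# and compare how many segments each run produces for the first sample.
n_segments <- vapply(seq_len(20), function(seed) {
  set.seed(seed)
  seg <- segmentBins(copyNumbersSmooth)
  values <- Biobase::assayDataElement(seg, "segmented")[, 1]  # assumed element name
  values <- values[!is.na(values)]
  # runs of equal segment means = segments (ignores chromosome boundaries)
  length(rle(values)$values)
}, numeric(1))
summary(n_segments)  # the spread reflects the randomization's impact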