barricklab / breseq

breseq is a computational pipeline for finding mutations relative to a reference sequence in short-read DNA resequencing data. It is intended for haploid microbial genomes (<20 Mb). breseq is a command line tool implemented in C++ and R.
http://barricklab.org/breseq
GNU General Public License v2.0
137 stars 21 forks source link

Use of term "dispersion" in coverage summary reporting is confusing #325

Closed rohanmaddamsetti closed 1 year ago

rohanmaddamsetti commented 1 year ago

Hi Jeff,

The use of "dispersion" and its definition as Var(coverage)/Mean(coverage) in the summary output file is correct-- but the R documentation in running "?qnbinom" and the wikipedia article in the negative binomial article (https://en.wikipedia.org/wiki/Negative_binomial_distribution#Alternative_parameterizations) both use "dispersion" to define a very different quantity: size (dispersion) = Mean(X)^2/(Var(X) - Mean(X)).

I was wondering if you could consider using a different term like "relative variance" in the breseq output to avoid this ambiguity: https://en.wikipedia.org/wiki/Index_of_dispersion

I was having trouble reproducing the negative binomial coverage distribution in R, because it's tempting to incorrectly write:

rnbinom(n=genome.length, size=breseq.dispersion, mu=breseq.mean)

because R documentation says that "size" is the "dispersion parameter" in one of the many(!) parameterizations of the negative binomial. Wikipedia also describes this usage. But the proper parameterization in R requires the following transformation:

rnbinom(n=genome.length, size=(breseq.mean^2)/(breseq.mean*breseq.dispersion-breseq.mean), mu=breseq.mean)

I figured this out after a lot of head-scratching, but I imagine the use of "dispersion" to indicate different quantities related to the negative binomial distribution, depending on author/source, may trip others up as well. (I'm hand-rolling some CNV analyses for my data, which is why I'm digging into this stuff).

Thanks, Rohan

jeffreybarrick commented 1 year ago

Hi Rohan,

Sure, that is better terminology and might save someone else from re-figuring this out. I changed dispersion to relative variance throughout the code. That will be incorporated in the next release (>v0.37.1).