Closed bvenn closed 1 year ago
Digging into negative binomial distribution implementations turned out to be a rabbit hole. Many packages, such as
R's `dnbinom`
or Python's `scipy.stats.nbinom`,
model the PMF in terms of the provided success and failure counts.
The distribution models the number of trials $x$ needed to get the $r$-th success in repeated independent Bernoulli trials. Within the first $x-1$ trials, $r-1$ successes must have occurred, which follows the standard binomial distribution. Therefore, to get the $r$-th success on the $x$-th trial, you multiply $\mathrm{Binom}(p, x-1, r-1)$ by $p$.
Therefore the PMF is the probability of $r-1$ successes in $x-1$ trials (binomial distribution), multiplied by $p$ for the final success:

$$P(X=x) = \binom{x-1}{r-1}p^{r-1}(1-p)^{(x-1)-(r-1)} \cdot p = \binom{x-1}{r-1}p^{r}(1-p)^{x-r}$$

with a support of $[r, \infty)$.
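As a sanity check, this trials-based PMF can be computed directly from the formula above; a minimal sketch using only the Python standard library (the function name is mine, not from any package):

```python
from math import comb

def negbinom_trials_pmf(x: int, r: int, p: float) -> float:
    """Trials-based negative binomial PMF:
    probability that the r-th success occurs on trial x."""
    if x < r:  # support starts at x = r
        return 0.0
    # (r-1) successes in the first (x-1) trials, times p for the final success
    return comb(x - 1, r - 1) * p**r * (1 - p) ** (x - r)
```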
What is the probability of the third success occurring on the 10th trial, given an independent per-trial success probability of 0.09?

NegBinom(x=10, r=3, p=0.09) = 0.01356
However, standard R and Python functions result in:
R: dnbinom(x=10,size=3,prob=0.09)
= 0.01873637
Python: scipy.stats.nbinom.pmf(k=10, n=3, p=0.09)
= 0.01873637
The scipy documentation states that

> The probability mass function above is defined in the “standardized” form. To shift distribution use the loc parameter. Specifically, nbinom.pmf(k, n, p, loc) is identically equivalent to nbinom.pmf(k - loc, n, p).

Here $k$ is often defined as the number of failures prior to the last success (Wikipedia top right or this online calculator). By changing the function call accordingly:
scipy.stats.nbinom.pmf(k=10, n=3, p=0.09, loc=3)
or
scipy.stats.nbinom.pmf(k=7, n=3, p=0.09) = 0.01356
the expected probability is returned.
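The relation between the two conventions is simply $k = x - r$ (failures = trials minus successes), so one PMF can be expressed through the other; a sketch assuming scipy is installed (the wrapper name `pmf_trials` is mine):

```python
from scipy.stats import nbinom

def pmf_trials(x: int, r: int, p: float) -> float:
    # Trials-based PMF expressed through scipy's failures-based one:
    # k = x - r failures occur before the r-th success.
    return nbinom.pmf(x - r, r, p)

print(nbinom.pmf(10, 3, 0.09))  # failures-based, ≈ 0.0187
print(pmf_trials(10, 3, 0.09))  # trials-based,   ≈ 0.0136
```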
With the definition given above, probabilities for $x < r$ are always zero. Therefore I would suggest parameterizing the negative binomial distribution using:
and sticking to these parameters for the PMF and CDF accordingly. Switching to the number of failures for the PMF does not make sense to me. As far as I can tell, this does not align with other implementations, so some further research has to be done to clarify the situation. However, overloads could be introduced to support both definitions; the parameter usage must be well defined.
@muehlhaus, maybe you have time to have a look at this issue
It all condenses down to the question of whether the variable $x$ of the negative binomial distribution (or Pascal distribution) should be defined as:

Both the German and English Wikipedia provide both definitions with no preference. The German article lists A first, while the English one lists B first.
I would suggest sticking to the first definition and clearly stating this in the documentation.
After some more consideration, it may be beneficial to stick with two implementations:

with the second using the first one. In rare cases, the parameterization of the distribution is done by the number of failures instead of successes. But I don't think anyone would be confused that `NegativeBinomial_failures` takes the number of failures as input and models the number of failures as the result.
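The pair of overloads could be sketched like this (hypothetical names loosely following the `NegativeBinomial_failures` naming above, written in Python rather than the library's own language, and not the actual library API):

```python
from math import comb

def negative_binomial_trials(x: int, r: int, p: float) -> float:
    """P(the r-th success occurs on trial x); support x >= r."""
    if x < r:
        return 0.0
    return comb(x - 1, r - 1) * p**r * (1 - p) ** (x - r)

def negative_binomial_failures(k: int, r: int, p: float) -> float:
    """P(k failures occur before the r-th success); support k >= 0.
    Implemented via the trials-based version, as suggested above."""
    return negative_binomial_trials(k + r, r, p)
```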