fslaborg / FSharp.Stats

statistical testing, linear algebra, machine learning, fitting and signal processing in F#
https://fslab.org/FSharp.Stats/

Add negative binomial distribution #256

Closed bvenn closed 1 year ago

bvenn commented 1 year ago

ref

bvenn commented 1 year ago

Digging into negative binomial distribution implementations turned out to be a rabbit hole. Many packages, such as R's `dnbinom` or Python's `scipy.stats.nbinom`, model the PMF in terms of provided success and failure counts.

Negative Binomial distribution

The distribution models the number of trials x needed to get the rth success in repeated independent Bernoulli trials. Up to and including the (x-1)th trial, exactly (r-1) successes must have occurred, which is described by the standard binomial distribution. To get the rth success on the xth trial, this binomial probability Binom(p, x-1, r-1) is multiplied by p. The PMF therefore is:

NegBinom(x; r, p) = C(x-1, r-1) · p^r · (1-p)^(x-r)

with a support of [r, infinity).

Example

What is the probability of the third success occurring at the 10th trial, given an independent per-trial success probability of 0.09?

NegBinom(10; 3, 0.09) = 0.01356

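The trials-based PMF above can be sketched in a few lines of Python (a minimal illustration using only the standard library, not the FSharp.Stats implementation):

```python
# Trials-based negative binomial PMF:
# NegBinom(x; r, p) = C(x-1, r-1) * p^r * (1-p)^(x-r)
from math import comb

def negbinom_trials_pmf(x: int, r: int, p: float) -> float:
    """Probability that the r-th success occurs on trial x."""
    if x < r:
        return 0.0  # fewer trials than successes is impossible
    return comb(x - 1, r - 1) * p**r * (1 - p) ** (x - r)

print(round(negbinom_trials_pmf(10, 3, 0.09), 5))  # 0.01356
```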

However, the standard R and Python functions return a different result:

- R: `dnbinom(x=10, size=3, prob=0.09)` = 0.01873637
- Python: `scipy.stats.nbinom.pmf(k=10, n=3, p=0.09)` = 0.01873637

The scipy documentation states:

> The probability mass function above is defined in the “standardized” form. To shift distribution use the loc parameter. Specifically, nbinom.pmf(k, n, p, loc) is identically equivalent to nbinom.pmf(k - loc, n, p).

`k` is often defined as the number of failures prior to the last success (Wikipedia top right, or this online calculator). Changing the function call accordingly:

`scipy.stats.nbinom.pmf(k=10, n=3, p=0.09, loc=3)` or `scipy.stats.nbinom.pmf(k=7, n=3, p=0.09)` = 0.01356

the expected probability is returned.
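The relation between the two conventions is just the shift k = x - r. A small standard-library sketch (no scipy required; function names are illustrative) makes the equivalence explicit:

```python
# The two negative binomial parameterizations and the k = x - r shift.
from math import comb

def pmf_trials(x, r, p):
    # P(r-th success occurs on trial x); zero for x < r
    return comb(x - 1, r - 1) * p**r * (1 - p) ** (x - r) if x >= r else 0.0

def pmf_failures(k, r, p):
    # P(k failures before the r-th success) -- scipy/R convention
    return comb(k + r - 1, k) * p**r * (1 - p) ** k

x, r, p = 10, 3, 0.09
# The failures-based PMF at k = x - r equals the trials-based PMF at x:
print(round(pmf_failures(x - r, r, p), 5))  # 0.01356
# Evaluating the failures-based PMF at k = 10 instead gives the
# dnbinom / nbinom.pmf result from above:
print(round(pmf_failures(x, r, p), 6))  # 0.018736
```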

Conclusion

With the definition given above, probabilities cannot be greater than 0 when x < r. Therefore I would suggest parameterizing the negative binomial distribution using:

- r: the number of successes
- p: the probability of success in each trial

and sticking to these parameters for the PMF and CDF accordingly. Switching to the number of failures for the determination of the PMF does not make sense to me; as far as I can tell, it does not align with other implementations, so some further research has to be done to clarify the situation. However, overloads could be introduced to support both definitions. The parameter usage must be well defined.

References for X = number of trials:

References for X = number of failures:

References that support both definitions:

@muehlhaus, maybe you have time to have a look at this issue

bvenn commented 1 year ago

It all condenses down to the question of whether the variable x of the negative binomial distribution (or Pascal distribution) should be defined as:

- (A) the number of trials needed to get the rth success, or
- (B) the number of failures before the rth success.

Both the German and the English Wikipedia articles provide both definitions with no preference; the German article lists A first, while the English article lists B first.

I would suggest sticking to the first definition and clearly stating this fact in the documentation.

bvenn commented 1 year ago

After some more consideration, it may be beneficial to provide two implementations:

- one that models the number of trials, and
- one that models the number of failures,

with the second using the first one. In rare cases the distribution is parameterized by the number of failures instead of successes, but I don't think anyone would be confused that NegativeBinomial_failures takes the failures as input and models the number of failures as result.
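The proposed layering, where the failures-based variant delegates to the trials-based one, can be sketched like this (a hypothetical illustration in Python; the names mirror the suggestion above and are not the actual FSharp.Stats API):

```python
# Sketch: failures-based PMF implemented on top of the trials-based PMF.
from math import comb

def negbinom_trials_pmf(x, r, p):
    """P(r-th success occurs on trial x); zero for x < r."""
    if x < r:
        return 0.0
    return comb(x - 1, r - 1) * p**r * (1 - p) ** (x - r)

def negbinom_failures_pmf(k, r, p):
    """P(k failures before the r-th success): trials PMF at x = k + r."""
    return negbinom_trials_pmf(k + r, r, p)

# Both calls describe the same event (3rd success on the 10th trial):
# negbinom_failures_pmf(7, 3, 0.09) == negbinom_trials_pmf(10, 3, 0.09)
```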