Closed bvenn closed 1 year ago
Digging into negative binomial distribution implementations turned out to be a rabbit hole. Many packages, such as
R's `dnbinom`
or Python's `scipy.stats.nbinom`,
model the PMF in terms of the provided success and failure counts.
The distribution models the number of trials $x$ needed to get the $r$-th success in repeated independent Bernoulli trials. Within the first $x-1$ trials, $r-1$ successes must have occurred, which follows the standard binomial distribution. Therefore, to get the $r$-th success on the $x$-th trial, you multiply $\mathrm{Binom}(p, x-1, r-1)$ by $p$.
Therefore the PMF is the probability of $r-1$ successes in $x-1$ trials (binomial distribution), multiplied by $p$ for the final success:

$$P(X=x) = \binom{x-1}{r-1}p^{r-1}(1-p)^{(x-1)-(r-1)} \cdot p = \binom{x-1}{r-1}p^{r}(1-p)^{x-r}$$

with a support of $[r, \infty)$.
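As a sanity check, this trials-based PMF can be computed directly from the formula above; a minimal sketch using only the Python standard library (the function name is mine, not from any package):

```python
from math import comb

def negbinom_trials_pmf(x: int, r: int, p: float) -> float:
    """Trials-based negative binomial PMF:
    probability that the r-th success occurs on trial x."""
    if x < r:  # support starts at x = r
        return 0.0
    # (r-1) successes in the first (x-1) trials, times p for the final success
    return comb(x - 1, r - 1) * p**r * (1 - p) ** (x - r)
```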
What is the probability of the third success occurring on the 10th trial, given an independent per-trial success probability of 0.09?

NegBinom(x=10, r=3, p=0.09) = 0.01356
However, standard R and Python functions result in:
R: dnbinom(x=10,size=3,prob=0.09)
= 0.01873637
Python: scipy.stats.nbinom.pmf(k=10, n=3, p=0.09)
= 0.01873637
The scipy documentation states that

> The probability mass function above is defined in the “standardized” form. To shift distribution use the loc parameter. Specifically, nbinom.pmf(k, n, p, loc) is identically equivalent to nbinom.pmf(k - loc, n, p).

Here $k$ is often defined as the number of failures prior to the last success (Wikipedia top right or this online calculator). By changing the function call accordingly:
scipy.stats.nbinom.pmf(k=10, n=3, p=0.09, loc=3)
or
scipy.stats.nbinom.pmf(k=7, n=3, p=0.09) = 0.01356
the expected probability is returned.
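The relation between the two conventions is simply $k = x - r$ (failures = trials minus successes), so one PMF can be expressed through the other; a sketch assuming scipy is installed (the wrapper name `pmf_trials` is mine):

```python
from scipy.stats import nbinom

def pmf_trials(x: int, r: int, p: float) -> float:
    # Trials-based PMF expressed through scipy's failures-based one:
    # k = x - r failures occur before the r-th success.
    return nbinom.pmf(x - r, r, p)

print(nbinom.pmf(10, 3, 0.09))  # failures-based, ≈ 0.0187
print(pmf_trials(10, 3, 0.09))  # trials-based,   ≈ 0.0136
```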
With the definition given above, probabilities for $x < r$ are always zero. Therefore I would suggest parameterizing the negative binomial distribution using:
and sticking to these parameters for the PMF and CDF accordingly. Switching to the number of failures for the PMF does not make sense to me. As far as I can tell, this does not align with other implementations, so some further research has to be done to clarify the situation. However, overloads could be introduced to support both definitions; the parameter usage must be well defined.
@muehlhaus, maybe you have time to have a look at this issue
It all condenses down to the question of whether the variable $x$ of the negative binomial distribution (or Pascal distribution) should be defined as:

Both the German and English Wikipedia provide both definitions with no preference. The German article lists A first, while the English one lists B first.
I would suggest sticking to the first definition and clearly stating this in the documentation.
After some more consideration, it may be beneficial to stick with two implementations:

with the second using the first one. In rare cases, the parameterization of the distribution is done by the number of failures instead of successes. But I don't think anyone would be confused that `NegativeBinomial_failures` takes the number of failures as input and models the number of failures as the result.
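The pair of overloads could be sketched like this (hypothetical names loosely following the `NegativeBinomial_failures` naming above, written in Python rather than the library's own language, and not the actual library API):

```python
from math import comb

def negative_binomial_trials(x: int, r: int, p: float) -> float:
    """P(the r-th success occurs on trial x); support x >= r."""
    if x < r:
        return 0.0
    return comb(x - 1, r - 1) * p**r * (1 - p) ** (x - r)

def negative_binomial_failures(k: int, r: int, p: float) -> float:
    """P(k failures occur before the r-th success); support k >= 0.
    Implemented via the trials-based version, as suggested above."""
    return negative_binomial_trials(k + r, r, p)
```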