Open JelleAalbers opened 4 years ago
Specifically, the problem is estimating the number of produced [photons/electrons] N
given a known detection efficiency p
and number of [photons/electrons] detected, k
.
There is a stackexchange post on this here. The accepted answer suggest a procedure involving the negative binomial. Unless I made a mistake while implementing it, it has unacceptably high false exclusions (low coverage) for low N and p. This is for 99% intervals:
You could also imagine an oversimplified solution like this:
Nm = k/p
relative_error = mu /sigma = Nm p / sqrt(Nm p (1-p)
Nm * (1 +- max_sigma * relative_error)
on N.
The performance for 99% intervals is similar to the stackoverflow result:Currently, flamedisx uses a mysterious equation here, leading to an interval of Nm +- max_sigma * (q + (q**2 + 4 * Nm * q)**0.5)/2
, with q=1/p
. I'd have to dig up how this was derived; likely in a handwavy / semi-empirical / let's-add-factors-of-two-until-it-works way. However, it does seem to be sufficiently conservative for low p-values:
However, you can see the asymptote is no longer correct. For higher p-values the mystery interval is much too conservative -- the false exclusion rate is in the 10^-4 range around N=4, then drops even more. This makes flamedisx slower, but at least its values are not wrong.
For the moment, we could keep the mystery equation, but it is a bit embarrassing and makes things unnecessarily slow. Perhaps we could instead use one of the other motivated solutions, with an 'empirical fix' for the low p, low N regime? Finally, an option is to compute and cache exact confidence intervals using the Neyman construction in the low (N, p) regime. We'd have to do it for several p's and interpolate somehow. Fortunately, this is all in the numpy part of flamedisx, specifically the annotation stage, so we can be creative.
We should definitely redo the derivation of the mystery formula, but there might be a simple fix in the meantime.
Setting q=(1-p)/p
in the mystery formula (inspired by reading the stackoverflow post you linked) instead of the original q=1/p
I get the below result:
Believe it or not, but exactly a year after I wrote the long post above, the derivation of the mystery equation has washed back up from the seas of time...
Let me try to translate, but remember it's a hand-wavy argument, though apparently somewhat successful:
σ = 0.5 (q +- sqrt(q^2 + 4 mean q))
; we can discard the - sign solution since σ must be positive.This is indeed the 1 sigma bound used in the mystery equation, assuming I made a typo in coding up q=1/p instead of q=1-p. Why q = (1-p)/p = 1/p-1 worked better for @pelssers I have no idea; maybe q=1-p is even better ;-)
I also didn't consider that the bounds might be asymmetric. Repeating the exercise with mean -> mean - σ gives σ = 0.5 (-q +- sqrt(q^2 + 4 mean q))
; clearly the + sign gives a positive σ, but one which is smaller than for the positive case.
Hi @JelleAalbers! I think I encountered a similar problem in a textbook exercise when computing the 1sigma confidence belt for a binomial p. I think it makes sense. However with the toyMCs for that exercise I too got at some point q~1/p and I think it comes from an inversion related to Nm=k/p.
If you set max_sigma high enough, flamedisx will give accurate results regardless of how bad the bounds estimation is. Empirically we found that max_sigma 5 is reasonable for our current bound estimation.
It would be good to redo the derivations of the bound estimation formulas, and see if we can improve them. This would improve our speed / accuracy profile.