Bounds estimation for produced photons / electrons

JelleAalbers commented 4 years ago

If you set max_sigma high enough, flamedisx will give accurate results regardless of how bad the bounds estimation is. Empirically we found that max_sigma 5 is reasonable for our current bound estimation.

It would be good to redo the derivations of the bound estimation formulas, and see if we can improve them. This would improve our speed / accuracy profile.

JelleAalbers commented 4 years ago

Specifically, the problem is estimating the number of produced [photons/electrons] N given a known detection efficiency p and number of [photons/electrons] detected, k.

There is a stackexchange post on this here. The accepted answer suggest a procedure involving the negative binomial. Unless I made a mistake while implementing it, it has unacceptably high false exclusions (low coverage) for low N and p. This is for 99% intervals:

stackoverflow

You could also imagine an oversimplified solution like this:

Start with the N for which the observed k is the most likely outcome Nm = k/p
Find the relative error on k-values, given a binomial with that N: relative_error = mu /sigma = Nm p / sqrt(Nm p (1-p)
Assume this is also the relative error of Nm; i.e. report the interval Nm * (1 +- max_sigma * relative_error) on N. The performance for 99% intervals is similar to the stackoverflow result:

oversimplified_gaussian_argument

Currently, flamedisx uses a mysterious equation here, leading to an interval of Nm +- max_sigma * (q + (q**2 + 4 * Nm * q)**0.5)/2, with q=1/p. I'd have to dig up how this was derived; likely in a handwavy / semi-empirical / let's-add-factors-of-two-until-it-works way. However, it does seem to be sufficiently conservative for low p-values:

mystery_equation

However, you can see the asymptote is no longer correct. For higher p-values the mystery interval is much too conservative -- the false exclusion rate is in the 10^-4 range around N=4, then drops even more. This makes flamedisx slower, but at least its values are not wrong.

For the moment, we could keep the mystery equation, but it is a bit embarrassing and makes things unnecessarily slow. Perhaps we could instead use one of the other motivated solutions, with an 'empirical fix' for the low p, low N regime? Finally, an option is to compute and cache exact confidence intervals using the Neyman construction in the low (N, p) regime. We'd have to do it for several p's and interpolate somehow. Fortunately, this is all in the numpy part of flamedisx, specifically the annotation stage, so we can be creative.

pelssers commented 4 years ago

We should definitely redo the derivation of the mystery formula, but there might be a simple fix in the meantime. Setting q=(1-p)/p in the mystery formula (inspired by reading the stackoverflow post you linked) instead of the original q=1/p I get the below result: mystery_equation_solved

JelleAalbers commented 3 years ago

Believe it or not, but exactly a year after I wrote the long post above, the derivation of the mystery equation has washed back up from the seas of time...

Let me try to translate, but remember it's a hand-wavy argument, though apparently somewhat successful:

A binomial(N, p) has variance / mean = (N p (1-p)) / N p = (1-p) = q
We're after the inverse problem, estimating N given an observed k. There are two issues: A. For a given N, the possible k values are spread out; B. Our best estimate of N (Nm = k/p) has an error (which we want to find out)
We know A gives a variance / mean of q, i.e. σ^2 = mean * q
To consider B, we replace mean -> mean + σ to find the +1 sigma bound to get σ^2 = (mean + σ) * q
Solving for σ, σ = 0.5 (q +- sqrt(q^2 + 4 mean q)); we can discard the - sign solution since σ must be positive.

This is indeed the 1 sigma bound used in the mystery equation, assuming I made a typo in coding up q=1/p instead of q=1-p. Why q = (1-p)/p = 1/p-1 worked better for @pelssers I have no idea; maybe q=1-p is even better ;-)

I also didn't consider that the bounds might be asymmetric. Repeating the exercise with mean -> mean - σ gives σ = 0.5 (-q +- sqrt(q^2 + 4 mean q)); clearly the + sign gives a positive σ, but one which is smaller than for the positive case.

CristianAntochi commented 3 years ago

Hi @JelleAalbers! I think I encountered a similar problem in a textbook exercise when computing the 1sigma confidence belt for a binomial p. I think it makes sense. However with the toyMCs for that exercise I too got at some point q~1/p and I think it comes from an inversion related to Nm=k/p.

FlamTeam / flamedisx

Bounds estimation for produced photons / electrons #66