UCL / STIR

Software for Tomographic Image Reconstruction
http://stir.sourceforge.net/

inconsistency between Poisson loglikelihood value, gradient (when using multiplicative factors) and Hessian due to thresholding #1472

Open KrisThielemans opened 1 month ago

KrisThielemans commented 1 month ago

Continuing from #1461...

We attempt to avoid zero or negatives in the log (or division by zero or negatives in the gradient) by thresholding the quotient. Unfortunately, the code is not consistent, even after many attempts.

First, some notation: $T(e, y) = \max(e, y/Q)$ (i.e. a lower threshold on $e$), and the (full) forward projection $e = P\lambda+b$.

Formulas

A strategy could be

$\log L = \sum_i y_i \log ( T(e_i,y_i) ) - T(e_i,y_i)$ [1]

This function has a gradient w.r.t. $\lambda$ (with some element-wise multiplications etc., and ignoring the non-differentiable points)

$P^T \left[ {\partial T \over \partial e}(e, y) \left( {y \over T(e,y)} - 1 \right) \right]$ [2]

with

${\partial T \over \partial e}(e,y) = 1$ if $e \ge y/Q$ and $0$ otherwise [3]

This gradient is equivalent to back-projecting

e >= y/Q ? y/e - 1 : 0 [4]
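To make this concrete, here is a minimal per-bin sketch of [1]–[4] (plain C++ with hypothetical function names, not the STIR API; assumes $y > 0$, as $y = 0$ is handled separately, see below):

```cpp
#include <algorithm>
#include <cmath>

// T(e, y) = max(e, y/Q): lower threshold on e.
inline float T(float e, float y, float Q)
{
  return std::max(e, y / Q);
}

// Per-bin contribution to the log-likelihood value [1].
inline float logL_term(float e, float y, float Q)
{
  const float t = T(e, y, Q);
  return y * std::log(t) - t;
}

// Per-bin factor back-projected for the gradient [2]-[4]
// (ignoring the non-differentiable point e == y/Q).
inline float gradient_term(float e, float y, float Q)
{
  return e >= y / Q ? y / e - 1.f : 0.f;
}
```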

Log-likelihood code

The function [1] is what PoissonLogLikelihoodWithLinearModelForMeanAndProjData computes for the value, via

https://github.com/UCL/STIR/blob/12bfa873522653936a4818e9bad9b9da41f11706/src/buildblock/recon_array_functions.cxx#L367-L371

where the arguments are computed in

https://github.com/UCL/STIR/blob/12bfa873522653936a4818e9bad9b9da41f11706/src/recon_buildblock/PoissonLogLikelihoodWithLinearModelForMeanAndProjData.cxx#L1451-L1467

Gradient code

For the gradient we use the optimisation that any multiplicative factors drop out of the quotient.

https://github.com/UCL/STIR/blob/12bfa873522653936a4818e9bad9b9da41f11706/src/recon_buildblock/PoissonLogLikelihoodWithLinearModelForMeanAndProjData.cxx#L1407-L1433

Unfortunately, the divide_and_truncate method (which implements the division [4]) does not know about the multiplicative factors, and is therefore using a different threshold than the logL value.

Gradient formulas

Writing the full forward projection as $e = P\lambda + b = m (G\lambda+a)$, and the forward projection without $m$ as $f = G\lambda+a$, the current gradient computation is

G.back_project((f >= y/Q ? y/f : 0) - m)

This is equivalent (using $P^T v = G^T(m\,v)$, $f \ge y/Q \Leftrightarrow e \ge m\,y/Q$ and $y/f = m\,y/e$) to

P.back_project((e >= m*y/Q ? y/e : 0) - 1)
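A small numeric sketch of the mismatch (illustrative values; $Q = 10000$ is assumed here for concreteness):

```cpp
#include <iostream>

int main()
{
  const float Q = 10000.f; // quotient threshold (illustrative value)
  const float m = 0.01f;   // multiplicative factor (attenuation * efficiencies)
  const float y = 50.f;    // measured counts in this bin
  const float f = 0.01f;   // forward projection without m: G*lambda + a
  const float e = m * f;   // full forward projection = 1e-4

  // The logL value thresholds on e >= y/Q, i.e. e >= 5e-3: this bin is clamped.
  std::cout << "value clamped:    " << (e < y / Q) << '\n'; // prints 1
  // The gradient thresholds on f >= y/Q, i.e. e >= m*y/Q = 5e-5: not clamped.
  std::cout << "gradient clamped: " << (f < y / Q) << '\n'; // prints 0
}
```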

Conclusion

There are 2 inconsistencies:

KrisThielemans commented 1 month ago

Note that in most cases, $m$ is the product of attenuation factors (of the order 0.001 to 1) and "efficiencies", which for most scanners (without calibration) are around 1, so $m$ is likely less than 1. It also means that the threshold (on $e$) in the gradient calculation is currently lower than $y/10000$, and will probably never be reached, which is good. (The logL calculation will then not correspond to the gradient for slightly larger $e$, but these contributions are still small.) However, as soon as we'd include duration, this won't be the case (nor with calibration factors, for which I'm not sure what values they'd have).

Side note: the code has separate handling for the $y=0$ case (or actually, y <= max(viewgram) * 0.000001), in which case the log term gets dropped, as required.

Current impact estimation:

Nevertheless, these thresholds should be consistent, and in any case are probably surprising to most people (as not documented). Also, the current gradient strategy is sensitive to the scale of $m$, which is dangerous.

KrisThielemans commented 1 month ago

After discussion with @mehrhardt, the proposed strategy in the first instance is

Things to consider:

KrisThielemans commented 1 month ago

There's another possible choice for the objective function:

$\log L = \sum_i y_i \log ( T(e_i,y_i) ) - e_i$ [5]

This choice has a few advantages (and is also preferred by @mehrhardt):

After more discussions with @NicoleJurjew, two other things came up:

Proposal

For $f= G\lambda+a$, implement

$\log L = \sum_i y_i \log ( \max(f_i,\epsilon) ) - m_i f_i$ [6]

with $\epsilon$ a user parameter, hopefully defaulting to 0. Note that this will NOT be backwards compatible for ProjData with either the current log-likelihood or the gradient computation, whenever the thresholds are reached (but this should happen really infrequently). Note that even without thresholds, [5] differs from [1] by a global constant (depending on $m$), but it is compatible with the list-mode data objective function value.
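A minimal per-bin sketch of [6] (hypothetical helper names, not the eventual STIR interface):

```cpp
#include <algorithm>
#include <cmath>

// Per-bin contribution to the value [6]: y*log(max(f, eps)) - m*f.
inline float logL_term(float f, float y, float m, float eps)
{
  return y * std::log(std::max(f, eps)) - m * f;
}

// Per-bin factor back-projected with G for the gradient of [6]
// (ignoring the non-differentiable point f == eps). The -m is
// unconditional, matching the structure of the current gradient code.
inline float gradient_term(float f, float y, float m, float eps)
{
  return (f >= eps ? y / f : 0.f) - m;
}
```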

@mehrhardt @gschramm any comments?

KrisThielemans commented 1 month ago

Other things noted while investigating all this:

gschramm commented 1 month ago

To me, [5] / [6] with a default eps = 0 feels most "natural".

How often does the "log(0)" problem occur in real PET/SPECT systems, where there should always be a finite amount of contamination (except for SPECT scans without scatter)?

KrisThielemans commented 1 month ago

> How often does the "log(0)" problem occur in real PET/SPECT systems, where there should always be a finite amount of contamination (except for SPECT scans without scatter)?

In "traditional" algorithms without precorrections, and the background model is somewhat accurate, then I don't think it should ever occur. However, there are a few cases where it could occur

The challenge for STIR is that most people expect it to "just work" even if they throw weird stuff at it...

gschramm commented 1 month ago

I see. I think some negative pixels, as long as the forward model stays positive, are ok.

Wouldn't it make more sense to return infinity as soon as a single bin in the expectation is <= 0? Of course that would require checks slowing things down and we would lose differentiability.

KrisThielemans commented 1 month ago

> I think some negative pixels, as long as the forward model stays positive, are ok.

I don't disagree, but that's a different optimisation problem from what most people will expect.

> Wouldn't it make more sense to return infinity as soon as a single bin in the expectation is <= 0? Of course that would require checks slowing things down and we would lose differentiability.

Not all architectures can represent infinity (although these days that's possibly less of a worry, as the IEEE floating-point standard is now very widely adopted, and I see that even CUDA represents infinity).

Clearly, the above thresholding strategies all require checks and will slow things down, and indeed the proposed function is not differentiable. An alternative is to just not do any checks, and let the result be undefined. In my experience, that throws up problems very quickly though.

KrisThielemans commented 1 month ago

Summarising the proposal: for a user-configurable eps (a number)

All of these can be computed by summing over list-mode events, aside from the "approximate" version, which is essentially not available (unless you histogram first).

There's a corner case where $y_i>0$ and $f_i \le \epsilon$: the result is undefined if $\epsilon \le 0$, although IEEE-fp-compatible architectures will give infinity for the value, gradient and Hessian when $\epsilon = 0$.

Obviously, once infinity is returned, ugly things will happen. The only generic solution for that is to set eps>0.
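For reference, a quick check (plain C++, relying on IEEE semantics) of what such a bin returns when eps = 0:

```cpp
#include <cmath>
#include <iostream>

int main()
{
  const float y = 10.f, f = 0.f; // y_i > 0, f_i == 0, eps == 0
  std::cout << y * std::log(f) - f << '\n'; // value term:            -inf
  std::cout << y / f - 1.f << '\n';         // gradient factor:        inf
  std::cout << -y / (f * f) << '\n';        // Hessian diagonal term: -inf
}
```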

KrisThielemans commented 1 month ago

Actually, I see now that eps=0 has essentially no benefit, except that it changes behaviour when $f_i<0$ (which will not happen in most circumstances). I suppose this means we should only do the checks on f when eps>0 (or when verbosity>0).

gschramm commented 1 month ago

@KrisThielemans another possibility that came to my mind yesterday.

$$\log L = \sum_i (y_i + \epsilon) \log (e_i + \epsilon) - (e_i + \epsilon)$$

with $\epsilon$ a small positive number.
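The corresponding gradient (derived element-wise from the above, with $e = P\lambda + b$ as before) would be

$$\nabla_\lambda \log L = P^T \left( {y + \epsilon \over e + \epsilon} - 1 \right),$$

which stays finite for any $e \ge 0$ (given $\epsilon > 0$), without any thresholding.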

I think the only problem is that this cannot be evaluated in listmode :-(

KrisThielemans commented 1 month ago

> I think the only problem is that this cannot be evaluated in listmode :-(

yeah... I prefer a strategy where the projection-data and list-mode results are identical, at least in principle. It simplifies testing as well!