UCL / STIR

Software for Tomographic Image Reconstruction
http://stir.sourceforge.net/

inconsistency between Poisson loglikelihood value, gradient (when using multiplicative factors) and Hessian due to thresholding #1472

Open KrisThielemans opened 1 month ago

KrisThielemans commented 1 month ago

Continuing from #1461...

We attempt to avoid zero or negatives in the log (or division by zero or negatives in the gradient) by thresholding the quotient. Unfortunately, the code is not consistent, even after many attempts.

First, some notation: $T(e, y) = \max(e, y/Q)$ (i.e. a lower threshold on $e$), and the (full) forward projection $e = P\lambda+b$.

Formulas

A strategy could be

$\log L = \sum_i y_i \log ( T(e_i,y_i) ) - T(e_i,y_i)$ [1]

This function has a gradient w.r.t. $\lambda$ (with some element-wise multiplications etc., and ignoring the non-differentiable points)

$P^T \left[ {\partial T \over \partial e}(e, y) \left( {y \over T(e,y)} - 1 \right) \right]$ [2]

with

${\partial T \over \partial e}(e,y) = 1$ if $e \ge y/Q$ and $0$ otherwise [3]

This gradient is equivalent to back-projecting

e >= y/Q ? y/e - 1 : 0 [4]
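To make this concrete, here is a minimal per-bin sketch of [1]–[4] (plain C++ with hypothetical function names, not the STIR API; assumes $y > 0$, as $y = 0$ is handled separately, see below):

```cpp
#include <algorithm>
#include <cmath>

// T(e, y) = max(e, y/Q): lower threshold on e.
inline float T(float e, float y, float Q)
{
  return std::max(e, y / Q);
}

// Per-bin contribution to the log-likelihood value [1].
inline float logL_term(float e, float y, float Q)
{
  const float t = T(e, y, Q);
  return y * std::log(t) - t;
}

// Per-bin factor back-projected for the gradient [2]-[4]
// (ignoring the non-differentiable point e == y/Q).
inline float gradient_term(float e, float y, float Q)
{
  return e >= y / Q ? y / e - 1.f : 0.f;
}
```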

Log-likelihood code

The function [1] is what PoissonLogLikelihoodWithLinearModelForMeanAndProjData computes for the value, via

https://github.com/UCL/STIR/blob/12bfa873522653936a4818e9bad9b9da41f11706/src/buildblock/recon_array_functions.cxx#L367-L371

where the arguments are computed in

https://github.com/UCL/STIR/blob/12bfa873522653936a4818e9bad9b9da41f11706/src/recon_buildblock/PoissonLogLikelihoodWithLinearModelForMeanAndProjData.cxx#L1451-L1467

Gradient code

For the gradient we use the optimisation that any multiplicative factors drop out of the quotient.

https://github.com/UCL/STIR/blob/12bfa873522653936a4818e9bad9b9da41f11706/src/recon_buildblock/PoissonLogLikelihoodWithLinearModelForMeanAndProjData.cxx#L1407-L1433

Unfortunately, the divide_and_truncate method (which implements the division [4]) does not know about the multiplicative factors, and is therefore using a different threshold than the logL value.

Gradient formulas

Writing the full forward projection as $e = P\lambda + b = m (G\lambda+a)$, and the forward projection without $m$ as $f = G\lambda+a$, the current gradient computation is

G.back_project((f >= y/Q ? y/f : 0) - m)

This is equivalent (using $P^T v = G^T(m\,v)$, $f \ge y/Q \Leftrightarrow e \ge m\,y/Q$ and $y/f = m\,y/e$) to

P.back_project((e >= m*y/Q ? y/e : 0) - 1)
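A small numeric sketch of the mismatch (illustrative values; $Q = 10000$ is assumed here for concreteness):

```cpp
#include <iostream>

int main()
{
  const float Q = 10000.f; // quotient threshold (illustrative value)
  const float m = 0.01f;   // multiplicative factor (attenuation * efficiencies)
  const float y = 50.f;    // measured counts in this bin
  const float f = 0.01f;   // forward projection without m: G*lambda + a
  const float e = m * f;   // full forward projection = 1e-4

  // The logL value thresholds on e >= y/Q, i.e. e >= 5e-3: this bin is clamped.
  std::cout << "value clamped:    " << (e < y / Q) << '\n'; // prints 1
  // The gradient thresholds on f >= y/Q, i.e. e >= m*y/Q = 5e-5: not clamped.
  std::cout << "gradient clamped: " << (f < y / Q) << '\n'; // prints 0
}
```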

Conclusion

There are 2 inconsistencies:

KrisThielemans commented 1 month ago

Note that in most cases, $m$ is the product of attenuation factors (of the order 0.001 to 1) and "efficiencies", which for most scanners (without calibration) are around 1, so $m$ is likely less than 1. It also means that the threshold (on $e$) in the gradient calculation is currently lower than $y/10000$, and will probably never be reached, which is good. (The logL calculation will then not correspond to the gradient for slightly larger $e$, but these contributions are still small.) However, as soon as we'd include duration, this won't be the case (nor with calibration factors, for which I'm not sure what values they'd have).

Side note: the code has separate handling for the $y=0$ case (or actually, y <= max(viewgram) * 0.000001), in which case the log term gets dropped, as required.

Current impact estimation:

Nevertheless, these thresholds should be consistent, and in any case are probably surprising to most people (as not documented). Also, the current gradient strategy is sensitive to the scale of $m$, which is dangerous.

KrisThielemans commented 1 month ago

After discussion with @mehrhardt, the proposed strategy in the first instance is

Things to consider:

KrisThielemans commented 1 month ago

There's another possible choice for the objective function:

$\log L = \sum_i y_i \log ( T(e_i,y_i) ) - e_i$ [5]

This choice has a few advantages (and is also preferred by @mehrhardt):

After more discussions with @NicoleJurjew, two other things came up:

Proposal

For $f= G\lambda+a$, implement

$\log L = \sum_i y_i \log ( \max(f_i,\epsilon) ) - m_i f_i$ [6]

with $\epsilon$ a user parameter, hopefully defaulting to 0. Note that this will NOT be backwards compatible for ProjData with either the current log-likelihood or the gradient computation, whenever the thresholds are reached (but this should happen really infrequently). Note that even without thresholds, [5] differs from [1] by a global constant (depending on $m$), but it is compatible with the list-mode data objective function value.
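A minimal per-bin sketch of [6] (hypothetical helper names, not the eventual STIR interface):

```cpp
#include <algorithm>
#include <cmath>

// Per-bin contribution to the value [6]: y*log(max(f, eps)) - m*f.
inline float logL_term(float f, float y, float m, float eps)
{
  return y * std::log(std::max(f, eps)) - m * f;
}

// Per-bin factor back-projected with G for the gradient of [6]
// (ignoring the non-differentiable point f == eps). The -m is
// unconditional, matching the structure of the current gradient code.
inline float gradient_term(float f, float y, float m, float eps)
{
  return (f >= eps ? y / f : 0.f) - m;
}
```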

@mehrhardt @gschramm any comments?

KrisThielemans commented 1 month ago

Other things noted while investigating all this:

gschramm commented 1 month ago

To me, [5] / [6] with a default eps = 0 feels most "natural".

How often does the "log(0)" problem occur in real PET/SPECT systems, where there should always be a finite amount of contamination (except for SPECT scans without scatter)?

KrisThielemans commented 1 month ago

> How often does the "log(0)" problem occur in real PET/SPECT systems, where there should always be a finite amount of contamination (except for SPECT scans without scatter)?

In "traditional" algorithms without precorrections, and the background model is somewhat accurate, then I don't think it should ever occur. However, there are a few cases where it could occur

The challenge for STIR is that most people expect it to "just work" even if they throw weird stuff at it...

gschramm commented 1 month ago

I see. I think some negative pixels, as long as the forward model stays positive, are ok.

Wouldn't it make more sense to return infinity as soon as a single bin in the expectation is <= 0? Of course that would require checks slowing things down and we would lose differentiability.

KrisThielemans commented 1 month ago

> I think some negative pixels, as long as the forward model stays positive, are ok.

I don't disagree, but that's a different optimisation problem from what most people will expect.

> Wouldn't it make more sense to return infinity as soon as a single bin in the expectation is <= 0? Of course that would require checks slowing things down and we would lose differentiability.

Not all architectures can represent infinity (although these days that's possibly less of a worry, as the IEEE floating-point standard is now very widely adopted, and I see that even CUDA represents infinity).

Clearly, the above thresholding strategies all require checks and will slow things down, and indeed the proposed function is not differentiable. An alternative is to just not do any checks, and let the result be undefined. In my experience, that throws up problems very quickly though.

KrisThielemans commented 1 month ago

Summarising the proposal: for a user-configurable eps (a number)

All of these can be computed by summing over list-mode events, aside from the "approximate" version, which is essentially not available (unless you histogram first).

There's a corner case where $y_i>0$ and $f_i \le \epsilon$: the result is undefined if $\epsilon \le 0$, although IEEE-fp-compatible architectures will give infinity for the value, gradient and Hessian when $\epsilon = 0$.

Obviously, once infinity is returned, ugly things will happen. The only generic solution for that is to set eps>0.
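For reference, a quick check (plain C++, relying on IEEE semantics) of what such a bin returns when eps = 0:

```cpp
#include <cmath>
#include <iostream>

int main()
{
  const float y = 10.f, f = 0.f; // y_i > 0, f_i == 0, eps == 0
  std::cout << y * std::log(f) - f << '\n'; // value term:            -inf
  std::cout << y / f - 1.f << '\n';         // gradient factor:        inf
  std::cout << -y / (f * f) << '\n';        // Hessian diagonal term: -inf
}
```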

KrisThielemans commented 1 month ago

Actually, I see now that eps=0 has essentially no benefit, except that it changes behaviour when $f_i<0$ (which will not happen in most circumstances). I suppose this means we should only do the checks on f when eps>0 (or when verbosity>0).

gschramm commented 1 month ago

@KrisThielemans another possibility that came to my mind yesterday.

$$\log L = \sum_i (y_i + \epsilon) \log (e_i + \epsilon) - (e_i + \epsilon)$$

with $\epsilon$ a small positive number.
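The corresponding gradient (derived element-wise from the above, with $e = P\lambda + b$ as before) would be

$$\nabla_\lambda \log L = P^T \left( {y + \epsilon \over e + \epsilon} - 1 \right),$$

which stays finite for any $e \ge 0$ (given $\epsilon > 0$), without any thresholding.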

I think the only problem is that this cannot be evaluated in listmode :-(

KrisThielemans commented 1 month ago

> I think the only problem is that this cannot be evaluated in listmode :-(

yeah... I prefer a strategy where the projection-data and list-mode results are identical, at least in principle. It simplifies testing as well!