Closed: KrisThielemans closed this 1 month ago
Step 2 committed, but a test is currently failing. It'll be for tomorrow.
@NikEfth Steps 1&2 completed. The variable naming is a bit awkward, but it minimises text changes. Best to compare while ignoring white-space diffs: https://github.com/UCL/STIR/pull/1418/files?diff=unified&w=1
Preparation is nearly done. Results for the gradient are still identical. I plan to template the loop function, and then call it from both the cached and non-cached code paths.
The Hessian calculation should now be functional, but it is untested ATM.
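For context, here is my summary of what the Hessian-times-input computes; this is standard Poisson log-likelihood algebra, and the notation is mine rather than the PR's. With measured counts $y$, system matrix $P$ and additive term $a$,

$$
L(\lambda) = \sum_i \bigl( y_i \log \bar y_i - \bar y_i \bigr), \qquad \bar y = P\lambda + a, \qquad H x = -P^\top \operatorname{diag}\!\left( \frac{y}{\bar y^{\,2}} \right) P x,
$$

so $x^\top H x \le 0$ for any image $x$: this is the concavity property that the test further down in this thread checks.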
@gschramm conducted some timings via SIRF (but that shouldn't matter). Original post at https://github.com/SyneRBI/SIRF/pull/1253#issuecomment-2107661865
Timing results of sino/LM gradient and Hessian-times-input computations for different numbers of subsets
mMR LM file with 4,332,816 counts,
acq_model = sirf.STIR.AcquisitionModelUsingRayTracingMatrix()
with 1 tangential LOR, and sirf.STIR.AcquisitionData.set_storage_scheme("memory").
7 subsets
In [14]: %timeit g = obj_fun.gradient(initial_image, subset = 0)
4.19 s ± 36.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [15]: %timeit hess_out_img = obj_fun.accumulate_Hessian_times_input(current_estimate, input_img, subset=0)
4.35 s ± 85.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [18]: %timeit g = lm_obj_fun.gradient(initial_image, subset=0)
4.18 s ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [19]: %timeit hess_out_img_lm = lm_obj_fun.accumulate_Hessian_times_input(current_estimate, input_img, subset=0)
4.13 s ± 46.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
14 subsets
In [22]: %timeit g = obj_fun.gradient(initial_image, subset = 0)
2.75 s ± 45.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [23]: %timeit hess_out_img = obj_fun.accumulate_Hessian_times_input(current_estimate, input_img, subset=0)
3.07 s ± 39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [26]: %timeit g = lm_obj_fun.gradient(initial_image, subset=0)
4.67 s ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [27]: %timeit hess_out_img_lm = lm_obj_fun.accumulate_Hessian_times_input(current_estimate, input_img, subset=0)
4.89 s ± 68.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
21 subsets
In [32]: %timeit g = obj_fun.gradient(initial_image, subset = 0)
2.35 s ± 37.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [33]: %timeit hess_out_img = obj_fun.accumulate_Hessian_times_input(current_estimate, input_img, subset=0)
2.69 s ± 58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [34]: %timeit g = lm_obj_fun.gradient(initial_image, subset=0)
6.35 s ± 199 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [35]: %timeit g = lm_obj_fun.gradient(initial_image, subset=0)
6.14 s ± 53.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
CPU info
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7662 64-Core Processor
Stepping: 0
Frequency boost: enabled
CPU MHz: 1499.555
CPU max MHz: 2000.0000
CPU min MHz: 1500.0000
BogoMIPS: 4000.00
Virtualization: AMD-V
L1d cache: 4 MiB
L1i cache: 4 MiB
L2 cache: 64 MiB
L3 cache: 512 MiB
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255
@gschramm In [35] is still a gradient?
My interpretation: this accumulation step in LM_distributable_computation is a significant cost:
if (output_image_ptr != NULL)
{
  // serial accumulation of the per-thread partial images into the output image
  for (int i = 0; i < static_cast<int>(local_output_image_sptrs.size()); ++i)
    if (!is_null_ptr(local_output_image_sptrs[i])) // only accumulate if a thread filled something in
      *output_image_ptr += *(local_output_image_sptrs[i]);
}
but the cost of this accumulation doesn't really depend on the number of subsets, which would explain why the LM times don't scale down with more subsets. (I guess we could use a reduction operation there; see the sketch below.)
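For illustration, a minimal sketch of the kind of reduction I mean (function name and the flat-array view of the image are my assumptions, not STIR code): instead of summing the per-thread images one after another, parallelise over voxels, so each thread sums its own stripe of voxels across all partial images and no locks or atomics are needed.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch, not STIR code: accumulate per-thread partial images
// into `output`. Each non-empty partial image is assumed to hold
// `num_voxels` values.
void accumulate_thread_images(float* output,
                              const std::vector<std::vector<float>>& thread_images,
                              std::size_t num_voxels)
{
#pragma omp parallel for
  for (long v = 0; v < static_cast<long>(num_voxels); ++v)
    {
      float sum = 0.f;
      for (const auto& partial : thread_images)
        if (!partial.empty()) // a thread may not have filled anything in
          sum += partial[v];
      output[v] += sum;
    }
}
```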
Note that LM times could improve when using set_cache_max_size(5000000), which would do some operations upfront and then re-read things from myCACHE*.bin.
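For reference, a rough sketch of how enabling the cache looks on the STIR side; the class and member names follow STIR's list-mode objective function as I understand it, but treat the exact header paths and call sequence as assumptions.

```cpp
#include "stir/recon_buildblock/PoissonLogLikelihoodWithLinearModelForMeanAndListModeDataWithProjMatrixByBin.h"
#include "stir/DiscretisedDensity.h"

// Sketch: cache up to 5e6 events, so corrections are computed once up-front
// and re-read from myCACHE*.bin on later gradient/Hessian calls. Setting the
// list-mode data, projector etc. is elided.
void enable_event_cache(
    stir::PoissonLogLikelihoodWithLinearModelForMeanAndListModeDataWithProjMatrixByBin<
        stir::DiscretisedDensity<3, float>>& obj_fun)
{
  obj_fun.set_cache_max_size(5000000);
}
```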
The MacOS job fails on the Hessian-concavity test. The output of H*dx is 0 on MacOS, but not on other systems. I have no idea why. Most likely it doesn't read the events properly, but it does read the correct number of events, so that's hard to understand. Here are the failure diagnostics:
----- testing concavity via Hessian-vector product (accumulate_Hessian_times_input)
INFO: Loaded 8019 prompts from list-mode file
INFO: Caching Additive corrections for : 8019 events.
INFO: Listmode gradient calculation: starting loop with 1 thread
INFO: Computation times for distributable_computation, CPU 0.03s, wall-clock 0.037424s
INFO: Loaded 8019 prompts from list-mode file
INFO: Caching Additive corrections for : 8019 events.
INFO: Listmode gradient calculation: starting loop with 1 thread
INFO: Computation times for distributable_computation, CPU 0.04s, wall-clock 0.037748s
Error : 0 is larger than 0,
FAIL: PoissonLLListModeData: Computation of x^T H x = 0.000000 > 0 (Hessian) and is therefore NOT concave
>target image max=0.999849
>target image min=1.16359e-05
>output image max=0
>output image min=0
All done, or as far as I can go. The objective function value is also computed and tested.
Note that some of the numerical tests aren't very stringent, and they probably can't be, as there are only 8019 prompts in the file.
I will have to disable the MacOS job and leave fixing it for later.
This PR is intended to add the objective-function value as well as the Hessian to the list-mode objective function. To avoid duplication between the cached and non-cached versions, the plan is to template the loop in LM_distributable_computation and call it with each batch read either from the cache or from the list-mode file. @gschramm @NikEfth Feedback welcome!
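To make the plan concrete, here is an illustrative sketch of the templating idea; all type and function names below are hypothetical placeholders, not the PR's actual code. One batch loop is shared by the objective-function value, gradient and Hessian-times-input, with the per-event work passed in as a callable and the batch coming either from the cache or from the list-mode file.

```cpp
#include <vector>

// Hypothetical placeholder types for the sketch
struct Event { int bin; float contribution; };
using Batch = std::vector<Event>;

struct BatchSource
{
  // Fills `batch` from the event cache if available, otherwise by reading
  // the list-mode file; returns false when no events are left.
  bool get_next_batch(Batch& batch) { batch.clear(); return false; } // stub
};

// The shared loop: the three computations only differ in the callable.
template <class EventOperation>
void LM_loop_over_batches(BatchSource& source, EventOperation op)
{
  Batch batch;
  while (source.get_next_batch(batch))
    for (const Event& event : batch)
      op(event); // accumulate the gradient/Hessian/value contribution
}
```

For the gradient, say, `op` would back-project the per-event weight into the output image; for the value it would just add the event's log-likelihood term, so the reading/caching logic is written once.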