Closed: KrisThielemans closed this 1 month ago
Step 2 committed, but a test is currently failing. It'll be for tomorrow.
@NikEfth Steps 1&2 completed. The variable naming is a bit awkward, but it minimises text changes. Best to compare while ignoring white-space diffs: https://github.com/UCL/STIR/pull/1418/files?diff=unified&w=1
Preparation is nearly done. Results for the gradient are still identical. I plan to template the loop function, and then call it from both the cached and non-cached code paths.
The Hessian calculation should now be functional, but it is untested ATM.
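For context, here is my summary of what the Hessian-times-input computes; this is standard Poisson log-likelihood algebra, and the notation is mine rather than the PR's. With measured counts $y$, system matrix $P$ and additive term $a$,

$$
L(\lambda) = \sum_i \bigl( y_i \log \bar y_i - \bar y_i \bigr), \qquad \bar y = P\lambda + a, \qquad H x = -P^\top \operatorname{diag}\!\left( \frac{y}{\bar y^{\,2}} \right) P x,
$$

so $x^\top H x \le 0$ for any image $x$: this is the concavity property that the test further down in this thread checks.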
@gschramm conducted some timings via SIRF (but that shouldn't matter). Original post at https://github.com/SyneRBI/SIRF/pull/1253#issuecomment-2107661865
Timing results of sino/LM gradient and Hessian-times-input computations for different numbers of subsets
mMR LM file with 4,332,816 counts,
acq_model = sirf.STIR.AcquisitionModelUsingRayTracingMatrix()
with 1 tangential LOR, and sirf.STIR.AcquisitionData.set_storage_scheme("memory").
7 subsets
In [14]: %timeit g = obj_fun.gradient(initial_image, subset = 0)
4.19 s ± 36.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [15]: %timeit hess_out_img = obj_fun.accumulate_Hessian_times_input(current_estimate, input_img, subset=0)
4.35 s ± 85.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [18]: %timeit g = lm_obj_fun.gradient(initial_image, subset=0)
4.18 s ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [19]: %timeit hess_out_img_lm = lm_obj_fun.accumulate_Hessian_times_input(current_estimate, input_img, subset=0)
4.13 s ± 46.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
14 subsets
In [22]: %timeit g = obj_fun.gradient(initial_image, subset = 0)
2.75 s ± 45.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [23]: %timeit hess_out_img = obj_fun.accumulate_Hessian_times_input(current_estimate, input_img, subset=0)
3.07 s ± 39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [26]: %timeit g = lm_obj_fun.gradient(initial_image, subset=0)
4.67 s ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [27]: %timeit hess_out_img_lm = lm_obj_fun.accumulate_Hessian_times_input(current_estimate, input_img, subset=0)
4.89 s ± 68.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
21 subsets
In [32]: %timeit g = obj_fun.gradient(initial_image, subset = 0)
2.35 s ± 37.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [33]: %timeit hess_out_img = obj_fun.accumulate_Hessian_times_input(current_estimate, input_img, subset=0)
2.69 s ± 58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [34]: %timeit g = lm_obj_fun.gradient(initial_image, subset=0)
6.35 s ± 199 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [35]: %timeit g = lm_obj_fun.gradient(initial_image, subset=0)
6.14 s ± 53.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
CPU info
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7662 64-Core Processor
Stepping: 0
Frequency boost: enabled
CPU MHz: 1499.555
CPU max MHz: 2000.0000
CPU min MHz: 1500.0000
BogoMIPS: 4000.00
Virtualization: AMD-V
L1d cache: 4 MiB
L1i cache: 4 MiB
L2 cache: 64 MiB
L3 cache: 512 MiB
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255
@gschramm In [35] is still a gradient?
My interpretation: this accumulation step in LM_distributable_computation is a significant cost:
if (output_image_ptr != NULL)
{
  // serial accumulation of the per-thread partial images into the output image
  for (int i = 0; i < static_cast<int>(local_output_image_sptrs.size()); ++i)
    if (!is_null_ptr(local_output_image_sptrs[i])) // only accumulate if a thread filled something in
      *output_image_ptr += *(local_output_image_sptrs[i]);
}
but the cost of this accumulation doesn't really depend on the number of subsets, which would explain why the LM times don't scale down with more subsets. (I guess we could use a reduction operation there; see the sketch below.)
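For illustration, a minimal sketch of the kind of reduction I mean (function name and the flat-array view of the image are my assumptions, not STIR code): instead of summing the per-thread images one after another, parallelise over voxels, so each thread sums its own stripe of voxels across all partial images and no locks or atomics are needed.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch, not STIR code: accumulate per-thread partial images
// into `output`. Each non-empty partial image is assumed to hold
// `num_voxels` values.
void accumulate_thread_images(float* output,
                              const std::vector<std::vector<float>>& thread_images,
                              std::size_t num_voxels)
{
#pragma omp parallel for
  for (long v = 0; v < static_cast<long>(num_voxels); ++v)
    {
      float sum = 0.f;
      for (const auto& partial : thread_images)
        if (!partial.empty()) // a thread may not have filled anything in
          sum += partial[v];
      output[v] += sum;
    }
}
```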
Note that LM times could improve when using set_cache_max_size(5000000), which would do some operations upfront and then re-read things from myCACHE*.bin.
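For reference, a rough sketch of how enabling the cache looks on the STIR side; the class and member names follow STIR's list-mode objective function as I understand it, but treat the exact header paths and call sequence as assumptions.

```cpp
#include "stir/recon_buildblock/PoissonLogLikelihoodWithLinearModelForMeanAndListModeDataWithProjMatrixByBin.h"
#include "stir/DiscretisedDensity.h"

// Sketch: cache up to 5e6 events, so corrections are computed once up-front
// and re-read from myCACHE*.bin on later gradient/Hessian calls. Setting the
// list-mode data, projector etc. is elided.
void enable_event_cache(
    stir::PoissonLogLikelihoodWithLinearModelForMeanAndListModeDataWithProjMatrixByBin<
        stir::DiscretisedDensity<3, float>>& obj_fun)
{
  obj_fun.set_cache_max_size(5000000);
}
```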
The MacOS job fails on the Hessian-concavity test. The output of H*dx is 0 on MacOS, but not on other systems. I have no idea why. Most likely it doesn't read the events properly, but it does read the correct number of events, so that's hard to understand. Here are the failure diagnostics:
----- testing concavity via Hessian-vector product (accumulate_Hessian_times_input)
INFO: Loaded 8019 prompts from list-mode file
INFO: Caching Additive corrections for : 8019 events.
INFO: Listmode gradient calculation: starting loop with 1 thread
INFO: Computation times for distributable_computation, CPU 0.03s, wall-clock 0.037424s
INFO: Loaded 8019 prompts from list-mode file
INFO: Caching Additive corrections for : 8019 events.
INFO: Listmode gradient calculation: starting loop with 1 thread
INFO: Computation times for distributable_computation, CPU 0.04s, wall-clock 0.037748s
Error : 0 is larger than 0,
FAIL: PoissonLLListModeData: Computation of x^T H x = 0.000000 > 0 (Hessian) and is therefore NOT concave
>target image max=0.999849
>target image min=1.16359e-05
>output image max=0
>output image min=0
All done, or as far as I can go. The objective function value is also computed and tested.
Note that some of the numerical tests aren't very stringent, and they probably can't be, as there are only 8019 prompts in the file.
I will have to disable the MacOS job and leave fixing it for later.
This PR is intended to add the objective-function value as well as the Hessian to the list-mode objective function. To avoid duplication between the cached and non-cached versions, the plan is to template the loop in LM_distributable_computation and call it with each batch read either from the cache or from the list-mode file. @gschramm @NikEfth Feedback welcome!
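To make the plan concrete, here is an illustrative sketch of the templating idea; all type and function names below are hypothetical placeholders, not the PR's actual code. One batch loop is shared by the objective-function value, gradient and Hessian-times-input, with the per-event work passed in as a callable and the batch coming either from the cache or from the list-mode file.

```cpp
#include <vector>

// Hypothetical placeholder types for the sketch
struct Event { int bin; float contribution; };
using Batch = std::vector<Event>;

struct BatchSource
{
  // Fills `batch` from the event cache if available, otherwise by reading
  // the list-mode file; returns false when no events are left.
  bool get_next_batch(Batch& batch) { batch.clear(); return false; } // stub
};

// The shared loop: the three computations only differ in the callable.
template <class EventOperation>
void LM_loop_over_batches(BatchSource& source, EventOperation op)
{
  Batch batch;
  while (source.get_next_batch(batch))
    for (const Event& event : batch)
      op(event); // accumulate the gradient/Hessian/value contribution
}
```

For the gradient, say, `op` would back-project the per-event weight into the output image; for the value it would just add the event's log-likelihood term, so the reading/caching logic is written once.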