TomographicImaging / CIL

A versatile python framework for tomographic imaging
https://tomographicimaging.github.io/CIL/
Apache License 2.0

KullbackLeibler All numba methods #1549

Closed epapoutsellis closed 1 month ago

epapoutsellis commented 10 months ago

There are a few problems with the numba implementation. Below is a list of comments based on the latest major release, numba=0.58 and llvmlite=0.41 (Sept 2023).

1) njit is better in my opinion, especially when you want to add more options to the decorator: dtypes, fastmath, etc.

2) In the case of parallel=True, it is better to default to half of the available threads, using set_num_threads for every jitted function.

3) Also, at the moment, the numba call method will raise the following warning:

The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.

To find out why, try turning on parallel diagnostics, see https://numba.readthedocs.io/en/stable/user/parallel.html#diagnostics for help.

File "../../../../../../tmp/ipykernel_2184/1742547634.py", line 40:
<source missing, REPL/exec in use?>

  warnings.warn(errors.NumbaPerformanceWarning(msg,

Also, running parallel_diagnostics(level=4) gives:

================================================================================
 Parallel Accelerator Optimizing:  Function kl_div_old_cil, 
/tmp/ipykernel_2184/1742547634.py (40)  
================================================================================
No source available
--------------------------------- Fusing loops ---------------------------------
Attempting fusion of parallel loops (combines loops with similar properties)...
----------------------------- Before Optimisation ------------------------------
--------------------------------------------------------------------------------
------------------------------ After Optimisation ------------------------------
Parallel structure is already optimal.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

---------------------------Loop invariant code motion---------------------------
Allocation hoisting:
No allocation hoisting found

Instruction hoisting:
No instruction hoisting found
--------------------------------------------------------------------------------

I believe this is because flat is used inside the prange iteration, so a new flatiter is returned on every iteration. In the case of the gradient method, with or without the mask, the kernel crashes. A sketch that combines points 1)-3) and avoids flat is shown below.
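For reference, a minimal sketch of what I have in mind: njit with the extra decorator options, half of the threads via set_num_threads, and 1D (raveled) inputs indexed directly so that no flatiter is created inside prange. The kernel body is only illustrative; edge cases and the exact CIL formula are omitted.

```python
import numpy as np
import numba
from numba import njit, prange

# Point 2): default to half of the available threads for all jitted functions.
numba.set_num_threads(max(1, numba.config.NUMBA_NUM_THREADS // 2))

@njit(parallel=True, fastmath=True)   # point 1): njit with extra options
def kl_div_sketch(x, b, eta):
    # x, b, eta are 1D float arrays (e.g. arr.ravel()); indexing them directly
    # inside prange avoids building a flatiter on every iteration.
    acc = 0.0
    for i in prange(x.size):
        X = x[i] + eta[i]
        if X > 0 and b[i] > 0:
            acc += X - b[i] + b[i] * np.log(b[i] / X)
    return acc
```

Passing arr.ravel() at the call site keeps this copy-free for C-contiguous arrays, see point 4) below.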

4) Using ravel() instead of flat is better in my opinion. In CIL, as_array() does not return a new copy, and ravel will not return a new copy either, just a view, since the array is C_CONTIGUOUS. In SIRF, as_array() does return a new copy, but ravel will still not copy, for the same reason. ravel also seems to be faster (quick check below).
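A quick way to verify the view/copy behaviour with plain numpy, independent of CIL/SIRF:

```python
import numpy as np

x = np.zeros((4, 5, 6), dtype=np.float32)    # C-contiguous by default
print(x.flags['C_CONTIGUOUS'])               # True
print(np.shares_memory(x, x.ravel()))        # True  -> ravel returns a view
print(np.shares_memory(x, x.flatten()))      # False -> flatten always copies
```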

5) I am not sure if mask is the right way to run the KL methods when we encounter division by zero. First of all, the mask as an array is never used; only the indices where mask[i] > 0 are. In practice, we assume that A is a positive operator, so a mask could be A*1. However, it is possible that under strange geometries this may not be true. Maybe it is better to update out at the indices where x + eta > 0 in the case of the gradient, as sketched below.
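Something along these lines is what I mean for the gradient: write out only at the indices where x + eta > 0 instead of taking a precomputed mask. Sketch only, with placeholder names and 1D raveled inputs.

```python
from numba import njit, prange

@njit(parallel=True, fastmath=True)
def kl_gradient_pos(x, b, out, eta):
    # Gradient of the KL term is 1 - b / (x + eta); only update out where the
    # denominator is strictly positive, so no mask array is needed.
    for i in prange(x.size):
        X = x[i] + eta[i]
        if X > 0:
            out[i] = 1.0 - b[i] / X
```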

6) When numpy functions are used inside jitted functions, fastmath=True is suggested, e.g., for np.sqrt (small example below).
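A small, self-contained example of what I mean (not CIL code):

```python
import numpy as np
from numba import njit

@njit(fastmath=True)   # lets LLVM relax FP ordering and vectorise np.sqrt
def rmse(x, y):
    acc = 0.0
    for i in range(x.size):
        d = x[i] - y[i]
        acc += d * d
    return np.sqrt(acc / x.size)
```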

See also #1535, #1541

epapoutsellis commented 10 months ago

Some benchmarks

from cil.framework import ImageGeometry

N, M, K = 500, 500, 500
ig = ImageGeometry(N, M, K)
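The construction of the arrays used in the timings is not shown here; something along these lines is assumed (xx_np, b_np, eta_np, out1_np, out2_np, mask_np and pos_ind are placeholders matching the names in the timings, not taken verbatim from the benchmark script):

```python
import numpy as np

# Assumed setup: allocate CIL images on ig and take their numpy arrays.
xx   = ig.allocate('random', seed=2)
b    = ig.allocate('random', seed=3)
eta  = ig.allocate(0.1)
out1 = ig.allocate(0)
out2 = ig.allocate(0)

xx_np, b_np, eta_np = xx.as_array(), b.as_array(), eta.as_array()
out1_np, out2_np = out1.as_array(), out2.as_array()

mask_np = (xx_np + eta_np) > 0           # boolean mask for the masked gradient
pos_ind = np.where(mask_np.ravel())[0]   # indices for the ravel+mask variant
```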

Call

%timeit -r 10 -n 10 kl_div_call_ravel(xx_np.ravel(), b_np.ravel(), eta_np.ravel())
58.2 ms ± 3.69 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
%timeit -r 10 -n 10 kl_div_call_flat(xx_np, b_np, eta_np)
1.28 s ± 12.9 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

Gradient

%timeit -r 20 -n 20 kl_gradient(xx_np, b_np, out1_np, eta_np)
39.2 ms ± 559 µs per loop (mean ± std. dev. of 20 runs, 20 loops each)
%timeit -r 20 -n 20 kl_gradient_ravel(xx_np.ravel(), b_np.ravel(), out2_np.ravel(), eta_np.ravel())
39.7 ms ± 715 µs per loop (mean ± std. dev. of 20 runs, 20 loops each)

Gradient(Mask)

%timeit -r 20 -n 20 kl_gradient_mask(xx_np, b_np, out1_np, eta_np, mask_np)
38.7 ms ± 460 µs per loop (mean ± std. dev. of 20 runs, 20 loops each)
%timeit -r 20 -n 20 kl_gradient_ravel_mask(xx_np.ravel(), b_np.ravel(), out2_np.ravel(), eta_np.ravel(), pos_ind)
The slowest run took 9.17 times longer than the fastest. This could mean that an intermediate result is being cached.
57.6 µs ± 55.7 µs per loop (mean ± std. dev. of 20 runs, 20 loops each)