Retrieve the pointer address of the offset buffer as hash. Pre-compute and cache the value of Ktb. Implement the feature memoized_expr to delay computation of offset + b in least_square.prox() function.
On average, it cuts ~30% execution time per call (860ms -> 600ms) to least_square.prox() on x86_64 CPU having 8 logical cores.
TODO(Antony): Too many level of caches already (Ktb in Numpy and F_Ktb in Halide). Eliminate all upstream caches.
Retrieve the pointer address of the offset buffer as hash. Pre-compute and cache the value of Ktb. Implement the feature
memoized_expr
to delay computation ofoffset + b
in least_square.prox() function.On average, it cuts ~30% execution time per call (860ms -> 600ms) to
least_square.prox()
on x86_64 CPU having 8 logical cores.TODO(Antony): Too many level of caches already (
Ktb
in Numpy andF_Ktb
in Halide). Eliminate all upstream caches.Resolves: #12 .