Cache Ktb in least_square

Retrieve the pointer address of the offset buffer and use it as the hash key. Pre-compute and cache the value of Ktb under that key. Implement the memoized_expr feature to delay the computation of offset + b in the least_square.prox() function.
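
For illustration only, a minimal Python sketch of the two ideas above. It is not the code in this commit: MemoizedExpr, apply_Kt, cached_Ktb, and stats are hypothetical names, apply_Kt is a placeholder for the real K^T b computation, and a stdlib array stands in for the offset buffer. A key based on the buffer address goes stale if the buffer is modified in place or freed and reallocated, so the sketch assumes the buffer stays alive and unchanged.

```python
from array import array


class MemoizedExpr:
    """Hypothetical stand-in for memoized_expr: runs fn lazily, at most once."""

    def __init__(self, fn):
        self._fn = fn
        self._done = False
        self._value = None

    def get(self):
        # Evaluate on first access only; later calls reuse the stored value.
        if not self._done:
            self._value = self._fn()
            self._done = True
        return self._value


_ktb_cache = {}          # (buffer address, length) -> cached Ktb
stats = {"evals": 0}     # counts how often the expensive step really runs


def apply_Kt(b):
    """Placeholder for the expensive K^T b product (here: doubles each entry)."""
    stats["evals"] += 1
    return [2.0 * v for v in b]


def cached_Ktb(b):
    """Return Ktb for buffer b, keyed on the buffer's memory address."""
    address, length = b.buffer_info()
    key = (address, length)
    if key not in _ktb_cache:
        _ktb_cache[key] = apply_Kt(b)
    return _ktb_cache[key]


if __name__ == "__main__":
    b = array("d", [1.0, 2.0, 3.0])
    offset = [0.5, 0.5, 0.5]

    cached_Ktb(b)
    cached_Ktb(b)  # second call is served from the cache
    print("Ktb evaluations:", stats["evals"])

    # offset + b is only computed when .get() is first called.
    shifted = MemoizedExpr(lambda: [o + v for o, v in zip(offset, b)])
    print(shifted.get())
```

Keying on the address avoids hashing the buffer contents on every call, at the cost of the staleness caveat noted above.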

On average, this cuts the execution time per call to least_square.prox() by ~30% (860 ms -> 600 ms) on an x86_64 CPU with 8 logical cores.

TODO(Antony): There are already too many levels of caches (Ktb in Numpy and F_Ktb in Halide). Eliminate all upstream caches.