(Adding the task dependencies for my own reminder.)
[x] Wait for the Halide 16.0 release.
[x] Refactor the Halide::BoundaryConditions calls to use the new APIs;
[x] Similarly, refactor Generator::* related code to use Halide 16.0 APIs;
[x] In algorithms/ladmm.py, ensure all NumPy matrices are Fortran-ordered by default; this avoids the repeated C-order-to-F-order conversion overhead in the (L-)ADMM iterations;
[x] Similarly, ensure the Halide-accelerated linear operators, e.g. A_mask.cpython.so, write to the output buffers in F-order rather than to orphan buffers that are immediately destroyed. This should fix the convergence-failure bugs whenever implem='Halide' is defined.
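A minimal NumPy sketch of the two conventions above (the array names are illustrative, not taken from ladmm.py): allocate state in Fortran order up front, and write results through `out=` so the caller's preallocated buffer is actually filled instead of being shadowed by a fresh allocation.

```python
import numpy as np

# Keep ADMM state column-major so repeated linear-algebra calls
# need no C-order -> F-order copies. (Illustrative shapes/names.)
rng = np.random.default_rng(0)
A = np.asfortranarray(rng.standard_normal((4, 3)))
x = np.ones((3, 1), order="F")
y = np.zeros((4, 1), order="F")  # preallocated output buffer

assert A.flags.f_contiguous and y.flags.f_contiguous

# Write into the existing buffer via `out=`. Rebinding instead
# (y = A @ x) would allocate a brand-new C-order array and leave the
# caller's buffer untouched -- the "orphan buffer" failure mode.
np.matmul(A, x, out=y)
```

The same rule applies to any extension module that fills an output buffer: the fix is to write through the buffer the caller handed in, not to return a newly allocated one.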
[ ] Migrate to C++20; this should cut the compile time in half thanks to the new C++ Concepts feature;
[ ] Simplify ladmm-iter-gen.cpp with the broadcast operator Halide::_.
[ ] Replace the Li2018 autoscheduler with Anderson2021: the latter utilizes the GPU cache and the shared memory in the SM far better.

References:
https://github.com/halide/Halide/pull/6856
https://github.com/halide/Halide/issues/7459