Closed samhatfield closed 5 months ago
This is a significant optimisation of the CPU code path. Credit owed to @marsdeno.
TCO1279, 48-node benchmark (--norms --truncation 1279 --niter 100 --nlev 137 --nfld 1 --vordiv --uvders --scders -v):
develop:
Inverse-direct transforms ------------------------- avg (s): 0.4258 min (s): 0.3726 max (s): 1.2771 med (s): 0.4168 loop (s): 50.9419
pre_allocated_buffers:
Inverse-direct transforms ------------------------- avg (s): 0.2227 min (s): 0.1793 max (s): 1.1176 med (s): 0.2128 loop (s): 30.9310
Almost 2x speed-up of the median transform time (0.4168 s down to 0.2128 s), with identical norms. Total loop time drops from 50.9419 s to 30.9310 s.
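For anyone who wants to check the ratios, here is a minimal sketch that recomputes the per-metric speed-up from the timings quoted above. The dictionary and function names are illustrative only and are not part of the benchmark program; the numbers are copied verbatim from the two output lines.

```python
# Timings (seconds) copied from the benchmark output quoted in this PR.
# The variable and function names are hypothetical, for illustration only.
develop = {
    "avg": 0.4258, "min": 0.3726, "max": 1.2771, "med": 0.4168, "loop": 50.9419,
}
pre_allocated_buffers = {
    "avg": 0.2227, "min": 0.1793, "max": 1.1176, "med": 0.2128, "loop": 30.9310,
}


def speedups(baseline, candidate):
    """Return baseline/candidate for every metric present in the baseline."""
    return {name: baseline[name] / candidate[name] for name in baseline}


ratios = speedups(develop, pre_allocated_buffers)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

The max time improves much less than the median, which is consistent with a slow first iteration dominating the maximum in both runs, although the output above does not confirm that.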
Looks good to me