Vectorize np loops in limiter_optim_iter_full

amametjanov commented 9 years ago

There are 8 np loops in the limiter_optim_iter_full subroutine in prim_advection_mod.F90. In most cases np is 4, so most of these loops have trip counts of 4-by-4, 4, or 16. Since the call to this subroutine is already inside a nested OMP parallel region, further improvement should come from SIMD. If vectorization is not possible, we should explore unrolling the loops by a factor of 4.

mrnorman commented 9 years ago

Are you talking about manually unrolling? I think this is something the compiler should be doing for us, right? Regarding SIMD, a lot of those loops (not all though) are reductions. Can SIMD instructions run on reduction loops? I know that for the GPU port, we don't thread down into the np x np loops because of reductions over these small np x np chunks of data.
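On the reduction question: the OpenMP 4.0 `simd` construct accepts a `reduction` clause, so a reduction loop can be vectorized in principle. Below is a minimal sketch in C (the Fortran directive would be `!$omp simd reduction(+:mass)`). The names `NP`, `block_mass`, `c`, and `x` are hypothetical and are not taken from prim_advection_mod.F90. Whether a compiler considers a 16-iteration loop profitable to vectorize is a separate question.

```c
#include <stddef.h>

#define NP 4  /* assumed compile-time block edge, mirroring np = 4 */

/* Weighted sum over one flattened NP x NP block (16 points).
   The reduction clause gives each SIMD lane its own partial sum;
   the partial sums are combined when the loop finishes. */
double block_mass(const double *c, const double *x)
{
    double mass = 0.0;
    #pragma omp simd reduction(+:mass)
    for (size_t k = 0; k < (size_t)(NP * NP); ++k)
        mass += c[k] * x[k];
    return mass;
}
```

With GCC the pragma takes effect under `-fopenmp-simd` (or `-fopenmp`); without either flag it is ignored and the loop stays scalar but still correct. Because lanes accumulate separately, the floating-point summation order can differ from the scalar loop.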

amametjanov commented 9 years ago

A few months ago I looked at the compiler-generated listings for other subroutines in derivative_mod.F90 and saw that neither unrolling nor SIMD was being applied. SIMD was rejected as 'not profitable'. IIRC, np was also not deduced to be a compile-time constant, which would have enabled further optimizations. I am logging this issue here to add it to our backlog. This subroutine is called over a million times.

I saw an improvement from manual unrolling in edge_mod.F90, based on GPTL timers. Will check how we do after integrating into ACME/models/atm/cam/src/dynamics/se/share/edge_mod.F90.
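To illustrate the two points above (a trip count the compiler cannot see, and manual unrolling by 4), here is a rough C sketch of a clipping loop over one 16-point block. It is not the actual code from edge_mod.F90 or prim_advection_mod.F90; `NP`, `clamp_value`, `clip_runtime`, and `clip_unrolled` are hypothetical names chosen for illustration.

```c
#define NP 4  /* assumed compile-time block edge, mirroring np = 4 */

/* Clamp a single value into [lo, hi]. */
static inline double clamp_value(double v, double lo, double hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Trip count known only at run time: the compiler has to handle
   an arbitrary n and cannot assume n * n == 16. */
void clip_runtime(double *x, int n, double lo, double hi)
{
    for (int k = 0; k < n * n; ++k)
        x[k] = clamp_value(x[k], lo, hi);
}

/* Trip count fixed at compile time and manually unrolled by 4:
   four outer iterations, each handling four points, no remainder loop. */
void clip_unrolled(double *x, double lo, double hi)
{
    for (int k = 0; k < NP * NP; k += 4) {
        x[k]     = clamp_value(x[k],     lo, hi);
        x[k + 1] = clamp_value(x[k + 1], lo, hi);
        x[k + 2] = clamp_value(x[k + 2], lo, hi);
        x[k + 3] = clamp_value(x[k + 3], lo, hi);
    }
}
```

Both functions produce the same result for a 4-by-4 block; any speed difference would have to be confirmed with timers, as was done for edge_mod.F90.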

E3SM-Project/transport_se, issue #13