Use all the directives: teams (use more than one SIMD processor) and parallel (use more than one thread). For the Cray Fortran compiler you also need the simd directive [to check on Archer2].
Make sure to expose enough parallelism (O(1000) iterations).
Combine teams distribute parallel for on the same loop if possible. This gets harder when the constructs are not in the same function and the callee is not inlined, and harder still when they are in different compilation units.
Use environment variables and compiler flags to track data movement: at run time CRAY_ACC_DEBUG, NVCOMPILER_ACC_NOTIFY, GOMP_DEBUG, LIBOMPTARGET_INFO; at compile time -Minfo=mp,accel, -Rpass=openmp-opt.
Prefer firstprivate for scalar variables (usually sent as a kernel parameter); private variables will end up in global or local memory.
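
As a minimal sketch of how these tips might combine on one loop, the C function below uses the combined target teams distribute parallel for construct, maps the arrays explicitly, and passes the scalar as firstprivate. The function name saxpy_offload and its arguments are hypothetical, chosen only for illustration. The simd clause is optional here and is included only to mirror the Cray Fortran remark above. Without an offload-capable compiler the pragma is ignored and the loop simply runs on the host.

```c
/* Hypothetical example: y[i] = a * x[i] + y[i] on the device.
 * The loop should have enough iterations (O(1000) or more) to be worth offloading. */
void saxpy_offload(int n, float a, const float *x, float *y)
{
    /* Combined construct: teams + distribute + parallel for (+ simd) on a single loop.
     * The scalar 'a' is firstprivate, so it can travel as a kernel parameter.
     * Explicit map clauses make the data movement visible in the runtime debug output. */
    #pragma omp target teams distribute parallel for simd \
        map(to: x[0:n]) map(tofrom: y[0:n]) firstprivate(a)
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Running the resulting binary with one of the runtime variables listed above set (the accepted values depend on the compiler) should then report the transfers of x and y around the kernel launch.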