Put CPU_TARGET and CUDA as the outermost loops

At the moment*, the build loop loads modules in this order: PrgEnv -> compiler -> CPUtarget -> CUDA -> MPI

This fails in a corner case: when a compiler needs to be built against different CPU targets, or when the compiler itself depends on CUDA. In both cases the compiler module is loaded before the modules it depends on. Ideally, we would want this: CPUtarget -> CUDA -> PrgEnv -> compiler -> MPI
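To make the difference between the two orders concrete, here is a minimal sketch of the nested loops in Python. It is not maali's actual implementation: the module names and versions in `VALUES` and the helper functions are hypothetical and chosen only for illustration. The first entry of each order is the outermost loop, the last is the innermost.

```python
from itertools import product

# Hypothetical module values, for illustration only (not from a real maali config).
VALUES = {
    "CPUtarget": ["haswell", "broadwell"],
    "CUDA": ["cuda/10.1"],
    "PrgEnv": ["PrgEnv-gnu"],
    "compiler": ["gcc/8.3.0", "clang/9.0.0"],
    "MPI": ["openmpi/4.0.2"],
}

# Orders as described in this issue: first entry = outermost loop.
CURRENT_ORDER = ["PrgEnv", "compiler", "CPUtarget", "CUDA", "MPI"]
PROPOSED_ORDER = ["CPUtarget", "CUDA", "PrgEnv", "compiler", "MPI"]


def load_sequences(order):
    """Return one module-load sequence per build combination.

    itertools.product varies its last argument fastest, so it behaves
    like nested for-loops with the first name in `order` outermost.
    """
    return [list(combo) for combo in product(*(VALUES[name] for name in order))]


def loaded_before(order, module_kind, dependency_kind):
    """True if `dependency_kind` is loaded before `module_kind` in `order`."""
    return order.index(dependency_kind) < order.index(module_kind)


if __name__ == "__main__":
    for name, order in (("current", CURRENT_ORDER), ("proposed", PROPOSED_ORDER)):
        ok = loaded_before(order, "compiler", "CUDA")
        print(f"{name}: CUDA available when the compiler is loaded? {ok}")
        for seq in load_sequences(order):
            print("   ", " -> ".join(seq))
```

With the current order, a compiler that needs CUDA or a specific CPU target is loaded before either is available. With the proposed order, both are already loaded by the time the compiler loop runs.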

This seems to imply a different module loading order in production (in scripts and interactive shells), so the documentation would need to be updated. This is why I am not doing it right now.

Besides, the only practical case I know of at the moment is the CLANG compiler, which Maciej is installing on Topaz and which relies on CUDA for CUDA-aware OpenMP. There's a temporary workaround, i.e. specifying CUDA as a PREREQ, not a compiler.

*This applies from version 1.6.10 onward (previously, CUDA came last, after MPI).

