Mrnorman/crm/reorder ncrms 2

Makes ncrms the fastest-varying dimension for all arrays in the 1-moment microphysics (ESMT is not supported yet).
Makes all automatic Fortran arrays in the ported code explicitly allocated so that they can use managed memory.
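
For reviewers less used to Fortran's column-major layout, here is a small language-neutral sketch (in Python) of what "fastest varying" means for memory strides. The array shapes, the dimension names `nx`, `ny`, `nz`, the value of `ncrms`, and the "old" layout with ncrms last are all assumptions for illustration, not taken from the actual code.

```python
def column_major_offset(index, shape):
    """Flat 0-based offset of `index` in a Fortran-ordered (column-major) array."""
    offset, stride = 0, 1
    for i, n in zip(index, shape):
        offset += i * stride   # the first dimension has stride 1
        stride *= n            # each later dimension strides over all earlier ones
    return offset

# Illustrative sizes only: 64x1x58 matches the CRM size quoted in this PR,
# ncrms = 4 is an arbitrary small value.
ncrms, nx, ny, nz = 4, 64, 1, 58

# Hypothetical layout with ncrms as the LAST (slowest-varying) dimension.
old_shape = (nx, ny, nz, ncrms)
old_stride = column_major_offset((0, 0, 0, 1), old_shape)

# Layout with ncrms as the FIRST (fastest-varying) dimension.
new_shape = (ncrms, nx, ny, nz)
new_stride = column_major_offset((1, 0, 0, 0), new_shape)

print("elements between neighbouring CRMs, ncrms last: ", old_stride)
print("elements between neighbouring CRMs, ncrms first:", new_stride)
```

With ncrms first, neighbouring CRM instances sit next to each other in memory, which is generally the access pattern GPUs handle best when threads are assigned across CRMs.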

I've run cuda-memcheck, and it runs clean on both the 2-D and 3-D code. After one model day of simulation, the results match the pre-commit CPU baseline to within the expected tolerance (defined by the difference between -O0 and -O3 builds). I tried to run valgrind, but it chokes on PGI-generated code on Summit. The M2005 micro option still works as well. ESMT is not supported yet on the GPU.
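
As a rough sketch of the tolerance idea above (not the actual comparison script used for this PR), the check can be thought of as follows. The function names, the relative-difference metric, and all numbers are made up for illustration.

```python
def max_rel_diff(a, b, tiny=1e-300):
    """Largest element-wise relative difference between two equal-length sequences."""
    return max(abs(x - y) / max(abs(x), abs(y), tiny) for x, y in zip(a, b))

def within_expected_tolerance(candidate, baseline_o0, baseline_o3):
    """True if `candidate` differs from the -O0 baseline by no more than
    the -O0 and -O3 baselines differ from each other (an assumed criterion)."""
    tolerance = max_rel_diff(baseline_o0, baseline_o3)
    return max_rel_diff(candidate, baseline_o0) <= tolerance

# Invented sample values standing in for model output after one model day.
baseline_o0 = [300.0, 0.002]
baseline_o3 = [300.003, 0.002]    # same code built with -O3
gpu_result = [300.001, 0.002]     # hypothetical result from the ported code
broken_result = [300.3, 0.002]    # hypothetical result with a real bug

print(within_expected_tolerance(gpu_result, baseline_o0, baseline_o3))
print(within_expected_tolerance(broken_result, baseline_o0, baseline_o3))
```

The point of using the -O0 versus -O3 spread is that it gives a measure of how much the answer moves from floating-point reordering alone, so differences of that size do not indicate a porting bug.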

This code will run quite well for Gordon Bell runs, where we only need 6 MPI tasks per node. However, the use of CUDA Managed Memory (via -ta=nvidia,managed with PGI) causes problems with the CUDA MPS (Multi-Process Service). Thus, I currently limit GPU runs to 18 MPI tasks per node. This is problematic because, with fewer tasks per node, the CPU code now takes longer. NVIDIA is currently looking into this. For the ES configuration (64x1x58 CRMs, an MSA factor of 2, and 4 rad columns), we can achieve 0.72 SYPD for the whole model for F compsets without I/O. This is so slow because the CPU code takes a long time (namely, radiation). The GPU-ported code itself runs at over 3 SYPD.
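
To put the two throughput numbers side by side, here is a back-of-the-envelope sketch. It assumes that "over 3 SYPD" is the rate the GPU-ported code would sustain on its own and that the GPU and CPU portions run one after the other; neither assumption is verified here, and the variable names are illustrative.

```python
def wall_days_per_sim_year(sypd):
    """Convert simulated years per day (SYPD) into wall-clock days per simulated year."""
    return 1.0 / sypd

whole_model_sypd = 0.72   # quoted for the whole model, F compsets, no I/O
gpu_code_sypd = 3.0       # quoted as "over 3", so this is a lower bound

total_days = wall_days_per_sim_year(whole_model_sypd)
gpu_days = wall_days_per_sim_year(gpu_code_sypd)   # upper bound on GPU-code time
other_days = total_days - gpu_days                 # lower bound on everything else
other_share = other_days / total_days

print(f"total: {total_days:.3f} days per simulated year")
print(f"GPU-ported code: at most {gpu_days:.3f} days")
print(f"everything else: at least {other_days:.3f} days ({other_share:.0%})")
```

Under those assumptions, roughly three quarters of the wall time is spent outside the GPU-ported code, which is consistent with the CPU side (radiation in particular) being the bottleneck.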

Expect a future PR that enables threading and runs the CRM on the master thread. This should reduce MPS pressure and speed up the CPU code while keeping the GPU in ideal conditions (i.e., larger kernels).

E3SM-Project / ACME-ECP

Mrnorman/crm/reorder ncrms 2 #93