Unifying cpu/external and gpu/external

I'm thinking ahead to when redgreen-optimized has been merged into develop. After this merge, src/trans will have the following structure (ext: external, int: internal):

                    src/trans
                   /         \
                  /           \
                cpu           gpu
              /     \        /   \
             /       \      /     \
           ext       int  ext     int

cpu/internal and gpu/internal are of course substantially different, and probably not worth trying to combine at this stage (the same goes for algor etc., not shown). But there is a lot of overlap between cpu/external and gpu/external, so I'm wondering if we can somehow combine these two to give:

                    src/trans
                   /    |   \
                  /     |    \
                 /      |     \
                /       |      \
              ext    cpu_int  gpu_int

In the relevant CMakeLists.txt, we would just have to modify the source file lists for the trans_cpu and trans_gpu libraries.

Here's a breakdown of the differences in every file between cpu/external and gpu/external:

dir_transad.F90: only whitespace differences.
dir_trans.F90: GSTATS overloaded by GSTATS_NVTX -> put the overload inside a #if defined(__NVCOMPILER) block?
dist_grid_32.F90: only whitespace differences.
dist_grid.F90: only whitespace differences.
dist_spec.F90: GPU version is out of date, but could be updated.
gath_grid_32.F90: only whitespace differences.
gath_grid.F90: only whitespace differences.
gath_spec.F90: only whitespace differences.
get_current.F90: only module USE differences - trivial.
gpnorm_trans.F90: significant differences. However, gpu/internal/gpnorm_trans_ctl.F90 could be reinstated so that the cpu/gpu differences are hidden behind a unified GPNORM_TRANS subroutine.
gpnorm_trans_gpu.F90: do we actually need this?
ini_spec_dist.F90: only whitespace differences.
inv_transad.F90: only whitespace differences.
inv_trans.F90: same issue with GSTATS_NVTX as above. Also arguments seem to be validated slightly differently between cpu and gpu, but I think these should be the same. Probably resolvable.
setup_trans0.F90: only meaningful differences are in the GPU pinning logic, but this could be wrapped in preprocessor statements so it's only compiled when targeting GPUs.
setup_trans.F90: big differences. This is where device memory is allocated. For now, wrap this in GPU-specific preprocessor regions?
specnorm.F90: literally identical.
sugawc.F90: literally identical.
trans_end.F90: some device deallocation code, which could be wrapped in GPU-specific preprocessor regions.
trans_inq.F90: some variable cast differences, but nothing meaningful.
trans_pnm.F90: references to JPRBT in GPU version which could be kept.
trans_release.F90: literally identical.
vordiv_to_uv.F90: only whitespace differences.
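
To make the two preprocessor ideas in the list above concrete, here is a minimal, self-contained sketch. It is not ecTrans code. The module STATS_SKETCH, its stub routines, and the macro TRANS_GPU_SKETCH are hypothetical names invented for illustration. Only __NVCOMPILER is a real macro, predefined by the NVIDIA HPC compilers.

```fortran
! unified_sketch.F90: the capital .F90 suffix makes most compilers run the preprocessor.
! Everything here is a hypothetical stand-in, not the real ecTrans routines.
MODULE STATS_SKETCH
  IMPLICIT NONE
CONTAINS
  ! Stand-in for the plain timing routine
  SUBROUTINE GSTATS(KNUM, KSWITCH)
    INTEGER, INTENT(IN) :: KNUM, KSWITCH
    PRINT '(A,I0,A,I0)', 'GSTATS: counter ', KNUM, ', switch ', KSWITCH
  END SUBROUTINE GSTATS

  ! Stand-in for the NVTX-instrumented variant
  SUBROUTINE GSTATS_NVTX(KNUM, KSWITCH)
    INTEGER, INTENT(IN) :: KNUM, KSWITCH
    PRINT '(A,I0,A,I0)', 'GSTATS_NVTX: counter ', KNUM, ', switch ', KSWITCH
  END SUBROUTINE GSTATS_NVTX
END MODULE STATS_SKETCH

PROGRAM UNIFIED_SKETCH
  ! Pattern 1: select the overload at compile time, so that the body of the
  ! routine can call GSTATS unconditionally in both builds.
#if defined(__NVCOMPILER)
  USE STATS_SKETCH, ONLY: GSTATS => GSTATS_NVTX
#else
  USE STATS_SKETCH, ONLY: GSTATS
#endif
  IMPLICIT NONE

  CALL GSTATS(1, 0)

  ! Pattern 2: device-only work (pinning, device allocation and deallocation)
  ! lives in a guarded region that a CPU build never compiles.
#if defined(TRANS_GPU_SKETCH)
  PRINT '(A)', 'GPU build: device setup would go here'
#endif

  PRINT '(A)', 'Code shared by the CPU and GPU builds'

  CALL GSTATS(1, 1)
END PROGRAM UNIFIED_SKETCH
```

With gfortran, for example, `gfortran unified_sketch.F90` builds the CPU variant and adding `-DTRANS_GPU_SKETCH` enables the guarded region. One caveat: __NVCOMPILER identifies the compiler rather than the target, so it is also defined when nvfortran builds the CPU library. A dedicated macro set by the build system for trans_gpu may be the safer guard.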

So most files are basically the same already, and only two or three would require some thought. This seems like an obvious way to reduce the complexity and code volume of the library.

ecmwf-ifs/ectrans #96