I'm thinking ahead to when redgreen-optimized has been merged into develop. After that merge, src/trans will have this structure (ext: external, int: internal):
          src/trans
           /     \
          /       \
       cpu         gpu
      /   \       /   \
     /     \     /     \
   ext     int ext     int
cpu/internal and gpu/internal are of course substantially different, and probably not worth trying to combine at this stage (same for algor etc., not shown). But there is a lot of overlap between cpu/external and gpu/external so I'm wondering if we can somehow combine these two to make
In the relevant CMakeLists.txt, we would just have to modify the source file lists for the trans_cpu and trans_gpu libraries.
Here's a breakdown of the differences in every file between cpu/external and gpu/external:
dir_transad.F90: only whitespace differences.
dir_trans.F90: GSTATS overloaded by GSTATS_NVTX -> put overload in a #if defined(__NVCOMPILER) statement?
dist_grid_32.F90: only whitespace differences.
dist_grid.F90: only whitespace differences.
dist_spec.F90: GPU version is out of date, but could be updated.
gath_grid_32.F90: only whitespace differences.
gath_grid.F90: only whitespace differences.
gath_spec.F90: only whitespace differences.
get_current.F90: only module USE differences - trivial.
gpnorm_trans.F90: significant differences, however gpu/internal/gpnorm_trans_ctl.F90 could be reinstated so cpu/gpu differences are hidden by unified GPNORM_TRANS subroutine.
gpnorm_trans_gpu.F90: do we actually need this?
ini_spec_dist.F90: only whitespace differences.
inv_transad.F90: only whitespace differences.
inv_trans.F90: same issue with GSTATS_NVTX as above. Also arguments seem to be validated slightly differently between cpu and gpu, but I think these should be the same. Probably resolvable.
setup_trans0.F90: only meaningful differences are in the GPU pinning logic, but this could be wrapped in preprocessor statements so it's only compiled when targeting GPUs.
setup_trans.F90: big differences. This is where device memory is allocated. For now, wrap this in GPU-specific preprocessor regions?
specnorm.F90: literally identical.
sugawc.F90: literally identical.
trans_end.F90: some device deallocation code, which could be wrapped in GPU-specific preprocessor regions.
trans_inq.F90: some variable cast differences, but nothing meaningful.
trans_pnm.F90: references to JPRBT in GPU version which could be kept.
trans_release.F90: literally identical.
vordiv_to_uv.F90: only whitespace differences.
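For anyone who wants to reproduce or refresh this breakdown after the merge, here is a minimal sketch of how the "literally identical" / "only whitespace differences" triage could be automated. The function names `classify` and `compare_dirs` are hypothetical helpers made up for illustration, not part of the repository, and the script only separates the trivial cases; files reported as "different" still need a manual look.

```python
from pathlib import Path


def classify(cpu_text: str, gpu_text: str) -> str:
    """Classify a pair of source texts (hypothetical helper)."""
    if cpu_text == gpu_text:
        return "identical"
    # Strip ALL whitespace before comparing, so changes in indentation,
    # blank lines, or spacing around commas are not reported as real differences.
    if "".join(cpu_text.split()) == "".join(gpu_text.split()):
        return "whitespace-only"
    return "different"


def compare_dirs(cpu_dir: Path, gpu_dir: Path) -> dict[str, str]:
    """Classify every .F90 file in cpu_dir against its namesake in gpu_dir."""
    report = {}
    for cpu_file in sorted(cpu_dir.glob("*.F90")):
        gpu_file = gpu_dir / cpu_file.name
        if not gpu_file.is_file():
            report[cpu_file.name] = "missing in gpu"
            continue
        report[cpu_file.name] = classify(
            cpu_file.read_text(), gpu_file.read_text()
        )
    return report
```

It would be called as `compare_dirs(Path("src/trans/cpu/external"), Path("src/trans/gpu/external"))` from the repository root. Files that exist only on the GPU side (such as gpnorm_trans_gpu.F90) are not listed, since the loop walks the CPU directory.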
So most files are basically the same already, and only two or three would require some thought. Seems like an obvious way to reduce the complexity and code volume of the library.