Closed samhatfield closed 5 days ago
Tests failing for unknown reasons. Investigating...
Tests were failing because the arrays containing time step stats were not resized for the extended (by two elements) time step loop in the benchmark program. Intel and NVHPC were happy with this for some reason.
I think we probably want to aim for a single (unified) interface folder, rather than have separate ones for cpu and gpu?
I think we probably want to aim for a single (unified) interface folder, rather than have separate ones for cpu and gpu?
I guess it could be done after this PR
I think we probably want to aim for a single (unified) interface folder, rather than have separate ones for cpu and gpu?
I guess it could be done after this PR
We need to work on a unified interface for any combination of double, single, cpu, gpu. Part of this work is already in develop on avoiding duplicate symbols.
I can confirm that all norms produced by the tests are identical between the CPU version in this branch and develop. To check this I had the norms before and after the benchmark loop dumped in hexadecimal to a text file and compared this between the full CTest suite between both branches. That's a nice feature we might like to implement properly at some point.
The include directories are definitely borked in this branch. Until we have unified interfaces, for now I suggest we aim for
install
|
--> include
|
--> ectrans
|
--> cpu
--> gpu
I tried implementing this first for the CPU version but I got lost and ended up creating an cpu/cpu directory at some point. Need a bit of help there.
Looking great so far. Some conflicts need resolving. I suggest merging 'develop' into this branch once more to resolve conflicts.
Is something substantial holding this back? (perhaps holidays :) )
Apologies, I should have given an update here. One big thing which makes me hesitant to merge is the fact that this branch is not yet usable within the IFS even on Nvidia platforms. When I use this branch within IFS-RAPS, I get numerical instabilities. This issue only appeared a couple weeks ago when I gave the branch a spin in RAPS on JSC/JURECA-DC (A100). This was a surprise to me as I successfully used an earlier version of rgo-develop within RAPS with (GPU enabled). I'm sure we can get to the bottom of it, but it's hard to say how long it will take.
We could just get this PR over the line now with big warning signs that it is unstable. Or we could nail this bug beforehand. Any thoughts from others?
I'd say as long as CPU and GPU works in the standalone benchmark and doesn't add regression over current behaviour (i.e., CPU version fully functional in IFS) there's nothing that should prohibit merging this and fixing incrementally going forward.
Great, so all good for integration. Could I please propose a squash-merge of this branch into develop to keep history manageable?
By all means, do whatever you think will give us the cleanest history.
I will attempt to merge the latest develop into rgo-develop now.
Oh ffs. I have to open a new PR now?
This will bring GPU capability into the develop branch.
@wdeconinck and I discussed again the issue of how to do this merge yesterday. As @wdeconinck pointed out before, if we just merge develop into redgreen-optimized before submitting a PR of the latter, we should be able to do the merge in one go rather than first preparing develop with the cpu/gpu subdirectory structure. That's what I've tried here. It seems to have worked, so if we merge this PR we no longer need https://github.com/ecmwf-ifs/ectrans/pull/103.
This is a draft as I want to carry out tests on both cpu and gpu versions first. According to Meld there are (literally) no differences between src/trans in develop and src/trans/cpu in rgo-develop. So I think this PR shouldn't screw anything up on the CPU side at least.
Note that this branch is still not stable on AMD platforms due to an outstanding issue with HIPFFT. Solving this is on the todo list. In the meantime, for AMD support you have to fall back on redgreengpu, which is the unoptimised but more portable branch.