ecmwf-ifs / ectrans

Global spherical harmonics transforms library underpinning the IFS
Apache License 2.0
15 stars 30 forks source link

Merge redgreen-optimized into develop #106

Closed samhatfield closed 5 days ago

samhatfield commented 1 month ago

This will bring GPU capability into the develop branch.

@wdeconinck and I discussed again the issue of how to do this merge yesterday. As @wdeconinck pointed out before, if we just merge develop into redgreen-optimized before submitting a PR of the latter, we should be able to do the merge in one go rather than first preparing develop with the cpu/gpu subdirectory structure. That's what I've tried here. It seems to have worked, so if we merge this PR we no longer need https://github.com/ecmwf-ifs/ectrans/pull/103.

This is a draft as I want to carry out tests on both cpu and gpu versions first. According to Meld there are (literally) no differences between src/trans in develop and src/trans/cpu in rgo-develop. So I think this PR shouldn't screw anything up on the CPU side at least.

Note that this branch is still not stable on AMD platforms due to an outstanding issue with HIPFFT. Solving this is on the todo list. In the meantime, for AMD support you have to fall back on redgreengpu, which is the unoptimised but more portable branch.

samhatfield commented 1 month ago

Tests failing for unknown reasons. Investigating...

samhatfield commented 1 month ago

Tests were failing because the arrays containing time step stats were not resized for the extended (by two elements) time step loop in the benchmark program. Intel and NVHPC were happy with this for some reason.

marsdeno commented 1 month ago

I think we probably want to aim for a single (unified) interface folder, rather than have separate ones for cpu and gpu?

marsdeno commented 1 month ago

I think we probably want to aim for a single (unified) interface folder, rather than have separate ones for cpu and gpu?

I guess it could be done after this PR

wdeconinck commented 1 month ago

I think we probably want to aim for a single (unified) interface folder, rather than have separate ones for cpu and gpu?

I guess it could be done after this PR

We need to work on a unified interface for any combination of double, single, cpu, gpu. Part of this work is already in develop on avoiding duplicate symbols.

samhatfield commented 1 month ago

I can confirm that all norms produced by the tests are identical between the CPU version in this branch and develop. To check this I had the norms before and after the benchmark loop dumped in hexadecimal to a text file and compared this between the full CTest suite between both branches. That's a nice feature we might like to implement properly at some point.

samhatfield commented 1 month ago

The include directories are definitely borked in this branch. Until we have unified interfaces, for now I suggest we aim for

install
|
 --> include
     |
      --> ectrans
          |
           --> cpu
           --> gpu

I tried implementing this first for the CPU version but I got lost and ended up creating an cpu/cpu directory at some point. Need a bit of help there.

wdeconinck commented 3 weeks ago

Looking great so far. Some conflicts need resolving. I suggest merging 'develop' into this branch once more to resolve conflicts.

wdeconinck commented 5 days ago

Is something substantial holding this back? (perhaps holidays :) )

samhatfield commented 5 days ago

Apologies, I should have given an update here. One big thing which makes me hesitant to merge is the fact that this branch is not yet usable within the IFS even on Nvidia platforms. When I use this branch within IFS-RAPS, I get numerical instabilities. This issue only appeared a couple weeks ago when I gave the branch a spin in RAPS on JSC/JURECA-DC (A100). This was a surprise to me as I successfully used an earlier version of rgo-develop within RAPS with (GPU enabled). I'm sure we can get to the bottom of it, but it's hard to say how long it will take.

We could just get this PR over the line now with big warning signs that it is unstable. Or we could nail this bug beforehand. Any thoughts from others?

reuterbal commented 5 days ago

I'd say as long as CPU and GPU works in the standalone benchmark and doesn't add regression over current behaviour (i.e., CPU version fully functional in IFS) there's nothing that should prohibit merging this and fixing incrementally going forward.

wdeconinck commented 5 days ago

Great, so all good for integration. Could I please propose a squash-merge of this branch into develop to keep history manageable?

samhatfield commented 5 days ago

By all means, do whatever you think will give us the cleanest history.

samhatfield commented 5 days ago

I will attempt to merge the latest develop into rgo-develop now.

samhatfield commented 5 days ago

Oh ffs. I have to open a new PR now?