E3SM-Project / Omega

Next generation ocean model within E3SM
https://docs.e3sm.org/Omega/omega

Update Halo class to allow halo exchanges for device arrays #163

Open brian-oneill opened 1 week ago

brian-oneill commented 1 week ago

This PR updates the Halo class to allow halo exchanges of arrays allocated in device memory space as well as host memory space. With this update, Omega can take advantage of GPU-aware MPI implementations.

Changes include:

Successfully built and passed unit tests with OMEGA_MPI_ON_DEVICE both on and off on Chrysalis (intel), Perlmutter CPU (intel) and GPU (nvidiagpu), and Frontier CPU (crayclang) and GPU (crayclanggpu).

Checklist

rljacob commented 1 week ago

Curious why you test Perlmutter GPU with nvidia instead of gnu (which is what E3SM and SCREAM test with).

grnydawn commented 1 week ago

> Curious why you test Perlmutter GPU with nvidia instead of gnu (which is what E3SM and SCREAM test with).

@rljacob, I think we want Omega to support both the NVIDIA and GNU compilers on Perlmutter GPU nodes. Since Perlmutter uses NVIDIA GPUs, I think we typically test Omega with the NVIDIA compiler first and the GNU compiler next. However, if E3SM and SCREAM treat GNU as the primary compiler on Perlmutter, we may adopt the same preference.

rljacob commented 1 week ago

We have not actually done a performance comparison between nvidia, gnu, and intel on Perlmutter GPUs, but we have seen nvidia have trouble with some of the Fortran code. gnu is preferred unless there is evidence that another one is better.

mark-petersen commented 5 days ago

Confirmed that this PR passes unit tests on Frontier with both CPU and GPU. The unit tests show that this is working correctly because they create arrays with unique values per cell, do a halo exchange, and then compute the error.
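The test pattern described above (unique values per cell, a halo exchange, then an error check) can be illustrated with a minimal, MPI-free sketch. Everything in it is an assumption made for illustration and not Omega's actual Halo API or test code: the ranks are simulated inside one process, the domain is a periodic 1D line with a halo width of one cell (Omega's real mesh is unstructured), and the names `RankArray`, `globalID`, `initArrays`, `exchangeHaloSketch`, and `haloError` are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical layout for this sketch: each simulated rank stores
// [left halo | owned cells | right halo] with a halo width of one cell.
struct RankArray {
   std::vector<double> Data; // size NOwned + 2
};

// Global cell ID for local index I on rank Rank in a periodic 1D domain.
// Local index 0 is the left halo, so owned cell 1 maps to offset 0.
int globalID(int Rank, int I, int NOwned, int NRanks) {
   int NGlobal = NOwned * NRanks;
   int ID      = Rank * NOwned + (I - 1);
   return ((ID % NGlobal) + NGlobal) % NGlobal; // wrap around periodically
}

// Fill owned cells with their unique global ID; halos start at a sentinel.
std::vector<RankArray> initArrays(int NRanks, int NOwned) {
   std::vector<RankArray> Arrays(NRanks);
   for (int R = 0; R < NRanks; ++R) {
      Arrays[R].Data.assign(NOwned + 2, -1.0);
      for (int I = 1; I <= NOwned; ++I)
         Arrays[R].Data[I] = globalID(R, I, NOwned, NRanks);
   }
   return Arrays;
}

// Stand-in for a halo exchange: copy each neighbor's boundary owned cell
// into this rank's halo cell. A real exchange would use MPI sends/receives.
void exchangeHaloSketch(std::vector<RankArray> &Arrays, int NOwned) {
   int NRanks = static_cast<int>(Arrays.size());
   for (int R = 0; R < NRanks; ++R) {
      int Left  = (R - 1 + NRanks) % NRanks;
      int Right = (R + 1) % NRanks;
      Arrays[R].Data[0]          = Arrays[Left].Data[NOwned];
      Arrays[R].Data[NOwned + 1] = Arrays[Right].Data[1];
   }
}

// Sum of absolute differences between every stored value (owned and halo)
// and the global ID expected at that location.
double haloError(const std::vector<RankArray> &Arrays, int NOwned) {
   int NRanks = static_cast<int>(Arrays.size());
   double Err = 0.0;
   for (int R = 0; R < NRanks; ++R)
      for (int I = 0; I <= NOwned + 1; ++I)
         Err += std::fabs(Arrays[R].Data[I] - globalID(R, I, NOwned, NRanks));
   return Err;
}
```

With, say, 4 simulated ranks of 3 owned cells each, `haloError` is nonzero right after `initArrays` because the halos still hold the sentinel, and it drops to exactly zero after `exchangeHaloSketch`. Because every cell carries a unique value, a halo filled from the wrong neighbor or the wrong cell would show up as a nonzero error.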

Awaiting timing tests from Kieran Ringel for performance comparison between this new halo exchange on device and the previous halo exchange on host (gpu versus cpu).

kieran-ringel commented 1 day ago

[Image: Screenshot 2024-11-25 at 10 51 31 AM]

Timing results for this PR with OMEGA_MPI_ON_DEVICE turned on and off (indicated in the second half of the name in the legend).

mwarusz commented 1 day ago

> Timing results for this PR with OMEGA_MPI_ON_DEVICE turned on and off (indicated in the second half of the name in the legend)

@kieran-ringel Note that, depending on what exactly you measured, this bug https://github.com/E3SM-Project/Omega/pull/163#discussion_r1855850029 might have affected these results, since it causes the state and tracer halo exchanges to exchange host arrays only.

kieran-ringel commented 1 day ago

Updated timing results, now that the State and Tracer exchangeHalo functions exchange device arrays:

[Image: Screenshot 2024-11-25 at 1 11 55 PM]