GEOS-DEV / GEOS

GEOS Simulation Framework
GNU Lesser General Public License v2.1

Host configs for new LLNL machines + cuda12 #3067

Closed victorapm closed 3 months ago

victorapm commented 6 months ago

Add new host configs:

See https://github.com/GEOS-DEV/LvArray/pull/319 and https://github.com/GEOS-DEV/thirdPartyLibs/pull/270

codecov[bot] commented 6 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 55.74%. Comparing base (c74702a) to head (ac3b832). Report is 89 commits behind head on develop.

Additional details and impacted files

```diff
@@           Coverage Diff            @@
##           develop    #3067   +/-   ##
========================================
  Coverage    55.74%   55.74%
========================================
  Files         1038     1038
  Lines        88470    88470
========================================
  Hits         49320    49320
  Misses       39150    39150
```

TotoGaz commented 6 months ago

@sframba Do you know if CUDA 12 is supported on Pangea3? @matteofrigo5 Same question for Sherlock.

sframba commented 5 months ago

> @sframba Do you know if CUDA 12 is supported on Pangea3? @matteofrigo5 Same question for Sherlock.

Not sure, we'll have to ask IBM support. We upgraded to CUDA 11.5.0 at the end of 2023. Is CUDA 12 needed to solve the externalSolvers unit test issue, or to access some new hypre features on GPU?

matteofrigo5 commented 5 months ago

I don't see any constraints regarding Sherlock. We have these four versions of CUDA 12 installed: 12.0.0, 12.1.1, 12.2.0, and 12.4.0. @victorapm, do you recommend one in particular?

TotoGaz commented 5 months ago

> Not sure, we'll have to ask IBM support. We upgraded to CUDA 11.5.0 at the end of 2023. Is CUDA 12 needed to solve the externalSolvers unit test issue, or to access some new hypre features on GPU?

It's mainly for this.

CusiniM commented 3 months ago

@rrsettgast @wrtobin @castelletto1 I have installed the TPLs for the ruby builds and for the lassen cuda-12 one.

On ruby:

/usr/tce/packages/gcc/gcc-12.1.1/bin/ld: /usr/gapps/GEOSX/thirdPartyLibs/2024-06-19/install-ruby-gcc-12-release/trilinos/lib/libtpetra.so.13.4.1: undefined reference to `KokkosSparse::Impl::SPMV<double const, int const, Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::MemoryTraits<1u>, int const, double const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::MemoryTraits<3u>, double*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::MemoryTraits<1u>, true, true>::spmv(KokkosKernels::Experimental::Controls const&, char const*, double const&, KokkosSparse::CrsMatrix<double const, int const, Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::MemoryTraits<1u>, int const> const&, Kokkos::View<double const*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::MemoryTraits<3u> > const&, double const&, Kokkos::View<double*, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::MemoryTraits<1u> > const&)' 

We are on an oldish version of Trilinos (13.4.1), which is from 2022. We could try moving to the latest release (https://github.com/trilinos/Trilinos/releases), which is from a couple of months ago, and see if that fixes the issue.

I don't have a bank on dane yet, so I can't do much there, but I am pretty sure that we can get everything to work on that system too.

Anyway, since we have one fully working build, if you are okay with it, I think we can merge this and abandon quartz in favor of ruby.

victorapm commented 3 months ago

> I have not tested the new lassen build with cuda-12 yet, but I think @victorapm did and it should be working.

After some trial and error with different compilers, it works!

https://github.com/GEOS-DEV/LvArray/blob/fe2ad691194321af4f5f9cab3593016d8d0fc645/host-configs/LLNL/lassen-clang-13-cuda-12.cmake#L3-L10

PS: the build on Dane also works fine (tested it with a couple of simulations)

CusiniM commented 3 months ago

> @CusiniM aren't dane and ruby the same stack? Can we merge the files?

Yeah, that's why I created llnl-cpu-base.cmake. I think it's convenient to keep the files separate in case we ever want or need to customize something, but they should be identical AFAIK. I would also expect binaries to work on both systems. The only difference would be the one ats parameter. I don't have a bank on dane yet, so I have not been able to do much testing there.
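
For illustration, the shared-base layout could look roughly like the fragment below. This is only a sketch and not the contents of the files in this PR: the file name `dane-gcc-12.cmake`, the `ATS_ARGUMENTS` variable, and its value are hypothetical placeholders. Only `llnl-cpu-base.cmake` is named in the thread.

```cmake
# dane-gcc-12.cmake (hypothetical file name, for illustration only)

# Name of this configuration (assumed convention for build/install directory names).
set(CONFIG_NAME "dane-gcc-12" CACHE PATH "")

# The one machine-specific setting: the ATS parameter.
# Both the variable name and the value are placeholders, not taken from the PR.
set(ATS_ARGUMENTS "--machine <dane-machine-name>" CACHE STRING "")

# Everything else (compilers, MPI, TPL paths) comes from the shared base file,
# which is assumed to sit in the same directory as this host config.
include(${CMAKE_CURRENT_LIST_DIR}/llnl-cpu-base.cmake)
```

A ruby host config would then be the same few lines with its own name and ATS value, so the two machines only diverge where a customization is actually needed.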

CusiniM commented 3 months ago

@sframba can you guys have a look at why this cuda-12 build is crashing when building ElasticFirstOrderWaveEquationSEM?