GEOS-DEV / GEOS

GEOS Simulation Framework
GNU Lesser General Public License v2.1

GEOSX compilation on ORNL/Summit #1334

Closed rrsettgast closed 3 years ago

rrsettgast commented 3 years ago

Create a hostconfig for ORNL/Summit using either XL/cuda or gcc/cuda.

corbett5 commented 3 years ago

I got my account, and was able to access Summit. I'll start building when I get back from vacation next week. Two questions: Do we need to use the HYPRE+CUDA ability for the LBL problem? Do we need to use any implicit solver?

Not building with HYPRE+CUDA, and using the system BLAS/LAPACK instead of ESSL would definitely make things quicker.

rrsettgast commented 3 years ago

> I got my account, and was able to access Summit. I'll start building when I get back from vacation next week. Two questions: Do we need to use the HYPRE+CUDA ability for the LBL problem?

Yes. But we can start without HYPRE+CUDA and just execute the linear solver on host.

> Do we need to use any implicit solver?

Yes. We definitely do need to run quasi-static mechanics.

> Not building with HYPRE+CUDA, and using the system BLAS/LAPACK instead of ESSL would definitely make things quicker.

Maybe a two-step process: just get it running with Trilinos on host, then once we get the build to @tjligocki to play with, we can get HYPRE+CUDA enabled.

corbett5 commented 3 years ago

Successfully built LvArray with GCC 10.2.0 and CUDA 11.2.0. I tried using GCC 8.1.1 and CUDA 10.1.243 but it broke. GCC 7.4.0 worked but we had problems in GEOSX with it. To get GCC 10.2.0 to work I had to use CUDA 11.2.0.

rrsettgast commented 3 years ago

What was the error with gcc8? @tjligocki what compiler combo are you using?

corbett5 commented 3 years ago

A CMake error when trying to verify nvcc. Something about __float128 not being defined even though the -mno-float128 flag was being passed. Something like that; I didn't spend too much time thinking about it once 7 and 10 worked.

rrsettgast commented 3 years ago

I think that was a problem when we built on Ascent. Did you look at that hostconfig? https://github.com/GEOSX/LvArray/blob/develop/host-configs/ascent-gcc%408.1.1.cmake @klevzoff @wrtobin Any suggestions?

corbett5 commented 3 years ago

I mean I'm just going to keep moving forward with GCC 10.2 and CUDA 11.2, I should find out soon enough if it doesn't work.

klevzoff commented 3 years ago

Yeah at the time just adding -std=c++11 -Xcompiler -mno-float128 to RAJA seemed to have fixed the issue: https://github.com/GEOSX/thirdPartyLibs/pull/104/files . Also discussion in https://github.com/LLNL/blt/issues/341 . I don't know where things went from there, but CUDA 11 shouldn't have this problem.

corbett5 commented 3 years ago

I got the unit tests passing with GCC 10.2, CUDA 11.2 and Hypre (but not Hypre-CUDA). I'll try the integrated tests tomorrow, my hopes aren't high.

corbett5 commented 3 years ago

Now I'm stuck trying to install h5py (required for the integrated tests restart check). It doesn't come pre-installed and I'm having trouble pip installing it from a virtual environment. This was the approach I took on Cori and it worked fine, so something funny is up. I may have to try building it with Spack. Currently we don't build h5py even when building pygeosx, but that should work.

corbett5 commented 3 years ago

h5py builds with Spack but then it segfaults when importing. On Lassen I also get an import error, but not a segfault.
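For anyone trying to reproduce this, one way to tell a segfault from an ordinary import error is to run the import in a child interpreter and look at the return code. This is only a minimal sketch: `classify_import` is a hypothetical helper, not something in GEOSX or geosxats, and it assumes a POSIX system and a Python 3 interpreter.

```python
import subprocess
import sys


def classify_import(module, python=sys.executable):
    """Try `import module` in a child interpreter and report how it ended.

    Hypothetical helper for illustration only; not part of GEOSX.
    """
    result = subprocess.run(
        [python, "-c", "import " + module],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if result.returncode == 0:
        return "ok"
    if result.returncode < 0:
        # On POSIX a negative return code means the child was killed by a
        # signal, for example -11 for SIGSEGV.
        return "killed by signal %d" % -result.returncode
    # A positive code is a normal Python failure such as ImportError.
    return "import error"


if __name__ == "__main__":
    print(classify_import("h5py"))
```

Passing a different interpreter path as `python` lets the same check be run against each candidate Python installation.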

corbett5 commented 3 years ago

Ok I figured it out. One of the python modules has h5py installed, but it's Python 3. So I had to make some changes to let geosxats run with Python 2 but let restartcheck.py run with Python 3.
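Roughly, the idea is that the driver stays on one interpreter and launches the restart check as a separate process under another. The sketch below is an assumption about the general shape, not the actual geosxats change; the function names and any paths passed to them are hypothetical. `subprocess.call` is used because it exists in both Python 2 and Python 3.

```python
import subprocess
import sys


def build_command(interpreter, script, args):
    """Return the argv list that runs `script` under `interpreter`.

    Hypothetical helper; the real geosxats change may look different.
    """
    return [interpreter, script] + list(args)


def run_with_interpreter(interpreter, script, args):
    """Run the script in a separate interpreter and return its exit code."""
    return subprocess.call(build_command(interpreter, script, args))
```

The caller would pass the path of the Python 3 executable that has h5py, the path to restartcheck.py, and whatever arguments that script expects.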

corbett5 commented 3 years ago

When run on a single node, the following tests fail in addition to the tests that are currently failing on Lassen:

Sneddon_conforming01
Sneddon_conforming04
KGD_ZeroToughness_01
KGD_ZeroViscosity_01
co2_hybrid_1d_01
co2_hybrid_1d_02
co2_hybrid_1d_03
co2_flux_3d_01
Sneddon_01
fractureMatrixFlow_Embedded2d_01

corbett5 commented 3 years ago

When I grab enough nodes to run all the tests the only new one that fails is co2_flux_3d_08.
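The comparison against Lassen is just a set difference between the two lists of failing tests. A minimal sketch, assuming the failures are available as plain test names: `new_failures` is a hypothetical helper, the Lassen entry is a placeholder because those failures are not listed in this thread, and only a few of the Summit single-node names are copied in.

```python
def new_failures(failed_here, failed_on_baseline):
    """Return tests that fail here but not on the baseline machine, sorted."""
    return sorted(set(failed_here) - set(failed_on_baseline))


# Placeholder: the tests already failing on Lassen are not listed in this thread.
lassen_failed = {"placeholder_test_failing_on_lassen"}

# A few of the single-node Summit failures reported in this thread, plus the
# placeholder standing in for the failures shared with Lassen.
summit_failed = lassen_failed | {
    "Sneddon_conforming01",
    "KGD_ZeroToughness_01",
    "co2_flux_3d_01",
}

print(new_failures(summit_failed, lassen_failed))
```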