LLNL / serac

Serac is a high order nonlinear thermomechanical simulation code
BSD 3-Clause "New" or "Revised" License

CUDA Docker Container #1141

Closed: chapman39 closed this 2 months ago

chapman39 commented 3 months ago

This PR adds a new docker container with CUDA, so we can test GPU support in Azure. Due to space limitations of Azure VMs, TPLs are built with +shared and serac is built with -DBUILD_SHARED_LIBS=ON for Docker containers.

Fixes #1117

chapman39 commented 2 months ago

@samuelpmishLLNL @white238 I'm getting a weird NVCC error. It's complaining about a null pointer dereference in the stdlib.

https://dev.azure.com/llnl-serac/serac/_build/results?buildId=12183&view=logs&j=6e1d03e6-cc5b-563e-720e-c51be027141d&t=f6d84462-d4c0-58c0-06af-1f9ba7b544b8&l=1563

In file included from /usr/include/c++/12/functional:59,
                 from /home/serac/serac/cmake/blt/thirdparty_builtin/googletest/googletest/include/gtest/gtest-matchers.h:43,
                 from /home/serac/serac/cmake/blt/thirdparty_builtin/googletest/googletest/include/gtest/internal/gtest-death-test-internal.h:47,
                 from /home/serac/serac/cmake/blt/thirdparty_builtin/googletest/googletest/include/gtest/gtest-death-test.h:43,
                 from /home/serac/serac/cmake/blt/thirdparty_builtin/googletest/googletest/include/gtest/gtest.h:65,
                 from /home/serac/serac/src/serac/physics/tests/thermal_finite_diff.cpp:10:
In member function ‘bool std::_Function_base::_M_empty() const’,
    inlined from ‘_Res std::function<_Res(_ArgTypes ...)>::operator()(_ArgTypes ...) const [with _Res = serac::tuple<mfem::Vector&, serac::Functional<serac::H1<1>(serac::H1<1, 2>, serac::H1<1>, serac::H1<1>), serac::ExecutionSpace::CPU>::Gradient&>; _ArgTypes = {double}]’ at /usr/include/c++/12/bits/std_function.h:589:14,
    inlined from ‘serac::FiniteElementDual& serac::HeatTransfer<order, dim, serac::Parameters<parameter_space ...>, std::integer_sequence<int, parameter_indices ...> >::computeTimestepSensitivity(size_t) [with int order = 1; int dim = 2; parameter_space = {}; int ...parameter_indices = {}]’ at /home/serac/serac/src/serac/infrastructure/../../serac/physics/heat_transfer.hpp:964:78:
/usr/include/c++/12/bits/std_function.h:247:37: error: null pointer dereference [-Werror=null-dereference]
  247 |     bool _M_empty() const { return !_M_manager; }
      |                                     ^~~~~~~~~~
cc1plus: all warnings being treated as errors
make[2]: *** [src/serac/physics/tests/CMakeFiles/thermal_finite_diff.dir/build.make:76: src/serac/physics/tests/CMakeFiles/thermal_finite_diff.dir/thermal_finite_diff.cpp.o] Error 1
make[2]: Leaving directory '/home/serac/serac/_serac_build_and_test_2024_07_23_15_30_44/build-gcc@12.3.0_cuda'
make[1]: *** [CMakeFiles/Makefile2:3058: src/serac/physics/tests/CMakeFiles/thermal_finite_diff.dir/all] Error 2

white238 commented 2 months ago

My recommendation is to turn off warnings-as-errors for this build and log an issue. Offhand, I can't tell where that is actually coming from without a deeper look.
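
For whoever picks up the follow-up issue, here is a minimal sketch of two narrower alternatives to disabling warnings-as-errors for the whole build. It assumes the diagnostic comes from GCC inlining `std::function::operator()` into the caller, as the log suggests. `Callback`, `call_checked`, and `call_suppressed` are hypothetical names for illustration and are not taken from `heat_transfer.hpp`. The pragma approach is not guaranteed to work here: warnings raised after inlining can be attributed to the standard library header, where a pragma around the call site may have no effect.

```cpp
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for a callable stored as a class member.
using Callback = std::function<double(double)>;

// Option 1: test the std::function explicitly before invoking it, so the
// empty case is handled by our own code path instead of inside the stdlib.
inline double call_checked(const Callback& f, double t)
{
  if (!f) {
    throw std::runtime_error("callback was never assigned");
  }
  return f(t);
}

// Option 2: silence -Wnull-dereference only around one function instead of
// dropping -Werror for the whole target. Both GCC and Clang understand the
// "GCC diagnostic" pragmas.
#if defined(__GNUC__)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wnull-dereference"
#endif
inline double call_suppressed(const Callback& f, double t)
{
  // Throws std::bad_function_call if f is empty.
  return f(t);
}
#if defined(__GNUC__)
#pragma GCC diagnostic pop
#endif
```

Option 1 changes behavior only for an empty callable (a different exception type is thrown), so it is probably the safer of the two to try first.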

chapman39 commented 2 months ago

On the codevelop Azure pipelines built with BUILD_SHARED_LIBS=ON, I got some errors while running the Serac tests:

https://dev.azure.com/llnl-serac/serac/_build/results?buildId=12196&view=logs&j=6120c41f-dd84-5658-817e-72df38d78194&t=c1c85309-f09f-566e-0fcd-ac74c434141f&l=2675

5: [ERROR in line 974 of file /home/serac/serac/src/serac/numerics/equation_solver.cpp]
5: MESSAGE=AMGX requested in non-GPU build

and

7: HDF5-DIAG: Error detected in HDF5 (1.8.23) thread 0:
7:   #000: /home/serac/serac_tpls/build_stage/spack-stage-hdf5-1.8.23-chknnutjvno3stawugxsj4u4a6nci7s6/spack-src/src/H5L.c line 1131 in H5Literate(): link iteration failed
7:     major: Symbol table
7:     minor: Iteration failed
7:   #001: /home/serac/serac_tpls/build_stage/spack-stage-hdf5-1.8.23-chknnutjvno3stawugxsj4u4a6nci7s6/spack-src/src/H5Gint.c line 812 in H5G_iterate(): error iterating over links
7:     major: Symbol table
7:     minor: Iteration failed
7:   #002: /home/serac/serac_tpls/build_stage/spack-stage-hdf5-1.8.23-chknnutjvno3stawugxsj4u4a6nci7s6/spack-src/src/H5Gobj.c line 661 in H5G__obj_iterate(): can't iterate over dense links
7:     major: Symbol table
7:     minor: Iteration failed
7:   #003: /home/serac/serac_tpls/build_stage/spack-stage-hdf5-1.8.23-chknnutjvno3stawugxsj4u4a6nci7s6/spack-src/src/H5Gdense.c line 1020 in H5G__dense_iterate(): iteration operator failed
7:     major: Symbol table
7:     minor: Can't move to next iterator location
7:   #004: /home/serac/serac_tpls/build_stage/spack-stage-hdf5-1.8.23-chknnutjvno3stawugxsj4u4a6nci7s6/spack-src/src/H5Glink.c line 478 in H5G__link_iterate_table(): iteration operator failed
7:     major: Symbol table
7:     minor: Can't move to next iterator location

among others. I had trouble figuring out why. The AMGX one is especially weird, since I checked CMakeCache.txt and MFEM_USE_AMGX is false. For now, I simply turned shared builds off for codevelop, but we might want to figure this out.
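
As a starting point for debugging, here is a sketch of one common pattern that produces a message like the AMGX one. It is an assumption about the general shape of such guards, not a reading of `equation_solver.cpp`. `SKETCH_USE_AMGX`, `Preconditioner`, and `build_preconditioner` are hypothetical names. If the real code follows this pattern, the error is consistent with `MFEM_USE_AMGX` being false, and the open question becomes why the test's solver options requested AMGX at all.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical option enum; the real solver options may look different.
enum class Preconditioner { HypreAMG, AMGX };

// SKETCH_USE_AMGX stands in for a configure-time macro such as MFEM_USE_AMGX.
// When it is not defined, the only code compiled for the AMGX case is the
// error path, so the message appears whenever a caller asks for AMGX.
inline std::string build_preconditioner(Preconditioner p)
{
  switch (p) {
    case Preconditioner::HypreAMG:
      return "hypre";
    case Preconditioner::AMGX:
#ifdef SKETCH_USE_AMGX
      return "amgx";
#else
      throw std::runtime_error("AMGX requested in non-GPU build");
#endif
  }
  throw std::logic_error("unknown preconditioner");
}
```

Under this assumption, it may be worth checking which option values the failing test passes when shared libraries are enabled, rather than the CMake cache alone.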

white238 commented 2 months ago

Thanks for sticking with this, @chapman39! I know it was no minor feat.