Closed: drmichaeltcvx closed this issue 9 months ago
Running the smaller MAELSTROM/usecases/francois/SPE10/flow
case on an 8x A100 40GB GPU node as
mpirun --hostfile ./gpnpusc500000x.hosttab -x LD_LIBRARY_PATH -x V -x GPUMPICLI -x MPI -x MPIVER -x N_gpus -x GPU_cpu_aff_path -x GPU_mem_aff_path --np 8 --map-by ppr:8:node:PE=12 /home/mtml/cs691/utils/bin/map_ranks_gpus.sh /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/bin/geosx -i ./SPE10_small.xml -t runtime-report,max_column_
still crashes in Umpire:
...
terminate called after throwing an instance of 'umpire::runtime_error'
terminate called after throwing an instance of 'umpire::runtime_error'
what(): ! Umpire runtime_error [/dev/shm/mtml/src/GEOS/thirdPartyLibs/build-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/chai/src/chai/src/tpl/umpire/src/umpire/alloc/CudaPinnedAllocator.hpp:43]: cudaFreeHost( ptr = 0x2b688e000000 ) failed with error: an illegal memory access was encountered
Backtrace: 13 frames
0 0x2b672e279f55 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN6umpire4util49_GLOBAL__N__25f8fd63_16_ArrayManager_cpp_ab41d17d15build_backtraceEv+0x35) [0x2b672e279f55]
1 0x2b672e27c514 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZNK6umpire13runtime_error7messageB5cxx11Ev+0x44) [0x2b672e27c514]
2 0x2b672e29c35f No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN6umpire13runtime_errorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES8_i+0x8f) [0x2b672e29c35f]
3 0x2b672e2ac8e1 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN6umpire5alloc19CudaPinnedAllocator10deallocateEPv+0x351) [0x2b672e2ac8e1]
4 0x2b6728edbf86 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN6umpire9Allocator10deallocateEPv+0x226) [0x2b6728edbf86]
5 0x2b6728ed1c21 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN4geos15DomainPartitionD2Ev+0x61) [0x2b6728ed1c21]
6 0x2b6728ed1dc9 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN4geos15DomainPartitionD0Ev+0x9) [0x2b6728ed1dc9]
7 0x2b67285956a2 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN4geos14dataRepository5GroupD1Ev+0x8f2) [0x2b67285956a2]
8 0x2b672dd7fbb9 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN4geos14ProblemManagerD0Ev+0x9) [0x2b672dd7fbb9]
9 0x2b672dd7c056 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN4geos10GeosxStateD2Ev+0x3f6) [0x2b672dd7c056]
10 0x40d2dc No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/bin/geosx() [0x40d2dc]
11 0x2b6795c3c555 No dladdr: /lib64/libc.so.6(__libc_start_main+0xf5) [0x2b6795c3c555]
12 0x40e25e No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/bin/geosx() [0x40e25e]
what(): ! Umpire runtime_error [/dev/shm/mtml/src/GEOS/thirdPartyLibs/build-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/chai/src/chai/src/tpl/umpire/src/umpire/alloc/CudaPinnedAllocator.hpp:43]: cudaFreeHost( ptr = 0x2b5534000000 ) failed with error: an illegal memory access was encountered
Backtrace: 13 frames
0 0x2b53d35c8f55 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN6umpire4util49_GLOBAL__N__25f8fd63_16_ArrayManager_cpp_ab41d17d15build_backtraceEv+0x35) [0x2b53d35c8f55]
1 0x2b53d35cb514 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZNK6umpire13runtime_error7messageB5cxx11Ev+0x44) [0x2b53d35cb514]
2 0x2b53d35eb35f No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN6umpire13runtime_errorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES8_i+0x8f) [0x2b53d35eb35f]
3 0x2b53d35fb8e1 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN6umpire5alloc19CudaPinnedAllocator10deallocateEPv+0x351) [0x2b53d35fb8e1]
4 0x2b53ce22af86 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN6umpire9Allocator10deallocateEPv+0x226) [0x2b53ce22af86]
5 0x2b53ce220c21 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN4geos15DomainPartitionD2Ev+0x61) [0x2b53ce220c21]
6 0x2b53ce220dc9 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN4geos15DomainPartitionD0Ev+0x9) [0x2b53ce220dc9]
7 0x2b53cd8e46a2 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN4geos14dataRepository5GroupD1Ev+0x8f2) [0x2b53cd8e46a2]
8 0x2b53d30cebb9 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN4geos14ProblemManagerD0Ev+0x9) [0x2b53d30cebb9]
9 0x2b53d30cb056 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN4geos10GeosxStateD2Ev+0x3f6) [0x2b53d30cb056]
10 0x40d2dc No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/bin/geosx() [0x40d2dc]
11 0x2b543af8b555 No dladdr: /lib64/libc.so.6(__libc_start_main+0xf5) [0x2b543af8b555]
12 0x40e25e No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/bin/geosx() [0x40e25e]
terminate called after throwing an instance of 'umpire::runtime_error'
terminate called after throwing an instance of 'umpire::runtime_error'
terminate called after throwing an instance of ' what(): ! Umpire runtime_error [/dev/shm/mtml/src/GEOS/thirdPartyLibs/build-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/chai/src/chai/src/tpl/umpire/src/umpire/alloc/CudaPinnedAllocator.hpp:43]: cudaFreeHost( ptr = 0x2b57b0000000 ) failed with error: an illegal memory access was encountered
Backtrace: 13 frames
0 0x2b564b056f55 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN6umpire4util49_GLOBAL__N__25f8fd63_16_ArrayManager_cpp_ab41d17d15build_backtraceEv+0x35) [0x2b564b056f55]
1 0x2b564b059514 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZNK6umpire13runtime_error7messageB5cxx11Ev+0x44) [0x2b564b059514]
2 0x2b564b07935f No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN6umpire13runtime_errorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES8_i+0x8f) [0x2b564b07935f]
3 0x2b564b0898e1 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN6umpire5alloc19CudaPinnedAllocator10deallocateEPv+0x351) [0x2b564b0898e1]
4 0x2b5645cb8f86 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN6umpire9Allocator10deallocateEPv+0x226) [0x2b5645cb8f86]
5 0x2b5645caec21 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN4geos15DomainPartitionD2Ev+0x61) [0x2b5645caec21]
6 0x2b5645caedc9 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN4geos15DomainPartitionD0Ev+0x9) [0x2b5645caedc9]
7 0x2b56453726a2 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN4geos14dataRepository5GroupD1Ev+0x8f2) [0x2b56453726a2]
8 0x2b564ab5cbb9 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN4geos14ProblemManagerD0Ev+0x9) [0x2b564ab5cbb9]
9 0x2b564ab59056 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN4geos10GeosxStateD2Ev+0x3f6) [0x2b564ab59056]
10 0x40d2dc No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/bin/geosx() [0x40d2dc]
11 0x2b56b2a19555 No dladdr: /lib64/libc.so.6(__libc_start_main+0xf5) [0x2b56b2a19555]
12 0x40e25e No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/bin/geosx() [0x40e25e]
what(): ! Umpire runtime_error [/dev/shm/mtml/src/GEOS/thirdPartyLibs/build-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/chai/src/chai/src/tpl/umpire/src/umpire/alloc/CudaPinnedAllocator.hpp:43]: cudaFreeHost( ptr = 0x2aaddc000000 ) failed with error: an illegal memory access was encountered
Backtrace: 13 frames
0 0x2aac799a3f55 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN6umpire4util49_GLOBAL__N__25f8fd63_16_ArrayManager_cpp_ab41d17d15build_backtraceEv+0x35) [0x2aac799a3f55]
1 0x2aac799a6514 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZNK6umpire13runtime_error7messageB5cxx11Ev+0x44) [0x2aac799a6514]
2 0x2aac799c635f No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN6umpire13runtime_errorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES8_i+0x8f) [0x2aac799c635f]
3 0x2aac799d68e1 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN6umpire5alloc19CudaPinnedAllocator10deallocateEPv+0x351) [0x2aac799d68e1]
4 0x2aac74605f86 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN6umpire9Allocator10deallocateEPv+0x226) [0x2aac74605f86]
5 0x2aac745fbc21 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN4geos15DomainPartitionD2Ev+0x61) [0x2aac745fbc21]
6 0x2aac745fbdc9 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN4geos15DomainPartitionD0Ev+0x9) [0x2aac745fbdc9]
7 0x2aac73cbf6a2 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN4geos14dataRepository5GroupD1Ev+0x8f2) [0x2aac73cbf6a2]
8 0x2aac794a9bb9 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN4geos14ProblemManagerD0Ev+0x9) [0x2aac794a9bb9]
9 0x2aac794a6056 No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/lib/libgeosx_core.so(_ZN4geos10GeosxStateD2Ev+0x3f6) [0x2aac794a6056]
10 0x40d2dc No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/bin/geosx() [0x40d2dc]
11 0x2aace1366555 No dladdr: /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aace1366555]
12 0x40e25e No dladdr: /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/bin/geosx() [0x40e25e]
umpire::runtime_error'
I can reproduce the same issue by running the SEAM CO2 case with a GPU build on Cypress.
terminate called after throwing an instance of 'umpire::runtime_error'
what(): ! Umpire runtime_error [/shared/data1/Users/j0551570/Compilation/Build_120523/thirdPartyLibs/build-cypress-GPU-gcc-std17-release/chai/src/chai/src/tpl/umpire/src/umpire/alloc/CudaPinnedAllocator.hpp:44]: cudaFreeHost( ptr = 0x7fef4b600000 ) failed with error: an illegal memory access was encountered
Backtrace: 12 frames
0 0x7feff05445c3 No dladdr: /shared/data1/Users/j0551570/Compilation/Build_120523/GEOS/build-cypress-GPU-gcc-std17-release/lib/libgeosx_core.so(_ZNK6umpire13runtime_error7messageB5cxx11Ev+0x43) [0x7feff05445c3]
1 0x7feff054503e No dladdr: /shared/data1/Users/j0551570/Compilation/Build_120523/GEOS/build-cypress-GPU-gcc-std17-release/lib/libgeosx_core.so(_ZN6umpire13runtime_errorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES8_i+0x8e) [0x7feff054503e]
2 0x7feff05554a5 No dladdr: /shared/data1/Users/j0551570/Compilation/Build_120523/GEOS/build-cypress-GPU-gcc-std17-release/lib/libgeosx_core.so(_ZN6umpire8resource21DefaultMemoryResourceINS_5alloc19CudaPinnedAllocatorEE10deallocateEPvm+0x705) [0x7feff05554a5]
3 0x7fefec2f82a3 No dladdr: /shared/data1/Users/j0551570/Compilation/Build_120523/GEOS/build-cypress-GPU-gcc-std17-release/lib/libgeosx_core.so(_ZN6umpire9Allocator10deallocateEPv+0x193) [0x7fefec2f82a3]
4 0x7fefec2eee62 No dladdr: /shared/data1/Users/j0551570/Compilation/Build_120523/GEOS/build-cypress-GPU-gcc-std17-release/lib/libgeosx_core.so(_ZN4geos15DomainPartitionD1Ev+0x62) [0x7fefec2eee62]
5 0x7fefec2eefc9 No dladdr: /shared/data1/Users/j0551570/Compilation/Build_120523/GEOS/build-cypress-GPU-gcc-std17-release/lib/libgeosx_core.so(_ZN4geos15DomainPartitionD0Ev+0x9) [0x7fefec2eefc9]
6 0x7fefebba71e2 No dladdr: /shared/data1/Users/j0551570/Compilation/Build_120523/GEOS/build-cypress-GPU-gcc-std17-release/lib/libgeosx_core.so(_ZN4geos14dataRepository5GroupD1Ev+0x822) [0x7fefebba71e2]
7 0x7feff01473f9 No dladdr: /shared/data1/Users/j0551570/Compilation/Build_120523/GEOS/build-cypress-GPU-gcc-std17-release/lib/libgeosx_core.so(_ZN4geos14ProblemManagerD0Ev+0x9) [0x7feff01473f9]
8 0x7feff0145156 No dladdr: /shared/data1/Users/j0551570/Compilation/Build_120523/GEOS/build-cypress-GPU-gcc-std17-release/lib/libgeosx_core.so(_ZN4geos10GeosxStateD1Ev+0x1f6) [0x7feff0145156]
9 0x40d003 No dladdr: /shared/data1/Users/j0551570/Compilation/Build_120523/GEOS/build-cypress-GPU-gcc-std17-release/bin/geosx() [0x40d003]
10 0x7fefe1b467b3 No dladdr: /lib64/libc.so.6(__libc_start_main+0xf3) [0x7fefe1b467b3]
11 0x40e1ae No dladdr: /shared/data1/Users/j0551570/Compilation/Build_120523/GEOS/build-cypress-GPU-gcc-std17-release/bin/geosx(_start+0x2e) [0x40e1ae]
terminate called after throwing an instance of 'umpire::runtime_error'
what(): ! Umpire runtime_error [/shared/data1/Users/j0551570/Compilation/Build_120523/thirdPartyLibs/build-cypress-GPU-gcc-std17-release/chai/src/chai/src/tpl/umpire/src/umpire/alloc/CudaPinnedAllocator.hpp:44]: cudaFreeHost( ptr = 0x7f4e5ad03000 ) failed with error: an illegal memory access was encountered
Backtrace: 12 frames
0 0x7f4effc6c5c3 No dladdr: /shared/data1/Users/j0551570/Compilation/Build_120523/GEOS/build-cypress-GPU-gcc-std17-release/lib/libgeosx_core.so(_ZNK6umpire13runtime_error7messageB5cxx11Ev+0x43) [0x7f4effc6c5c3]
1 0x7f4effc6d03e No dladdr: /shared/data1/Users/j0551570/Compilation/Build_120523/GEOS/build-cypress-GPU-gcc-std17-release/lib/libgeosx_core.so(_ZN6umpire13runtime_errorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES8_i+0x8e) [0x7f4effc6d03e]
2 0x7f4effc7d4a5 No dladdr: /shared/data1/Users/j0551570/Compilation/Build_120523/GEOS/build-cypress-GPU-gcc-std17-release/lib/libgeosx_core.so(_ZN6umpire8resource21DefaultMemoryResourceINS_5alloc19CudaPinnedAllocatorEE10deallocateEPvm+0x705) [0x7f4effc7d4a5]
3 0x7f4efba202a3 No dladdr: /shared/data1/Users/j0551570/Compilation/Build_120523/GEOS/build-cypress-GPU-gcc-std17-release/lib/libgeosx_core.so(_ZN6umpire9Allocator10deallocateEPv+0x193) [0x7f4efba202a3]
4 0x7f4efba16e62 No dladdr: /shared/data1/Users/j0551570/Compilation/Build_120523/GEOS/build-cypress-GPU-gcc-std17-release/lib/libgeosx_core.so(_ZN4geos15DomainPartitionD1Ev+0x62) [0x7f4efba16e62]
5 0x7f4efba16fc9 No dladdr: /shared/data1/Users/j0551570/Compilation/Build_120523/GEOS/build-cypress-GPU-gcc-std17-release/lib/libgeosx_core.so(_ZN4geos15DomainPartitionD0Ev+0x9) [0x7f4efba16fc9]
6 0x7f4efb2cf1e2 No dladdr: /shared/data1/Users/j0551570/Compilation/Build_120523/GEOS/build-cypress-GPU-gcc-std17-release/lib/libgeosx_core.so(_ZN4geos14dataRepository5GroupD1Ev+0x822) [0x7f4efb2cf1e2]
7 0x7f4eff86f3f9 No dladdr: /shared/data1/Users/j0551570/Compilation/Build_120523/GEOS/build-cypress-GPU-gcc-std17-release/lib/libgeosx_core.so(_ZN4geos14ProblemManagerD0Ev+0x9) [0x7f4eff86f3f9]
8 0x7f4eff86d156 No dladdr: /shared/data1/Users/j0551570/Compilation/Build_120523/GEOS/build-cypress-GPU-gcc-std17-release/lib/libgeosx_core.so(_ZN4geos10GeosxStateD1Ev+0x1f6) [0x7f4eff86d156]
9 0x40d003 No dladdr: /shared/data1/Users/j0551570/Compilation/Build_120523/GEOS/build-cypress-GPU-gcc-std17-release/bin/geosx() [0x40d003]
10 0x7f4ef126e7b3 No dladdr: /lib64/libc.so.6(__libc_start_main+0xf3) [0x7f4ef126e7b3]
11 0x40e1ae No dladdr: /shared/data1/Users/j0551570/Compilation/Build_120523/GEOS/build-cypress-GPU-gcc-std17-release/bin/geosx(_start+0x2e) [0x40e1ae]
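Because every rank prints the same 12 or 13 frames at different addresses, the logs above are easier to compare once reduced to the distinct symbols they contain. The following is a minimal sketch of that reduction, not part of the original report; the function name `unique_frames` is hypothetical, and it assumes that every mangled C++ symbol in the log starts with `_Z` and consists only of letters, digits, and underscores.

```shell
# Hypothetical helper: reads backtrace text on stdin and prints each distinct
# mangled C++ symbol once, in the order it first appears.
unique_frames() {
  # grep -o emits only the matched symbol; awk drops repeats from other ranks.
  grep -oE '_Z[A-Za-z0-9_]+' | awk '!seen[$0]++'
}
```

If the output is saved to a file (say `geosx.log`, a name assumed here), `unique_frames < geosx.log | c++filt` should print the readable call chain, provided binutils' `c++filt` is installed.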
Describe the bug
Running GPU GEOS with MAELSTROM/usecases/francois/SPE10/flow/ on a single 8x A100 80GB node crashes with the following Umpire messages:

To Reproduce
Steps to reproduce the behavior:
SPE10_refined.xml:
mpirun --hostfile ./gpnpusc600002f.hosttab -x LD_LIBRARY_PATH -x V -x GPUMPICLI -x MPI -x MPIVER -x N_gpus -x GPU_cpu_aff_path -x GPU_mem_aff_path --np 5 --map-by ppr:5:node:PE=19 /home/mtml/cs691/utils/bin/map_ranks_gpus.sh /data/saet/mtml/software/x86_64/RHEL7/GEOS/0.2.0/install-GPU-Hypre-GCC-CUDA_11.8-ompi_hpcx-OMP-relwithdebinfo/bin/geosx -i ./SPE10_refined.xml -t runtime-report,max_column_width=200,calc.inclusive,mpi-report -x 1 -y 5 -z 1
map_ranks_gpus.sh just selects a GPU unit for the current rank.
Minimal case: run the SPE10_refined.xml case.

Expected behavior
GPU GEOS is expected to run to completion.

Screenshots
This is from a run with 4 ranks and 1 OMP thread and 1 GPU per rank.
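The report does not include map_ranks_gpus.sh itself, so the following is only a sketch of what a wrapper that "just selects a GPU unit for the current rank" might look like. It is not the author's script. It assumes that Open MPI sets `OMPI_COMM_WORLD_LOCAL_RANK` for each launched process, that the `N_gpus` variable exported on the mpirun line holds the number of GPUs on the node, and that round-robin assignment is acceptable; the function name `pick_gpu` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a rank-to-GPU wrapper; NOT the author's map_ranks_gpus.sh.

# pick_gpu LOCAL_RANK N_GPUS: prints the device index for this rank (round-robin).
pick_gpu() {
  local local_rank=$1 n_gpus=$2
  echo $(( local_rank % n_gpus ))
}

# OMPI_COMM_WORLD_LOCAL_RANK is the node-local rank provided by Open MPI.
# N_gpus is ASSUMED to be the GPU count per node; both default to a single-GPU setup.
export CUDA_VISIBLE_DEVICES="$(pick_gpu "${OMPI_COMM_WORLD_LOCAL_RANK:-0}" "${N_gpus:-1}")"

# Replace this shell with the wrapped command (for example geosx and its arguments), if given.
if [ "$#" -gt 0 ]; then
  exec "$@"
fi
```

With a wrapper of this shape, each rank sees exactly one device, so the application can always use device 0 regardless of how many ranks share the node.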