SPECFEM / specfem3d

SPECFEM3D_Cartesian simulates acoustic (fluid), elastic (solid), coupled acoustic/elastic, poroelastic or seismic wave propagation in any type of conforming mesh of hexahedra (structured or not).

🐛 [BUG] - segfault when running multiple forward simulations on GPUs #1726

Open · bch0w opened this issue 1 month ago

bch0w commented 1 month ago

Background

Interesting problem here, hoping someone can help (@danielpeter?). I'm working with @ykane and his student to run inversions with SeisFlows on their GPU cluster in Japan, and we are encountering a segmentation fault from xmeshfem3D that occurs when we try to run multiple, separate forward simulations at the same time. The error message points to OpenMPI CUDA, but I'm a bit stumped about what is going on at a deeper level.

Setup:

  1. Compile the SPECFEM3D GPU version; forward simulations run successfully with any number of GPUs (the cluster has 45 nodes with 8 A100s each, each GPU with 56 GB of memory).
  2. Set up SeisFlows to run multiple forward simulations in parallel for an inversion. This setup has been working with the CPU version of the code, but we recently switched to GPUs for the speedup.
  3. Attempt to run 3 forward simulations simultaneously with SeisFlows (really any number N >= 3, but 3 seems to be the minimum needed to trigger the problem; see the launch sketch after this list).
  4. At least one of the submitted jobs fails with the following error message during startup of xmeshfem3D:
[wa34:00095] *** Process received signal ***
[wa34:00095] Signal: Segmentation fault (11)
[wa34:00095] Signal code: Address not mapped (1)
[wa34:00095] Failing at address: 0x4a82bc001108
[wa34:00095] [ 0] /usr/lib64/libpthread.so.0(+0x12ce0)[0x148f82701ce0]
[wa34:00095] [ 1] /work/opt/local/x86_64/apps/gcc/8.3.1/openmpi-cuda/4.1.5-12.2/lib/libopen-pal.so.40(+0xaa54f)[0x148f8399154f]
[wa34:00095] [ 2] /work/opt/local/x86_64/apps/gcc/8.3.1/openmpi-cuda/4.1.5-12.2/lib/libopen-pal.so.40(+0xacb06)[0x148f83993b06]
[wa34:00095] [ 3] /work/opt/local/x86_64/apps/gcc/8.3.1/openmpi-cuda/4.1.5-12.2/lib/libopen-pal.so.40(opal_hwloc201_hwloc_shmem_topology_write+0xd7)[0x148f8396f707]
[wa34:00095] [ 4] /work/opt/local/x86_64/apps/gcc/8.3.1/openmpi-cuda/4.1.5-12.2/lib/openmpi/mca_rtc_hwloc.so(+0x2937)[0x148f76bcb937]
[wa34:00095] [ 5] /work/opt/local/x86_64/apps/gcc/8.3.1/openmpi-cuda/4.1.5-12.2/lib/libopen-rte.so.40(orte_rtc_base_select+0xda)[0x148f83c63afa]
[wa34:00095] [ 6] /work/opt/local/x86_64/apps/gcc/8.3.1/openmpi-cuda/4.1.5-12.2/lib/openmpi/mca_ess_hnp.so(+0x4e17)[0x148f8082de17]
[wa34:00095] [ 7] /work/opt/local/x86_64/apps/gcc/8.3.1/openmpi-cuda/4.1.5-12.2/lib/libopen-rte.so.40(orte_init+0x2ae)[0x148f83c6b93e]
[wa34:00095] [ 8] /work/opt/local/x86_64/apps/gcc/8.3.1/openmpi-cuda/4.1.5-12.2/lib/libopen-rte.so.40(orte_submit_init+0x8f0)[0x148f83c1d5f0]
[wa34:00095] [ 9] mpiexec[0x400eaf]
[wa34:00095] [10] /usr/lib64/libc.so.6(__libc_start_main+0xf3)[0x148f82364ca3]
[wa34:00095] [11] mpiexec[0x400d3e]
[wa34:00095] *** End of error message ***
  5. No output_mesher.txt file is written; the job fails directly at process startup.
  6. If I rerun the failed job manually, it runs fine.
  7. If I repeat the process, a different event may fail, which points to a possible race condition rather than an event-specific issue.
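
To make the failure mode concrete, here's a minimal launch sketch. The run directories, binary path, and mpiexec options are illustrative assumptions rather than our actual SeisFlows submission commands; the point is simply that several independent mpiexec instances of xmeshfem3D start at the same time, one per event.

```python
# Illustrative reproduction sketch only: run directories, binary path and
# mpiexec options are assumptions, not the actual SeisFlows submission.
import subprocess

N_SIMULTANEOUS = 3   # roughly the minimum number at which a job segfaults
NPROC = 1            # reduced test case: 1 MPI rank / 1 GPU per simulation

procs = []
for i in range(N_SIMULTANEOUS):
    run_dir = f"run_event_{i:02d}"   # hypothetical per-event SPECFEM run directory
    cmd = ["mpiexec", "-np", str(NPROC), "./bin/xmeshfem3D"]
    procs.append(subprocess.Popen(cmd, cwd=run_dir))

# With several concurrent mpiexec startups, at least one tends to die during
# orte/hwloc initialization; rerunning the failed one by itself works fine.
exit_codes = [p.wait() for p in procs]
print("exit codes:", exit_codes)
```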

Current Thinking

This issue seems to be isolated to the case where multiple jobs are submitted together. I thought it might be contention for shared resources, so I tried reducing the problem size (database files < 1 GB, 1 GPU per simulation), but with no luck.

I'm also not sure whether this is a SPECFEM issue or a cluster issue. I will be opening a ticket with the cluster's help desk as well to see if they have any ideas.

Googling this specific error message turns up some possibly relevant forum threads:
https://users.open-mpi.narkive.com/KdMDfEeA/ompi-segfault-with-mpi-cuda-on-multiple-nodes
https://users.open-mpi.narkive.com/Nd99QZhy/ompi-segmentation-fault-address-not-mapped

I'm reaching the limits of my own knowledge here, so I'm hoping to put this to the community and see if someone has seen this before or knows what might be going on. Happy to provide more details, Par_files, or log files if needed. Thanks!

Dependency Versions

Affected SPECFEM3D version

SPECFEM3D 4.1.0; bf45798

Your software and hardware environment

UTokyo Wisteria GPU Cluster; Nvidia A100; RHEL8.6

OS

Linux

danielpeter commented 4 weeks ago

Is this a CUDA-aware OpenMPI library you're using?

SPECFEM3D_Cartesian has no CUDA-aware MPI support, so I wouldn't try to use it with a CUDA-aware library; use a plain and simple MPI library instead. CUDA-aware code needs to set the GPU IDs first, before initializing MPI. This is not done in the Cartesian version; only the globe version has this feature. So, if the library expects the GPU IDs to be set before MPI initialization is called, then the SPECFEM runs will probably behave erratically. That might also explain why it already fails when running the mesher, which has no GPU support anyway.
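
For readers unfamiliar with that ordering requirement, here is a minimal sketch of the general pattern (written with mpi4py purely for illustration; SPECFEM3D itself is Fortran/C, and only the globe version implements this): each rank picks its GPU from the launcher-provided local-rank variable before MPI is initialized. The OMPI_COMM_WORLD_LOCAL_RANK name is Open MPI's convention; other launchers use different variables.

```python
# Sketch of "choose the GPU before MPI_Init" (illustrative only, not SPECFEM3D code).
import os

# Open MPI's launcher exports the node-local rank before MPI is initialized.
local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))

# Restrict this process to a single GPU *before* MPI starts, so a CUDA-aware
# library already sees the intended device during its initialization.
os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)

from mpi4py import MPI   # importing mpi4py's MPI module triggers MPI_Init

rank = MPI.COMM_WORLD.Get_rank()
print(f"rank {rank}: GPU selection done before MPI_Init (local rank {local_rank})")
```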