Open bch0w opened 1 month ago
Is this a CUDA-aware OpenMPI library you're using?
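One way to answer that question on the cluster is to ask Open MPI itself. The sketch below is not part of SPECFEM; it assumes an Open MPI installation whose `ompi_info` tool is on the `PATH`, and it relies on the `mpi_built_with_cuda_support` build flag that `ompi_info --parsable --all` reports. The function names are hypothetical helpers chosen for illustration.

```python
import shutil
import subprocess

# Key that Open MPI's `ompi_info --parsable --all` prints for its CUDA build flag.
CUDA_KEY = "mpi_built_with_cuda_support:value:"


def parse_cuda_support(ompi_info_output):
    """Return True/False if the CUDA build flag is reported, or None if it is absent."""
    for line in ompi_info_output.splitlines():
        if CUDA_KEY in line:
            # The value is the last colon-separated field, e.g. "...:value:true".
            return line.strip().rsplit(":", 1)[-1].lower() == "true"
    return None


def open_mpi_is_cuda_aware():
    """Run ompi_info (if available) and report whether the build is CUDA-aware."""
    if shutil.which("ompi_info") is None:
        return None  # no Open MPI tools on PATH, so nothing can be concluded
    result = subprocess.run(
        ["ompi_info", "--parsable", "--all"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_cuda_support(result.stdout)
```

Calling `open_mpi_is_cuda_aware()` on a login or compute node with the same modules loaded as the failing job would show whether the library in use was built with CUDA support. A `None` result only means the flag could not be read.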
SPECFEM3D_Cartesian has no CUDA-aware MPI support, so I wouldn't try to use it with a CUDA-aware library; just use a plain and simple MPI library. CUDA-aware code needs to set the GPU IDs before initializing MPI. This is not done in the Cartesian version; only the globe version has this feature. So, if the library expects GPU IDs to be set before MPI initialization is called, then the SPECFEM runs will probably behave erratically. That might also explain why it already fails when running the mesher, which has no GPU support anyway.
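The ordering described above (choose the GPU first, initialize MPI second) can be illustrated with a minimal sketch. This is not the SPECFEM3D_GLOBE implementation. It assumes the job is launched with Open MPI's `mpirun`, which sets `OMPI_COMM_WORLD_LOCAL_RANK` for each process, and it pins the process to a device through `CUDA_VISIBLE_DEVICES`. The function name, the `gpus_per_node` parameter, and the fallback to local rank 0 are hypothetical choices made for illustration.

```python
import os


def select_gpu_before_mpi_init(gpus_per_node, environ=None):
    """Pick a GPU from the launcher-provided local rank and pin this process to it.

    Must be called before MPI is initialized and before any CUDA context exists.
    """
    env = os.environ if environ is None else environ
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    # Open MPI's mpirun exports the node-local rank; default to 0 (assumption)
    # when the variable is missing, e.g. in a serial test run.
    local_rank = int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    # Round-robin mapping of local ranks onto the GPUs of one node.
    device_id = local_rank % gpus_per_node
    # Restrict the CUDA runtime of this process to the chosen device.
    env["CUDA_VISIBLE_DEVICES"] = str(device_id)
    return device_id


# Intended order in a launcher script (sketch):
#   device = select_gpu_before_mpi_init(gpus_per_node=4)  # 4 is a placeholder
#   ... only afterwards initialize MPI ...
```

A code base that calls MPI initialization before any step like this leaves the device choice to the library, which is the situation described for the Cartesian version.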
Background
Interesting problem here, hoping someone can help (@danielpeter?). I'm working with @ykane and his student to run inversions with SeisFlows on their GPU cluster in Japan, and we are encountering a segmentation fault issue from xmeshfem3D that occurs when we try to run multiple, separate forward simulations at the same time. The error message points to OMPI CUDA but I'm a bit stumped about what is going on at a deeper level.
Setup:
xmeshfem3D
output_mesher.txt file is made, this job fails directly on process startup
Current Thinking
This issue seems to be isolated to the case where multiple jobs are submitted together. I thought it might be competition for shared resources, so I tried to reduce the problem size (database files < 1 GB, 1 GPU per simulation), but with no luck.
I'm also not sure if this is a SPECFEM or a cluster issue. I will also be opening a ticket with their help desk to see if they have any ideas.
When googling this specific error message I'm seeing some possibly relevant forum topics:
https://users.open-mpi.narkive.com/KdMDfEeA/ompi-segfault-with-mpi-cuda-on-multiple-nodes
https://users.open-mpi.narkive.com/Nd99QZhy/ompi-segmentation-fault-address-not-mapped
I'm hitting the limit of my knowledge, so I'm hoping to shop this out to the community and see if someone has seen this before or knows what might be going on. Happy to provide more details, Par_files, or log files if needed. Thanks!
Dependency Versions
Affected SPECFEM3D version
SPECFEM3D 4.1.0; bf45798
Your software and hardware environment
UTokyo Wisteria GPU Cluster; Nvidia A100; RHEL8.6
OS
Linux