ECP-WarpX / WarpX

WarpX is an advanced electromagnetic & electrostatic Particle-In-Cell code.
https://ecp-warpx.github.io

Crash in parallel simulations when flushing particle data on JUWELS Booster #2542

Open MaxThevenet opened 2 years ago

MaxThevenet commented 2 years ago

A production 3D simulation with openPMD output crashes at the first particle flush, when running in parallel. This input file is a reproducer showing the problem on a simplified setup. This is executed with the following submission script on the Juwels Booster. The CMake command and output can be found here, and I used the following profile file. The crash gave the following files: error.txt and Backtrace.

Note: the same run on a V100-equipped cluster (with the following CMake output) ran successfully.

ax3l commented 2 years ago

Thanks for the detailed report!

As discussed on Slack, we see this problem only on Juwels so far and the same run works on other clusters.

The backtrace indicates that the problem originates straight out of the MPI-I/O layer (ROMIO). That's a bit curious, because by default OpenMPI uses OMPIO as its I/O implementation instead of ROMIO, so something seems to be non-default on Juwels. I see that your profile, sourced in the submission script, has

# change the MPI-IO backend in OpenMPI from OMPIO to ROMIO (experimental)
#export OMPI_MCA_io=romio321

in it before the srun call. Let's make sure this line is commented out so that the default OMPIO implementation is used. OMPIO is pretty buggy itself, but I reported/fixed a series of HDF5 I/O related bugs in the past, and the OpenMPI 4.1.1 on Juwels should contain all those fixes: https://github.com/openPMD/openPMD-api/issues/446
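To double-check which I/O component this OpenMPI build actually selects, ompi_info can list the registered io components, and the environment can be inspected for overrides; a quick check (assuming the OpenMPI module providing ompi_info is loaded) could be:

# list the MCA io components (OMPIO vs. ROMIO) and their parameters
ompi_info --param io all --level 9 | grep -i -e ompio -e romio

# show whether the environment forces a particular io component
env | grep OMPI_MCA_io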

Another thing we discussed is to ask the cluster support for the newest version of HDF5 in the 1.10 series, i.e., providing the 1.10.8 release instead of HDF5 1.10.6. Cluster support could also run a few tests, e.g., with hdf5-iotest and ior, to check whether the MPI-I/O layer and the HDF5 implementation are generally in working condition.
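For reference, a minimal ior run that exercises collective MPI-I/O (and, if ior was built with HDF5 support, the HDF5 path as well) might look like the following; task counts, sizes, and the output path are placeholders:

# small collective MPI-IO write/read test on the parallel file system
srun -n 16 ior -a MPIIO -c -w -r -t 1m -b 64m -o $SCRATCH/ior_testfile

# same test through the HDF5 API, if supported by the ior build
srun -n 16 ior -a HDF5 -c -w -r -t 1m -b 64m -o $SCRATCH/ior_testfile.h5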

ax3l commented 2 years ago

If this works, then we should remove the hint to change this in our docs: https://github.com/ECP-WarpX/WarpX/blob/development/Docs/source/install/hpc/juwels.rst

I think we used this temporarily to work around another earlier issue on Juwels.

ax3l commented 2 years ago

OMPIO still errors (test from Maxence):

[jwb0065.juwels:28280] mca_sharedfp_sm_file_open: Error, unable to open file for mmap: /tmp/ompi.jwb0065.juwels.17674/jf.0/4086174241/openpmd_00010.h5_cid-5-28280.sm
[jwb0065.juwels:28279] mca_sharedfp_sm_file_open: Error, unable to open file for mmap: /tmp/ompi.jwb0065.juwels.17674/jf.0/4086174241/openpmd_00010.h5_cid-5-28280.sm
[jwb0065.juwels:28277] mca_sharedfp_sm_file_open: Error, unable to open file for mmap: /tmp/ompi.jwb0065.juwels.17674/jf.0/4086174241/openpmd_00010.h5_cid-5-28280.sm
[jwb0065.juwels:28278] mca_sharedfp_sm_file_open: Error, unable to open file for mmap: /tmp/ompi.jwb0065.juwels.17674/jf.0/4086174241/openpmd_00010.h5_cid-5-28280.sm
HDF5-DIAG: Error detected in HDF5 (1.10.6) MPI-process 2:
  #000: H5F.c line 444 in H5Fcreate(): unable to create file
    major: File accessibilty
HDF5-DIAG: Error detected in HDF5 (1.10.6) MPI-process 1:
HDF5-DIAG: Error detected in HDF5 (1.10.6) MPI-process 0:
  #000: H5F.c line 444 in H5Fcreate(): unable to create file
  #000: H5F.c line 444 in H5Fcreate(): unable to create file
    major: File accessibilty
    minor: Unable to open file
    major: File accessibilty
    minor: Unable to open file
HDF5-DIAG: Error detected in HDF5 (1.10.6) MPI-process 3:
  #000: H5F.c line 444 in H5Fcreate(): unable to create file
    major: File accessibilty
    minor: Unable to open file
  #001: H5Fint.c line 1498 in H5F_open(): unable to open file: time = Thu Nov 11 18:41:15 2021
, name = 'diags/injection/openpmd_00010.h5', tent_flags = 13
    major: File accessibilty
    minor: Unable to open file
  #001: H5Fint.c line 1498 in H5F_open(): unable to open file: time = Thu Nov 11 18:41:15 2021
, name = 'diags/injection/openpmd_00010.h5', tent_flags = 13
ax3l commented 2 years ago

Let's see if we can work around this via:

export OMPI_MCA_io=ompio
export HDF5_USE_FILE_LOCKING=FALSE

Update: same error.

ax3l commented 2 years ago

The first errors:

mca_sharedfp_sm_file_open: Error, unable to open file for mmap

point to a ulimit issue: https://github.com/open-mpi/ompi/issues/4336

Update: ulimit -n returns 524288 (pretty good), and the above-linked issue is fixed in OpenMPI 4.1.1.
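For completeness, the limit is worth checking inside an actual allocation as well, since compute nodes can be configured differently from the login nodes:

# soft and hard open-file limits as seen on a compute node
srun -n 1 bash -c 'ulimit -Sn; ulimit -Hn'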

ax3l commented 2 years ago

So the OMPIO problem here seems to come down to a failure to open temporary files for mmap (memory-mapped file I/O) on Juwels: https://github.com/open-mpi/ompi/blob/v4.1.1/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c#L129-L141
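Since the failing mmap targets a file in the per-job OpenMPI session directory under /tmp, it might also be worth verifying that /tmp on the compute nodes is writable and not full; a simple (generic, not Juwels-specific) check could be:

# check free space and writability of /tmp on a compute node
srun -n 1 bash -c 'df -h /tmp && touch /tmp/sharedfp_probe.$$ && rm /tmp/sharedfp_probe.$$ && echo "/tmp is writable"'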

This seems to be part of the OpenMPI sharedfp framework, so it likely has its own controls that we can try: https://www.open-mpi.org/faq/?category=ompio#sharedfp-parametesrs

For more exhaustive tuning of I/O parameters, we recommend the utilization of the Open Tool for Parameter Optimization (OTPO), a tool specifically designed to explore the MCA parameter space of Open MPI.

That tool might be something for @AndiH? :)
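In the meantime, the available sharedfp components (sm, lockedfile, individual) and their tunables can be listed with ompi_info, assuming the same OpenMPI module is loaded:

# list the sharedfp components and their MCA parameters
ompi_info --param sharedfp all --level 9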

ax3l commented 2 years ago

I asked about additional --mca options that we could try to modify or skip the sharedfp framework component in https://github.com/open-mpi/ompi/issues/9656

damianam commented 2 years ago

@AndiH brought me here. I am probably the guy you want to talk to if you have problems on the JUWELS Booster and/or the MPIs there.

Bear with me; I am not an OpenMPI expert. We explicitly disable the ompio framework via the mpi-settings environment module:

$ ml show mpi-settings | grep setenv
setenv("EBROOTOPENMPIMINSETTINGS","/p/software/juwelsbooster/stages/2020/software/OpenMPI-settings/4.1CUDA")
setenv("EBVERSIONOPENMPIMINSETTINGS","4.1")
setenv("EBDEVELOPENMPIMINSETTINGS","/p/software/juwelsbooster/stages/2020/software/OpenMPI-settings/4.1CUDA/easybuild/MPI_settings-OpenMPI-4.1-mpi-settings-CUDA-easybuild-devel")
setenv("SLURM_MPI_TYPE","pspmix")
setenv("UCX_TLS","rc_x,cuda_ipc,gdr_copy,self,sm,cuda_copy")
setenv("UCX_MEMTYPE_CACHE","n")
setenv("UCX_MAX_RNDV_RAILS","1")
setenv("OMPI_MCA_mca_base_component_show_load_errors","1")
setenv("OMPI_MCA_mpi_param_check","1")
setenv("OMPI_MCA_mpi_show_handle_leaks","1")
setenv("OMPI_MCA_mpi_warn_on_fork","1")
setenv("OMPI_MCA_btl","^uct,openib")
setenv("OMPI_MCA_btl_openib_allow_ib","1")
setenv("OMPI_MCA_bml_r2_show_unreach_errors","0")
setenv("OMPI_MCA_coll","^ml")
setenv("OMPI_MCA_coll_hcoll_enable","1")
setenv("OMPI_MCA_coll_hcoll_np","0")
setenv("OMPI_MCA_pml","ucx")
setenv("OMPI_MCA_osc","^rdma")
setenv("OMPI_MCA_opal_abort_print_stack","1")
setenv("OMPI_MCA_opal_set_max_sys_limits","1")
setenv("OMPI_MCA_opal_event_include","epoll")
setenv("OMPI_MCA_btl_openib_warn_default_gid_prefix","0")
setenv("OMPI_MCA_io","romio321")

If you actively enable ompio, you are exploring uncharted territory for us. Regardless of that, it seems like the issue pops up when using the sm component of the sharedfp framework. Did you try disabling that component?

export OMPI_MCA_sharedfp="^sm"

Alternatively, enable one of the other components (lockedfile or individual) exclusively.

Not claiming that this is a fix, but it could be a step forward. Are you by any chance doing IO from GPU buffers? I wonder if that could play a role.

Regarding the OTPO tool: my understanding is that this is a user-space tool that everyone can use to tweak OpenMPI for their particular case (i.e., no admin intervention is necessary to benefit from it). I would be interested in learning more about it, but realistically the chances that I can take a deep dive are slim.

I typically ignore GitHub notifications nowadays, since I am not actively involved in GitHub projects anymore. I will keep an eye on this one for the next couple of days, but feel free to ping me via other channels if I don't react.

ax3l commented 2 years ago

Thank you @damianam and thanks for chiming in!

@MaxThevenet can you try this?

export OMPI_MCA_sharedfp="^sm"

And @damianam you say we should try

export OMPI_MCA_sharedfp="lockedfile"

and

export OMPI_MCA_sharedfp="individual"

as alternative strategies?

Are you by any chance doing IO from GPU buffers? I wonder if that could play a role.

We absolutely are at the moment, yes. #2097

Is your MPI GPU-aware? Otherwise, we could also experiment with adding --mca mpi_leave_pinned 0 to mpiexec/srun; I have seen pinning race issues in I/O with PIConGPU in the past.
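Since these jobs are launched with srun rather than mpiexec, the equivalent experiment would set the MCA parameter through the environment in the submission script:

# env-var equivalent of "--mca mpi_leave_pinned 0"
export OMPI_MCA_mpi_leave_pinned=0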

AndiH commented 2 years ago

Our MPI is CUDA-aware, yes; you can see it in the UCX_TLS variable that @damianam grepped above.

MaxThevenet commented 2 years ago

Thanks for looking into it! I tried

export OMPI_MCA_sharedfp="^sm"
srun -n 4 --cpu_bind=sockets $HOME/src/warpx/build/bin/warpx.3d.MPI.CUDA.DP.OPMD inputs &> output_sm.txt
export OMPI_MCA_sharedfp="lockedfile"
srun -n 4 --cpu_bind=sockets $HOME/src/warpx/build/bin/warpx.3d.MPI.CUDA.DP.OPMD inputs &> output_lf.txt
export OMPI_MCA_sharedfp="individual"
srun -n 4 --cpu_bind=sockets $HOME/src/warpx/build/bin/warpx.3d.MPI.CUDA.DP.OPMD inputs &> output_in.txt

But all runs failed with similar errors.

MaxThevenet commented 2 years ago

If it helps, I can provide instructions to install the code etc. to make a simple and quick reproducer (although the main files are already in the issue description). For now, I found a workaround so users can keep going: use ADIOS2 output rather than HDF5. I installed ADIOS2 from source in my $HOME, and this seems to be working well.
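For reference, switching an openPMD diagnostic from HDF5 to ADIOS2 is a small change in the WarpX inputs file; a sketch, assuming the diagnostic is named injection as in the error messages above and that openPMD-api was built with ADIOS2 support:

# use the ADIOS2 (bp) backend instead of HDF5 (h5) for the openPMD output
injection.format = openpmd
injection.openpmd_backend = bp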