flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

MPI implementation specific issues reported by SNL #3837

Open ryanday36 opened 2 years ago

ryanday36 commented 2 years ago

Eric Illescas (SNL) reports issues when using MPI implementations other than the one present during the Flux build:

The Flux version that ships with TOSS3 is MVAPICH-centric, and OpenMPI applications didn’t work. I did try srun ... --mpi=none, but I had inconsistent behavior and gave up.

Flux version w/ TOSS3:

commands:               0.25.0
libflux-core:           0.25.0
libflux-security:       0.4.0
build-options:          +hwloc==1.11.0
~/fluxtmp

I downloaded the master branch and built it locally with our default OMPI/1.x environment.

commands:               0.28.0-120-gfac664f
libflux-core:           0.28.0-120-gfac664f
broker:                 0.28.0-120-gfac664f
FLUX_URI:               local:///tmp/flux-Gk60kA/local-0
build-options:          +hwloc==1.11.0

Master branch: OMPI/1.x worked (default); IntelMPI/2018, MVAPICH2, OMPI/2.0, and OMPI/4.0 did not.

He also reported a problem running multiple MPI applications on the same node:

MPI applications on different nodes worked. MPI applications sharing the same node hung. I suspect a common communicator (MPI_COMM).

I'll see if I can reproduce this on LC clusters / environment.
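When reproducing, it may help to compare the two version reports quoted above field by field. The following is only a minimal sketch: it assumes the `key: value` layout shown in the pasted output, and the helper names `parse_flux_version` and `diff_versions` are hypothetical, not part of Flux.

```python
# Sketch: compare two pasted `flux version`-style reports.
# Assumption: each field is printed as "key: value" on its own line,
# as in the output quoted in this issue. Helper names are hypothetical.

TOSS3 = """\
commands:               0.25.0
libflux-core:           0.25.0
libflux-security:       0.4.0
build-options:          +hwloc==1.11.0
"""

MASTER = """\
commands:               0.28.0-120-gfac664f
libflux-core:           0.28.0-120-gfac664f
broker:                 0.28.0-120-gfac664f
FLUX_URI:               local:///tmp/flux-Gk60kA/local-0
build-options:          +hwloc==1.11.0
"""


def parse_flux_version(text):
    """Return a dict of the key/value fields found in the report text."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as URIs stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key: value" line (e.g. a stray shell prompt)
        info[key.strip()] = value.strip()
    return info


def diff_versions(old, new):
    """Return {field: (old_value, new_value)} for fields that differ."""
    fields = sorted(set(old) | set(new))
    return {
        f: (old.get(f), new.get(f))
        for f in fields
        if old.get(f) != new.get(f)
    }


toss3_info = parse_flux_version(TOSS3)
master_info = parse_flux_version(MASTER)
changes = diff_versions(toss3_info, master_info)

for field, (old, new) in changes.items():
    print(f"{field}: {old} -> {new}")
```

A field that is missing from one report shows up with `None` on that side, which makes it easy to spot that the TOSS3 report was not taken inside a running Flux instance (no `broker` or `FLUX_URI`).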

garlick commented 2 years ago

I sympathize, and we really want to get to the point where the out-of-the-box experience is good for all the MPIs seen in the wild. However, it is challenging.

A couple of notes

garlick commented 2 years ago

It would probably be good to open a separate issue for each version of each vendor MPI that you have problems with.

Edit: I mean open issues here, not in the MPI projects, unless it's their bug :-)