flux-framework / flux-coral2

Plugins and services for Flux on CORAL2 systems

Stability of Cray MPI plugin #109

Open mattaezell opened 11 months ago

mattaezell commented 11 months ago

The readme notes: "The plugins and scripts in flux-coral2 are being actively developed and are not yet stable." Is the Cray MPI part more stable? Supporting Cray MPI is of broad interest (outside of CORAL-2), so I'm curious whether that code makes sense to "graduate" into flux-core?

garlick commented 11 months ago

Hi Matt -

In general I think we're trying to avoid saddling flux-core with the weirdness that comes along for each advanced technology system, based on lessons learned with the Slurm code base over the years.

Note that there is still some pending work on support for Cray MPICH in the shasta stack:

* [libpals: improve port-distribution mechanism #28](https://github.com/flux-framework/flux-coral2/issues/28)
* [MPI: Integrate with HPE's CXI library for allocating VNIs #24](https://github.com/flux-framework/flux-coral2/issues/24)

And also note that we don't yet have this stack running in production, although we certainly have early adopters porting codes and running small jobs and such.

I believe Cray MPICH can also bootstrap with the "normal" libpmi2.so support offered by flux-core. In the shasta stack, it's not the way Cray wanted to go. Instead, Cray provides their own PMI implementation, which we have to bootstrap instead of directly bootstrapping Cray MPICH. (I apologize if this is old news to you - I know Oak Ridge has a long history with Cray!)

I think the CORAL-2 team is pretty focused on getting the rabbit support working right now for El Cap, so those MPI issues are on the back burner. The El Cap rollout demands are another reason why flux-coral2 is best kept in its own repo - it may need to change quickly and we don't want to have to push through a flux-core tag for every little thing that comes up on that schedule.

mattaezell commented 11 months ago

In general I think we're trying to avoid saddling flux-core with the weirdness that comes along for each advanced technology system, based on lessons learned with the Slurm code base over the years.

Understood. I consider Cray MPI support a little more generic than ATS, but I get the point.

Note that there is still some pending work on support for Cray MPICH in the shasta stack:

* [libpals: improve port-distribution mechanism #28](https://github.com/flux-framework/flux-coral2/issues/28)

I don't think this is an issue with flux-under-slurm since 2 different jobs can overlap their ports

* [MPI: Integrate with HPE's CXI library for allocating VNIs #24](https://github.com/flux-framework/flux-coral2/issues/24)

VNIs will be a problem. With flux-under-slurm all the flux jobs (steps here) in a Slurm job will share the Slurm-provided VNI. This can be problematic for concurrent steps, as there will be a conflict with the PID_BASE. Since we don't have a "global" flux running, we wouldn't have an arbiter to pass out VNIs even if we had a privileged way of doing it.

And also note that we don't yet have this stack running in production, although we certainly have early adopters porting codes and running small jobs and such.

This would just be experimental to support some workloads that want to run more concurrent steps than slurmctld can sensibly handle.

I believe Cray MPICH can also bootstrap with the "normal" libpmi2.so support offered by flux-core. In the shasta stack, it's not the way Cray wanted to go. Instead, Cray provides their own PMI implementation, which we have to bootstrap instead of directly bootstrapping Cray MPICH. (I apologize if this is old news to you - I know Oak Ridge has a long history with Cray!)

Ah. I tried to flux run a binary using Cray mpich and it just hung. I'll play around with options to see if I can get it to work, as that would be preferred to pulling in this plugin if I don't need it.

I think the CORAL-2 team is pretty focused on getting the rabbit support working right now for El Cap, so those MPI issues are on the back burner. The El Cap rollout demands are another reason why flux-coral2 is best kept in its own repo - it may need to change quickly and we don't want to have to push through a flux-core tag for every little thing that comes up on that schedule.

Understood. Thanks for the info so far.

garlick commented 11 months ago

VNIs will be a problem. With flux-under-slurm all the flux jobs (steps here) in a Slurm job will share the Slurm-provided VNI. This can be problematic for concurrent steps, as there will be a conflict with the PID_BASE. Since we don't have a "global" flux running, we wouldn't have an arbiter to pass out VNIs even if we had a privileged way of doing it.

Hmm, is it possible to disable VNIs for flux jobs (taking the place of slurm job steps) to get around this temporarily? Or is this a complete show stopper right now?

Ah. I tried to flux run a binary using Cray mpich and it just hung. I'll play around with options to see if I can get it to work, as that would be preferred to pulling in this plugin if I don't need it.

I can do a little testing on our end to see how far I get with that. You might need to specify -opmi=simple,libpmi2 on the flux run command line. -overbose=2 on a 2-task job is sometimes useful for getting a PMI trace. Finally, one potential source of problems could be environment variables set by Slurm "leaking through" to the MPI programs started by flux and confusing them.
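
If you want to rule out that last possibility, a quick comparison like this can show what is leaking through (the variable prefixes below are just the usual suspects, not an exhaustive list):

env | grep -E '^(SLURM_|PMI_|PMI2_|PALS_)'
flux run -n1 env | grep -E '^(SLURM_|PMI_|PMI2_|PALS_)'

The first command shows what the enclosing Slurm allocation set; the second shows what a flux-launched task actually inherits.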

garlick commented 11 months ago

Oops, I'm confusing my PMI client and server options. The above -opmi=simple,libpmi2 option is not going to help.

The goal should be to get the MPI program to find flux's libpmi2.so before Cray's, so you might have to set LD_LIBRARY_PATH to point to flux's libdir. For example, on my test system:

LD_LIBRARY_PATH=/usr/lib/aarch64-linux-gnu/flux flux run -n2 flux pmi -v --method=libpmi2 barrier
libpmi2: using /usr/lib/aarch64-linux-gnu/flux/libpmi2.so
libpmi2: initialize: rank=0 size=2 name=ƒciddyLYSZu: success
libpmi2: using /usr/lib/aarch64-linux-gnu/flux/libpmi2.so
libpmi2: initialize: rank=1 size=2 name=ƒciddyLYSZu: success
libpmi2: barrier: success
libpmi2: barrier: success
libpmi2: barrier: success
libpmi2: barrier: success
ƒciddyLYSZu: completed pmi barrier on 2 tasks in 0.000s.
libpmi2: finalize: success
libpmi2: finalize: success

flux pmi is a test client that is sometimes a useful stand-in for MPI when trying to verify simple things.
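
If the barrier succeeds but real MPI programs still hang, the put/get path is also worth exercising. Assuming your flux-core is new enough to have the exchange subcommand of the same test client (that's an assumption on my part), something like this would show whether key exchange through flux's libpmi2.so also works:

LD_LIBRARY_PATH=/usr/lib/aarch64-linux-gnu/flux flux run -n2 flux pmi -v --method=libpmi2 exchange

(The library path is from my test system; adjust for yours.)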

mattaezell commented 11 months ago

Hmm, is it possible to disable VNIs for flux jobs (taking the place of slurm job steps) to get around this temporarily? Or is this a complete show stopper right now?

I think it's an edge case that's only a problem for multi-node jobs sharing the same node. Most use cases are either small (sub-node, so many jobs on a node) or large (one job spans multiple nodes and fills each of them up).

The goal should be to get the MPI program to find flux's libpmi2.so before Cray's, so you might have to set LD_LIBRARY_PATH to point to flux's libdir.

I reinstalled without the coral2 plugins, and that seemed to work:

ezy@borg041:~> export LD_LIBRARY_PATH=~/flux/install/lib/flux:$LD_LIBRARY_PATH
ezy@borg041:~> ~/flux/install/bin/flux run -N2 -n4 ~/mpi_hello/mpihi
Hello World from 0 of 4
Hello World from 1 of 4
Hello World from 3 of 4
Hello World from 2 of 4
ezy@borg041:~>
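
One way to double-check which libpmi2.so actually gets picked up at run time is to ask the dynamic loader directly; LD_DEBUG here is just glibc's loader tracing, nothing flux-specific, and the binary is the same hello-world as above:

flux run -n1 env LD_DEBUG=libs ~/mpi_hello/mpihi 2>&1 | grep libpmi2

That should print the path of whichever libpmi2.so the loader resolved, confirming whether flux's copy is winning over Cray's.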

I don't know if Cray's PMI does anything special that flux's doesn't, but I'll head down this path. Thanks!

garlick commented 11 months ago

I think it's an edge case that's only a problem for multi-node jobs sharing the same node. Most use cases are either small (sub-node, so many jobs on a node) or large (one job spans multiple nodes and fills each of them up).

Makes sense - thanks.

I also did basically the same experiment you just did with a hello world program on one of our precursor systems and it worked ok.

Let us know if you run into more problems.

trws commented 5 months ago

Is this still an issue? I just realized that we have a known potential issue with not assigning VNIs, which could cause problems specifically when enough ranks of a job share a node.

garlick commented 5 months ago

You mean is flux-framework/flux-coral2#24 (VNI support) still an issue? Still on the back burner AFAIK.

trws commented 5 months ago

I was actually wondering whether we still see issues with multi-rank/multi-node jobs, because the VNI issue could cause that by exhausting resources on the NIC.

garlick commented 5 months ago

I am not aware of that ever being a problem, or else I have forgotten. It'd be good to open a separate, new issue on that if it is a problem or is likely to become one.

trws commented 5 months ago

We talked about it when I came back from ADAC: the NICs use the VNI to separate resources, so if too many things run without setting one on the NIC, it can lead to resource exhaustion. The MPI subdivides the range it's given, so normally that's OK, but if we're not bootstrapping the MPI normally, or if a user runs more than one multi-node job across the same set of nodes (or a single shared node, I suppose), then it could be an issue. I would have sworn we made an issue for it at the time, but my internet is failing right now and I'm just hoping I can post this. 😬 I'll either find it or make a new one.