flux-framework / flux-pmix

flux shell plugin to bootstrap openmpi v5+
GNU Lesser General Public License v3.0

Comments/suggestions for registering an nspace #58

Open rhc54 opened 2 years ago

rhc54 commented 2 years ago

Looking at the issues list, it appears you may have been basing your work on the Slurm PMIx plugin. Understandable, but that is actually a poor choice, as it (a) is missing things that OMPI and CH4 need/want, and (b) has some errors in it. We've been trying for some time to get them to fix it, without success. Both OMPI and CH4 will run with that plugin - they both just have to make assumptions that may reduce performance or require a greater exchange of information at startup.

FWIW: OMPI will not fall back to a data exchange. If it cannot get info via PMIx, it simply makes a lowest-common-denominator assumption. I don't know CH4's policy, but I suspect it does the same based on my last view of it.

The better "model" would be to use what is in PRRTE - see the code here. The Standard itself is also a good source for what procs are expecting to see (relevant section is here).

I know you don't currently support MPI Sessions, Spawn, and fault tolerance, so we can safely ignore the related values. The PMIx library will "backfill" some things - e.g., we will internally use the regex maps you provide to compute the local peers, local size, etc. if you don't give them to us. However, there are some things we cannot backfill for you that both programming libraries can use, if you have the ability to provide them.

For example, if you know the topology of each node, then you could pass down the cpuset of each process in the job and its device distances for NICs on its node (there is a PMIx API for computing the distances for you). This allows each process to determine which interface its peer will be using, which in turn allows for things like collective optimization (CH4 uses this - OMPI probably will soon as well).
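
To make that concrete, here is a rough sketch of computing NIC distances for one task. It assumes OpenPMIx v4+, that the shell already has the node's hwloc topology and the task's cpuset in hand, and it omits error handling and the final packing of the results into the registration data - the exact struct fields and device-type constants should be checked against pmix.h:

```c
#include <stdio.h>
#include <hwloc.h>
#include <pmix.h>

/* Compute this task's distance to each NIC, given the node topology and
 * the cpuset the task will be bound to (both assumed already known). */
static void show_nic_distances(hwloc_topology_t topo, hwloc_cpuset_t cpus)
{
    pmix_topology_t ptopo = { .source = "hwloc", .topology = topo };
    pmix_cpuset_t pcpus = { .source = "hwloc", .bitmap = cpus };
    pmix_device_distance_t *dist = NULL;
    size_t ndist = 0, n;

    /* no directives => distances to all device types */
    if (PMIX_SUCCESS != PMIx_Compute_distances(&ptopo, &pcpus, NULL, 0,
                                               &dist, &ndist))
        return;
    for (n = 0; n < ndist; n++) {
        if (dist[n].type & PMIX_DEVTYPE_NETWORK)   /* keep only NICs */
            printf("%s: min=%u max=%u\n", dist[n].osname,
                   (unsigned)dist[n].mindist, (unsigned)dist[n].maxdist);
    }
    /* in the real plugin, the array would go under PMIX_DEVICE_DISTANCES
     * (and the cpuset under PMIX_CPUSET) in that proc's registration data;
     * release dist with the matching PMIx macro when done */
}
```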

It is also good to let PMIx help you setup the application via the PMIx_server_setup_application API. This is where the support for Slingshot and friends is done. It needs to be executed by the equivalent of mpirun and the returned data included in the "blob" sent to the daemons that will be hosting application processes. If you pass in the programming library name (e.g., "ompi" or "mpich"), PMIx will add whatever supporting info those libraries require/desire. In addition, we can assign network security keys and even assign endpoints (CH4 particularly uses that feature). There are a couple of APIs you would need to call on your backend daemons when fork/exec'ing the application processes - I can walk you thru them if you want to pursue it.
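
Glossing over the async plumbing, the flow looks roughly like this - send_blob_to_shells() is just a placeholder for however flux would distribute the returned data, and "ompi" stands in for whatever model the user requested:

```c
#include <pmix_server.h>

/* hypothetical: however flux would ship the returned info to the shells */
extern void send_blob_to_shells(pmix_info_t *info, size_t ninfo);

static void setup_app_cb(pmix_status_t status, pmix_info_t info[], size_t ninfo,
                         void *provided_cbdata, pmix_op_cbfunc_t release,
                         void *release_cbdata)
{
    (void)provided_cbdata;
    if (PMIX_SUCCESS == status)
        send_blob_to_shells(info, ninfo);      /* the "blob" for the daemons */
    if (NULL != release)
        release(PMIX_SUCCESS, release_cbdata); /* tell PMIx we're done with info */
}

/* launcher side (the "mpirun equivalent") */
static pmix_status_t setup_app(const char *nspace)
{
    static pmix_info_t directive;  /* must remain valid until the callback runs */

    /* name the programming library so its pmdl plugin can add what it wants */
    PMIX_INFO_LOAD(&directive, PMIX_PROGRAMMING_MODEL, "ompi", PMIX_STRING);
    return PMIx_server_setup_application(nspace, &directive, 1,
                                         setup_app_cb, NULL);
}

/* daemon/shell side: apply the blob before registering the nspace and
 * calling setup_fork for the local tasks */
static void local_done(pmix_status_t status, void *cbdata)
{
    (void)status; (void)cbdata;
}

static pmix_status_t setup_local(const char *nspace, pmix_info_t *blob, size_t nblob)
{
    return PMIx_server_setup_local_support(nspace, blob, nblob, local_done, NULL);
}
```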

Again, I can point you to the relevant PRRTE areas if you want to use them as a guide - or I'm happy to provide advice. I suspect the "setup application" step and distributing the returned info is the main thing that will impact you, as otherwise you'd have to generate the info each library wants yourself - and that can be a bit of a moving target, which is why we built it into PMIx so everyone gets updated at the same time.

garlick commented 2 years ago

Thanks @rhc54, I actually did refer to openpmix, prrte, ompi v5.0.x, and slurm pmix plugins plus multiple pmix spec versions while writing this, but it was an uphill battle to spelunk and reconcile all that, so your guidance could be pretty useful at this point.

My goal was first to supply the bare minimum needed to bootstrap ompi v5, performance be damned, and then make incremental changes that have tangible benefits now, while tracking the latest openpmix releases. It sounds like you've identified

(Edit: that last issue ref was the wrong one, I guess I don't have one, but I meant to refer to supporting spectrum MPI, which does impact us.)

Anything else?

One thing I struggled with was the nspace setup. I ended up with a trial-and-error process to find things that actually seemed to make a difference, and where I landed was a vastly smaller set than prrte uses, for example. Are there specifics in that area that I should pay attention to? Since my testing was mainly on a single system, I'm quite sure there is more to do.

Other random comments

rhc54 commented 2 years ago

Where is the current CH4 work (with pmix support) occurring? I didn't see much pmix support in the mpich main branch on GitHub.

Intel is working on it for Aurora - I'm not sure how much of the code is public yet. I suspect Argonne controls that, so perhaps Ken could better answer any questions about it. There is substantial PMIx integration in it, including full support for "instant on".

Where is ompi+slingshot development occurring? I'm wondering if our people working on the CORAL2 procurement might like to be aware of that (if they aren't already).

The integration is actually at the PMIx level, not OMPI, and is being done by Intel (again, for Aurora). The idea was to implement fabric support only once and then be able to reuse it across host environments (we have multiple fabric plugins). The plugin will assign endpoints, provide device distances for all procs and NICs in the job, and assign security keys - basically, everything you need to do an "instant on" startup. However, it requires that you use the PMIx_server_setup_application (and backend friends) to make it work.

Is there anything in the way of pmix compliance testing that we could add to the CI for this project?

IBM has an entire CI suite they execute on PMIx and PRRTE - you might talk to Josh about it. I'm pretty sure all the tests are in the pmix-tests repository. Josh can help explain how it gets used.

rhc54 commented 2 years ago

One thing I struggled with was the nspace setup. I ended up with a trial-and-error process to find things that actually seemed to make a difference, and where I landed was a vastly smaller set than prrte uses, for example. Are there specifics in that area that I should pay attention to? Since my testing was mainly on a single system, I'm quite sure there is more to do.

Hmmm...I see the problem. Yeah, what is there isn't quite correct, but I can see how you might get to that point. Let me dig around - I believe I have documentation about the local proc spawn procedure.

rhc54 commented 2 years ago

PMIx_server_setup_application() (maybe that will help with configure does not grok IBM spectrum MPI #46?)

Hmmm...I'm not sure why you would need a configure test about Spectrum relative to this plugin? Is it because Spectrum needs some envars in order to operate, and you are trying to set them in flux? If so, then yes - we could add a "spectrum" plugin to PMIx and let PMIx handle all that so you don't need the complications in flux. It would be easy to do. You'd just need the user to tell you it is a "spectrum" app (e.g., --mpi=spectrum) and then pass that down to PMIx_server_setup_application - PMIx would take care of the rest.

garlick commented 2 years ago

Hmmm...I'm not sure why you would need a configure test about Spectrum relative to this plugin?

Sorry, my memory is confused on this - I don't think we've tried to launch spectrum applications with this plugin. We do it with the old flux MCA plugin in ompi, but by not using pmix we lose support for some interconnect enhancements. Regardless of whether it helps with spectrum, it seems like we should do it, and if I recall what I've seen in various examples, it's not too hard.

I'll check out pmix-tests, thanks!

garlick commented 2 years ago

Oh, and the purpose of the configure check for MPI mentioned in #46 is just for setting up the built-in testing, where we compile some built-in MPI test programs and make sure we can run them. It would be nice if we could check out this repo on the IBM system, run "make check", and show that it basically works with the native MPI there, albeit on one node only.

Bringing it up here was a non-sequitur, sorry!

rhc54 commented 2 years ago

Hmmm...I see the problem. Yeah, what is there isn't quite correct, but I can see how you might get to that point. Let me dig around - I believe I have documentation about the local proc spawn procedure.

This will take me a couple of days - need to put out a few fires. I'm writing it up as a "guide" on our wiki and will post the link here when done.

rhc54 commented 2 years ago

I'm sorry this took so long, but I finally put together a brief outline on the launch sequence:

https://github.com/openpmix/openpmix/wiki/PMIx-Launch-Sequence

Hopefully, that will help clarify things a bit. Please feel free to comment and/or make suggestions for improving it.

garlick commented 2 years ago

Thanks @rhc54.

I'll circle back and audit this code against your description as soon as I have time - thanks!

Meanwhile, one thing that remains a mystery is what subset of the nspace data I need for direct-launching openmpi v5. The doc references the spec (and v4.0, when the current one is v4.1?), but I had a lot of difficulty trying to determine what the minimal subset is, as well as reconciling differences between spec versions, and between the spec and actual code.

One quick question: the doc references the PMIX_PROGRAMMING_MODEL attribute for telling pmix to load a plugin specific to a version of MPI. That would seem to be an opportunity to get a hard failure if something isn't set up right for that MPI. Is that a true statement and does a model plugin exist for openmpi?

rhc54 commented 2 years ago

Meanwhile, one thing that remains a mystery is what subset of the nspace data I need for direct-launching openmpi v5. The doc references the spec (and v4.0, when the current one is v4.1?), but I had a lot of difficulty trying to determine what the minimal subset is, as well as reconciling differences between spec versions, and between the spec and actual code.

I'll create a wiki page that lists the minimal required, desired, and "nice to have" values. We used to have that but it got put somewhere non-obvious during the transfer of the pmix.org web site from me to the standard's group. Undoubtedly needs updating anyway.
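
Just to give a feel for it in the meantime - this is illustrative, not the authoritative list; the attribute names are real PMIx keys, but which ones are strictly required is what that page will spell out:

```c
#include <pmix_server.h>

/* Illustrative only: a handful of the job-level attributes an RM typically
 * registers. "size"/"local" are the job and node-local proc counts, and the
 * map strings come from PMIx_generate_regex()/PMIx_generate_ppn(). */
static pmix_status_t register_job(const char *nspace, uint32_t size,
                                  uint32_t local, char *nodemap, char *procmap,
                                  char *local_peers)
{
    pmix_info_t info[6];

    PMIX_INFO_LOAD(&info[0], PMIX_JOB_SIZE, &size, PMIX_UINT32);
    PMIX_INFO_LOAD(&info[1], PMIX_UNIV_SIZE, &size, PMIX_UINT32);
    PMIX_INFO_LOAD(&info[2], PMIX_LOCAL_SIZE, &local, PMIX_UINT32);
    PMIX_INFO_LOAD(&info[3], PMIX_NODE_MAP, nodemap, PMIX_STRING);
    PMIX_INFO_LOAD(&info[4], PMIX_PROC_MAP, procmap, PMIX_STRING);
    PMIX_INFO_LOAD(&info[5], PMIX_LOCAL_PEERS, local_peers, PMIX_STRING);
    /* per-proc data (PMIX_LOCAL_RANK, PMIX_NODE_RANK, ...) goes in
     * PMIX_PROC_DATA arrays, one per local rank; real code should also
     * pass a completion callback rather than NULL */
    return PMIx_server_register_nspace(nspace, (int)local, info, 6, NULL, NULL);
}
```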

One quick question: the doc references the PMIX_PROGRAMMING_MODEL attribute for telling pmix to load a plugin specific to a version of MPI. That would seem to be an opportunity to get a hard failure if something isn't set up right for that MPI. Is that a true statement and does a model plugin exist for openmpi?

Yeah, it can be a hard failure. The idea here is to relieve the RM of having to know how to handle every implementation of every programming model out there. If we put their needs into PMIx, then all the launchers and RMs that speak PMIx will automatically be able to support that model/implementation - and we only have to update one place when something changes.

OMPI does indeed have a plugin, as does mpich (not in a release branch yet, but coming soon), the openshmem reference implementation (from Stony Brook), and SOS (which NVIDIA is now transitioning into their own oshmem library). I've poked the IBM team about getting a spectrum plugin, as they have some differences from base OMPI.

garlick commented 2 years ago

Thanks - any pointers to non-release branches containing those plugins (esp. openmpi)?

rhc54 commented 2 years ago

OMPI is in the release branches - see https://github.com/openpmix/openpmix/blob/master/src/mca/pmdl/ompi/pmdl_ompi.c.

You can take a peek at the MPICH one as well if you like: https://github.com/openpmix/openpmix/blob/master/src/mca/pmdl/mpich/pmdl_mpich.c

garlick commented 2 years ago

Interesting. So maybe if I try building this plugin against openpmix master and set the model to ompi, I will achieve enlightenment? Or at least a hard error that I can track down in a methodical way?

rhc54 commented 2 years ago

I don't think that will work - IIRC, you call "setup_fork" before you call "register_nspace" - correct? It has to be the other way around, and you have to call "setup_fork" for each process before you spawn it as the envars pmdl/ompi make are proc-specific. I'm not sure if OMPI will bark about missing the information, or simply quietly ignore its lack (and lose some capability). Believe it is the former, but I honestly can't be sure any more.

garlick commented 2 years ago

I don't think that will work - IIRC, you call "setup_fork" before you call "register_nspace" - correct? It has to be the other way around, and you have to call "setup_fork" for each process before you spawn it as the envars pmdl/ompi make are proc-specific.

No, it's the other way around from what you recall, though it may not be obvious because they are called from flux-specific callbacks: first, the shell.init callback for the pmix shell plugin calls register_nspace, then the task.init callback calls setup_fork for each task before it is forked.
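
Stripped of the flux plumbing, the order is roughly the following (callback handling and error checks omitted; the function names are just stand-ins for what the shell plugin callbacks do):

```c
#include <pmix_server.h>

/* shell.init: register the job's nspace once, before any task starts */
static pmix_status_t on_shell_init(const char *nspace, int nlocal,
                                   pmix_info_t *job_info, size_t ninfo)
{
    return PMIx_server_register_nspace(nspace, nlocal, job_info, ninfo,
                                       NULL, NULL);
}

/* task.init: per task, let PMIx (and any pmdl plugin) add its environment
 * just before the task is forked */
static pmix_status_t on_task_init(const char *nspace, pmix_rank_t rank,
                                  char ***env)
{
    pmix_proc_t proc;

    PMIX_LOAD_PROCID(&proc, nspace, rank);
    return PMIx_server_setup_fork(&proc, env);
}
```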

I'm not sure if OMPI will bark about missing the information, or simply quietly ignore its lack (and lose some capability). Believe it is the former, but I honestly can't be sure any more.

The code in ompi that I audited before treated practically every pmix call failure as non-fatal, so bummer if this turns out to be the same. Maybe this is not going to be a great approach, and this:

I'll create a wiki page that lists the minimal required, desired, and "nice to have" values. We used to have that but it got put somewhere non-obvious during the transfer of the pmix.org web site from me to the standard's group. Undoubtedly needs updating anyway.

is the way to go.

Thanks for taking the time to explain things.

rhc54 commented 2 years ago

I'm not sure what you mean about the approach. OMPI accesses PMIx info in a bunch of places. There is the "main" RTE init area in ompi_rte.c, but that just handles the global things OMPI wants. Various components inside OMPI obtain data via PMIx for stuff like determining if shared memory can be used, and if so, where the rendezvous file will be and who is creating it. Hence my statement that it might not bark, but you could lose performance.

The pmdl/ompi plugin doesn't provide all the data OMPI needs - it just provides some support that is more OMPI-specific (like the MPI-3 data in OMPI-specific names). The RM still has to provide the nspace registration data about the job, app, and procs - OMPI will definitely fail without those. But then again, so would MPICH and OSHMEM.

So you need both things in parallel if you want to fully support these libraries. PMIx provides the former as those are truly implementation specific, and there is no point in having every RM implement them. The latter is implementation-agnostic, and so it makes sense for the RM to do those.

garlick commented 2 years ago

I'm not sure what you mean about the approach.

I meant the approach for learning which of the nspace values flux-pmix is missing are the most important to add for ompi v5 support. Some way of deducing those would be nice, but someone telling me would be fine.

If everything listed as "required" in the nspace section of the 4.1 spec is what I need to add, then so be it, but it wasn't clear to me that this would be either necessary or sufficient to make ompi v5 bulletproof on a production HPC cluster. Given limited development time, I need to choose the next chunk of effort here with the most payoff.

So basically any guidance you can provide towards that end would be much appreciated.

rhc54 commented 2 years ago

Gotcha - will do!

rhc54 commented 2 years ago

I have updated the launch sequence page to include a list of required, optional, and bare minimum data. The bare minimum listing is here.

garlick commented 2 years ago

Exactly what I was hoping for! Thanks so much.

rhc54 commented 2 years ago

I'll continue to fine-tune it a bit over the next few days - I think some of these are now provided for you by the PMIx server library (e.g., I think we provide the PMIX_TOPOLOGY_STRING so you don't need to explicitly do so). Just need to do some checking.