Sandia-OpenSHMEM / SOS

Sandia OpenSHMEM is an implementation of the OpenSHMEM specification over multiple Networking APIs, including Portals 4, the Open Fabric Interface (OFI), and UCX. Please click on the Wiki tab for help with building and using SOS.
Other
61 stars 53 forks source link

PMI2 support assumes it initializes job resources #928

Open benson31 opened 4 years ago

benson31 commented 4 years ago

This block is missing an else statement to set the size and rank variables when someone else (e.g., MPI) has initialized PMI2 for this job.

This came up in a simple MPI+OpenSHMEM program:

#include <mpi.h>
#include <shmem.h>
int main()
{
  MPI_Init(0,0);
  shmem_init();
  shmem_finalize();
  MPI_Finalize();
  return 0;
}

(Inverting the order of shmem_init and MPI_Init caused a bigger mess involving FPEs and very large core dumps...)

The compiler was GCC@7.3.0 and MPI ws OpenMPI@4.0.0, launched with srun.

A workaround was to configure with --enable-pmi-mpi. After manually setting CPPFLAGS, LDFLAGS, and LIBS, everything compiled fine and this example ran.

davidozog commented 4 years ago

Thanks for the report @benson31.

Yeah, SOS only supports MPI+OpenSHMEM via the --enable-pmi-mpi flag, and only with the initialization/finalization order that's shown in your example. (Looks like this is not yet documented - I'm impressed you figured it out).

You might be our first MPI+OpenSHMEM user, so I'm curious about your use-case if you can say more... Our developer resources can be pretty limited for solving these sorts of problems. The PMI over MPI solution was actually an intern's summer project. I don't see a straightforward way to get rank and size here in an else block, but I'm not familiar with PMI2..

Also, manually setting LDFLAGS and friends usually isn't necessary since we try to leverage RPATH as much as possible for required libraries. Are you saying you need to do this for the MPI library flags?

benson31 commented 4 years ago

@davidozog I haven't had a chance to clean it up at all, so please forgive the debugging-centric nature of the code, but this patch worked for me:

    if (!PMI2_Initialized()) {
      // As you have it already
    }
    else
    {
        int flag, pmi_ret;
        char size_str[128];
        pmi_ret = PMI2_Info_GetJobAttr("universeSize", size_str, 128, &flag);
        if (PMI2_SUCCESS == pmi_ret)
        {
            if (flag)
                size = atoi(size_str);
            else
            {
                printf("No universeSize; the docs say this is predefined, so something's wrong.\n");
            }
        }
        else
        {
            printf("PMI2_Info_GetJobAttr(...) failed with error code: %d\n", pmi_ret);
        }
        pmi_ret = PMI2_Job_GetRank(&rank);
        if (PMI2_SUCCESS != pmi_ret)
            printf("ERROR: PMI2_Job_GetRank failed with error code: %d\n");
    }

I was able to run our test (a simple "hello world" from all PEs) using PMI2 with this patch in place. I didn't try to run any of the other tests since that was sufficient for the person who asked me for help with this (and none of them have MPI anyway). It would obviously need to be cleaned up to match your error-handling format, etc, but I didn't have time to dig into that.

Re: the use-case. I work on a large-scale machine learning code, LBANN, and we are experimenting with the OpenSHMEM model for certain operations. The current use-case comes from @timmoon10 and the relevant code block is here.

Re: flags. When building with --enable-pmi-mpi, I found that mpi.h was not found when building SOS, so I set the CPPFLAGS to include the MPI include path. Then when building the test suite, there are undefined references to MPI symbols, so I added -Wl,-rpath,${MPI_DIR}/lib -L${MPI_DIR}/lib -lmpi to LDFLAGS, where MPI_DIR is the prefix of your MPI installation. After this, the tests linked fine and make check had a bunch of PASSes before I killed it. I'm not an autotools developer (I've just gotten good at lying to it, because it's autotools) so I can't be of more help on that front.

davidozog commented 3 years ago

@benson31 Thanks for the all the info, and sorry it's taken so long to reply.

I'm running through our backlog of issues towards a new SOS release and this came up. Is your patch to PMI2 still in use for LBANN (with @timmoon10)? Do you require PMI2 support when doing MPI+SHMEM?

In other words, I'm a little uncertain as to whether the --enable-pmi-mpi approach works for you, or if this is still an open feature request. Can we close this issue?

If we do incorporate the patch, we should probably take the time to do the same for PMI 1.0 and PMIx, but it's not a high priority on our end because we (arguably?) conform to the OpenSHMEM v1.5 specification's interoperability requirements with the --enable-pmi-mpi approach. That and... we don't have any MPI+SHMEM users besides yourselves. ;) Seems like a very cool app though!

timmoon10 commented 3 years ago

The patch is preferable since @benson31's workaround with --enable-pmi-mpi requires a lot of mucking around with the environment. We've focused our efforts on using NVSHMEM, so this isn't a critical requirement for us now. It would be nifty in the future though if we run on non-GPU systems.

davidozog commented 3 years ago

Oh yeah! I forgot to mention that believe I figured out the environment issue - you have to set CC to the MPI compiler (mpicc in my case) when passing --enable-pmi-mpi.

The brief reason for that is the SOS runtime is making MPI calls when you enable PMI-MPI, so you have to build with the MPI compiler wrapper. We're working on documenting this now. My apologies that it's been so poorly documented.

Does setting CC=mpicc during configuration work on your end?

Very cool. Side question - I looked quickly through the LBANN code base for an example using NVSHMEM, and I only see configury sorts of stuff (and nvshmem_finalize). Any chance you can point me to the kernel making SHMEM calls? (I'm generally curious about these sorts of GPU-initiated communication use cases).

benson31 commented 3 years ago

@davidozog Thanks for following up on this. I'll need a little time to revive all these tests and remember what I was doing, but when I do, I'll give CC=mpicc a shot.

Re NVSHMEM in LBANN, this file has the only real use-case in LBANN.

davidozog commented 1 year ago

@benson31 and @timmoon10 - have you tried compiling with CC=mpicc when doing --enable-pmi-mpi?

I want to verify that the environment issues only came about because building with CC=mpicc is a requirement when enabling PMI over MPI in SOS.

I think we can close this issue, but it would be great if you could verify. The requirement is now documented on our troubleshooting wiki.

Can you think of any reasons to include the above patch? One reason I'm skeptical is because hard-setting the runtime-pmi2 size to "universeSize" seems like it's specialized to SHMEM+MPI hybrid executions, not the PMI2 runtime layer itself. It could break PMI2 executions, right? For hybrid programs, SOS's supported path is through --enable-pmi-mpi.