LLNL / conduit

Simplified Data Exchange for HPC Simulations
https://software.llnl.gov/conduit/

Expand support for completing non-blocking requests #989

Open gunney1 opened 2 years ago

gunney1 commented 2 years ago

I'd like to have more flexible ways to complete conduit::relay::mpi::Requests from non-blocking communications. Specifically,

  1. support the way applications store Requests and
  2. provide a choice between MPI_Waitsome and MPI_Testsome for completing the MPI side of the requests.

Conduit's "wait" functions require a C array of Requests. Building such an array requires reallocating memory when we run out of space, but reallocating would invalidate the pointers already handed to MPI. In Axom we use a std::list<Request> so that adding more Requests never invalidates existing pointers. If the Conduit "wait" functions accepted Request *requests[] instead of just Request requests[], we could use them. We're currently writing our own "wait" code, but since we're using conduit::relay::mpi::Request, it would make sense to let Conduit complete the request.
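
For illustration, a pointer-array overload could look something like this. The name wait_all_send and the exact signature are just a sketch, not Conduit's actual API; it only assumes Request exposes its MPI handle as m_request, as in the code below.

// Hypothetical sketch only: a wait overload that takes an array of pointers
// to Requests, so the Request objects themselves can live in non-contiguous
// storage such as a std::list.  Name and signature are illustrative, not
// Conduit API; Request::m_request is assumed to be the raw MPI handle.
#include <mpi.h>
#include <vector>
#include "conduit_relay_mpi.hpp"

int wait_all_send(int count,
                  conduit::relay::mpi::Request *requests[],
                  MPI_Status statuses[])
{
  // MPI_Waitall needs a contiguous array of handles, so gather them
  // into a scratch buffer...
  std::vector<MPI_Request> mpi_reqs(count);
  for(int i = 0; i < count; i++)
    mpi_reqs[i] = requests[i]->m_request;

  int rc = MPI_Waitall(count, mpi_reqs.data(), statuses);

  // ...and copy the completed (typically MPI_REQUEST_NULL) handles back.
  for(int i = 0; i < count; i++)
    requests[i]->m_request = mpi_reqs[i];

  return rc;
}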

The second part is for Conduit to support waitsome and testsome for checking requests. The former always completes at least one outstanding request (if there are any); the latter is non-blocking. We currently do it this way:

void check_send_requests(std::list<conduit::relay::mpi::Request>& isendRequests,
                         bool atLeastOne) const
{
  std::vector<MPI_Request> reqs; // Intermediate request container for MPI.
  for(auto& isr : isendRequests) reqs.push_back(isr.m_request);

  int inCount = static_cast<int>(reqs.size());
  int outCount = 0;
  std::vector<int> indices(reqs.size(), -1);
  if(atLeastOne)
    MPI_Waitsome(inCount,
                 reqs.data(),
                 &outCount,
                 indices.data(),
                 MPI_STATUSES_IGNORE);
  else
    MPI_Testsome(inCount,
                 reqs.data(),
                 &outCount,
                 indices.data(),
                 MPI_STATUSES_IGNORE);
  if(outCount == MPI_UNDEFINED) outCount = 0; // List had no active handles.
  indices.resize(outCount);
  // Remove the completed requests, relying on MPI returning the indices in
  // ascending order.  (Since they are sends, nothing else is required.)
  auto reqIter = isendRequests.begin();
  int prevIdx = 0;
  for(const int idx : indices)
  {
    for(; prevIdx < idx; ++prevIdx) ++reqIter;
    reqIter = isendRequests.erase(reqIter);
    ++prevIdx;
  }
}

Obviously, if Conduit supported lists of Requests, we wouldn't need to build the intermediate container to pass to the wait functions. But there's probably no way to do that without templates.
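
For what it's worth, a container-generic version might look roughly like this. It's purely hypothetical; Conduit would still build a contiguous scratch array of MPI_Request internally, since MPI requires one, but the caller wouldn't have to.

// Hypothetical sketch: a wait that accepts any container of
// conduit::relay::mpi::Request (std::list, std::vector, std::deque, ...).
// Not Conduit API; assumes Request::m_request is the raw MPI handle.
#include <mpi.h>
#include <vector>

template <typename RequestContainer>
int wait_all(RequestContainer &requests)
{
  // MPI still needs a contiguous array of handles, but the copy is now
  // hidden inside the library instead of being the caller's problem.
  std::vector<MPI_Request> mpi_reqs;
  for(auto &r : requests)
    mpi_reqs.push_back(r.m_request);

  int rc = MPI_Waitall(static_cast<int>(mpi_reqs.size()),
                       mpi_reqs.data(),
                       MPI_STATUSES_IGNORE);

  // Propagate the completed (typically MPI_REQUEST_NULL) handles back.
  auto it = mpi_reqs.begin();
  for(auto &r : requests)
    r.m_request = *it++;

  return rc;
}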

cyrush commented 2 years ago

Instead of an all-in-one method, how about adding wait_some and test_some methods?

Would it be possible to also support this for recv, if we use std::list as the primary interface for passing the data?

gunney1 commented 2 years ago

Sorry for the late response.

Yes, I think if you support wait_some and test_some AND std::list as the primary interface for passing data, it would be what we need.

cyrush commented 2 years ago

@gunney1

Great, I started working on this. A question:

Your method checks whether the requests are fulfilled and removes the fulfilled ones from the list. (We can detect the send vs recv case and do the proper Node cleanup.)

For the general case - do we want to know which ones completed?

Do we want:

int wait_some(std::list<Request> &requests)

for rounds of communication,

or something like:

int wait_some(std::list<Request> &requests,
              std::vector<int> &out_indices)

If we want to take some action on the Nodes between wait_some calls, it seems like the second form would be necessary. But if we shrink the list, the out_indices won't correspond to the adjusted list: for example, if the entries at indices 1 and 3 complete and are removed, index 3 no longer refers to anything in the shrunken list.

gunney1 commented 2 years ago

It may be that I want to keep the completed requests around for some reason, although I probably don't care about the out_indices. What do you think of moving the completed requests out of the input list and into another list that gets returned? Something like

std::list<Request> check_recv_requests(std::list<Request>& inRequests)
{
  std::list<Request> completed;
  std::list<Request>::iterator aCompletedRequest = ...;
  // splice() moves the completed Request into the output list without copying.
  completed.splice(completed.end(), inRequests, aCompletedRequest);
  ...
  return completed;
}

Would that work? Of course, you could also take both lists as arguments and return void.
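
To make it concrete, a fuller (still hypothetical) version of that sketch, filled in with MPI_Waitsome, could look like the following; it uses m_request as in my earlier snippet and ignores the Node cleanup a recv request would also need.

// Hypothetical sketch of the splice idea above, filled in with MPI_Waitsome.
// Not Conduit's implementation; assumes Request::m_request is the raw MPI
// handle and skips any recv-side Node cleanup.
#include <mpi.h>
#include <algorithm>
#include <iterator>
#include <list>
#include <vector>
#include "conduit_relay_mpi.hpp"

using conduit::relay::mpi::Request;

std::list<Request> wait_some(std::list<Request> &inRequests)
{
  std::list<Request> completed;
  if(inRequests.empty())
    return completed;

  // Contiguous scratch copy of the raw handles, as MPI requires.
  std::vector<MPI_Request> reqs;
  for(auto &r : inRequests)
    reqs.push_back(r.m_request);

  int outCount = 0;
  std::vector<int> indices(reqs.size(), -1);
  MPI_Waitsome(static_cast<int>(reqs.size()),
               reqs.data(),
               &outCount,
               indices.data(),
               MPI_STATUSES_IGNORE);
  if(outCount == MPI_UNDEFINED)  // No active handles were in the list.
    outCount = 0;
  indices.resize(outCount);
  std::sort(indices.begin(), indices.end());

  // Splice the completed entries out of the input list and into the
  // returned list, so the Request objects survive for the caller.
  auto it = inRequests.begin();
  int pos = 0;
  for(int idx : indices)
  {
    std::advance(it, idx - pos);
    auto next = std::next(it);
    completed.splice(completed.end(), inRequests, it);
    it = next;
    pos = idx + 1;
  }
  return completed;
}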

cyrush commented 2 years ago

Yes, that's a great idea! Style-wise, I would like to pass a ref to the list for the output as well.

cyrush commented 1 year ago

@gunney1 Q: Did you look at https://github.com/LLNL/conduit/blob/1255e71b78314979c2484c0d553361ffdebbe849/src/libs/relay/conduit_relay_mpi.hpp#L281

I think it's a different comm pattern than the one you are using -- but it provides a very nice general interface for isends and irecvs.

gunney1 commented 1 year ago

@cyrush Very sorry for taking so long. After calling execute, how would I determine which non-blocking operation completed?

gunney1 commented 4 months ago

@cyrush Just getting back to this. I'm still interested in this feature. I'd like to be able to specify whether I'm willing to wait for at least one request to finish or just want to check whether any have actually finished -- basically, whether to use MPI_Waitsome or MPI_Testsome. Testsome is for when I'm not willing to wait but will take whatever has completed, possibly nothing. Waitsome is for when I'm willing to wait the minimum time to get at least one completion. I don't necessarily want to wait for all of them to finish, because I want to interleave work and waiting. In my implementation, I have a parameter, bool atLeastOne, that says whether I'm willing to wait for at least one to complete. Having both options would let me maximize the amount of local computation I can do while waiting.
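
For context, here is a rough caller-side sketch of how that flag gets used; check_send_requests is the routine from the top of this issue, and work_remaining()/do_some_work() are just placeholders for the application's local computation.

// Caller-side sketch of interleaving local work with request completion
// using the atLeastOne flag described above.  work_remaining() and
// do_some_work() are placeholders for application code, not real API.
while(!isendRequests.empty())
{
  if(work_remaining())
  {
    // Useful local work remains: poll with MPI_Testsome (non-blocking)
    // and go straight back to computing, even if nothing completed.
    check_send_requests(isendRequests, /*atLeastOne=*/false);
    do_some_work();
  }
  else
  {
    // Nothing left to compute: block in MPI_Waitsome until at least one
    // outstanding send completes.
    check_send_requests(isendRequests, /*atLeastOne=*/true);
  }
}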