Need nonblocking window creation mechanism for persistent collective implementation

dholmes-epcc-ed-ac-uk commented 5 years ago

Goal: layering persistent collectives (with a nonblocking initialisation function) on top of OSC functionality.

Problem: all window creation functions are blocking.

Suggestion: the input and output buffers for persistent collective operations are supplied to the initialisation function by the user - this suggests that we will only need dynamic RMA windows, i.e. we would like MPI_WIN_ICREATE_DYNAMIC to exist.

Detail: The top-level MPI_Win_create_dynamic (the current blocking function) queries for the best OSC component, based on the given input parameters, and selects it. Selecting the OSC component calls ompi_base_select->osc_select->whatever_function_the_component_chooses. All components duplicate the input communicator - we know that step can be done nonblocking because MPI_Comm_idup exists.

The sm component only deals with windows with flavor of shared memory - so it is not in scope for this issue.
The pt2pt component is layered on top of MPI point-to-point functionality - so it is not in scope for this issue (because the existing implementation of persistent collectives is already layered directly on top of MPI point-to-point).
The Portals component also does comm_bcast, comm_allreduce, and several functions called PtlCTAlloc, PtlMDBind, and PtlMEBind. We know that the comm collectives have nonblocking counterparts but what do the Ptl* functions do?
The UCX component also does comm_allreduce, comm_allgather, comm_barrier, and a function called opal_common_ucx_wpctx_create. We know that the comm collectives have nonblocking counterparts but what does opal_common_ucx_wpctx_create do?
The RDMA component looks like the best candidate for our purposes, although we need to investigate what the ompi_osc_rdma_* functions do.

Notes:

We should assume that persistent collective code will always specify the same_disp_unit INFO key.
We should assume that persistent collective code will often specify the same_size INFO key - except for the vector variants, e.g. MPI_ALLTOALLV.
We will need to choose between using PSCW and passive target in the persistent collective code.
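
To make the two INFO-key notes concrete, here is a minimal sketch of how the hints could be chosen per collective. Everything in it is hypothetical (coll_kind_t, info_hint_t, window_hints_for); it does not call MPI and only encodes the rule stated above: same_disp_unit always, same_size only for the non-vector variants.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of a persistent collective. */
typedef enum { COLL_REGULAR, COLL_VECTOR } coll_kind_t;

/* Hypothetical key/value pair that would later be copied into an MPI_Info. */
typedef struct {
    const char *key;
    const char *value;
} info_hint_t;

/* Fill hints[] (capacity must be at least 2) and return how many were written. */
size_t window_hints_for(coll_kind_t kind, info_hint_t *hints, size_t capacity)
{
    size_t n = 0;
    assert(hints != NULL && capacity >= 2);

    /* Assumed to hold for every persistent collective. */
    hints[n++] = (info_hint_t){ "same_disp_unit", "true" };

    /* Vector variants (e.g. MPI_ALLTOALLV) may use different sizes per rank. */
    if (kind == COLL_REGULAR) {
        hints[n++] = (info_hint_t){ "same_size", "true" };
    }
    return n;
}
```

In a real implementation each returned pair would presumably be passed to MPI_Info_set before the window is created.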

dholmes-epcc-ed-ac-uk commented 5 years ago

This is not strictly necessary because the definition of the persistent collective initialisation functions allows them to be blocking and/or synchronising - so calling the existing MPI_Win_create_dynamic function inside the MPI_coll_INIT function implementation is permitted.

The definition of the persistent collective initialisation functions is likely to change in future (to "must be local"), which is why this issue should be investigated and, hopefully, solved.

dholmes-epcc-ed-ac-uk commented 5 years ago

@lcebaman ignore this issue for now.

lcebaman commented 5 years ago

Ideally we would have a parameter here that indicates if the component is nonblocking or not:

int ompi_osc_base_select(ompi_win_t *win,
                         void **base,
                         size_t size,
                         int disp_unit,
                         ompi_communicator_t *comm,
                         opal_info_t *info,
                         int flavor,
                         int *model);

Refactoring this function could be risky (called by many other functions). Should we create an ompi_osc_base_iselect instead?

lcebaman commented 5 years ago

I retract my statement, ompi_osc_base_select is only called from win.c so I think it is worth choosing blocking/nonblocking inside it.
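
A rough sketch of what choosing blocking/nonblocking inside the selection could look like. This is not the real ompi_osc_base_select: osc_component_t, its supports_nonblocking field, and osc_select_sketch are all made up for illustration, and the "highest priority wins" rule is an assumption about how candidates are ranked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an OSC component descriptor. */
typedef struct {
    const char *name;
    int priority;               /* higher wins; negative means "cannot serve this window" */
    bool supports_nonblocking;  /* assumed capability flag, not an existing Open MPI field */
} osc_component_t;

/* Return the index of the best usable component, or -1 if none qualifies. */
int osc_select_sketch(const osc_component_t *components, size_t count,
                      bool want_nonblocking)
{
    int best = -1;
    assert(components != NULL || count == 0);

    for (size_t i = 0; i < count; ++i) {
        if (components[i].priority < 0) {
            continue;  /* component declined this window */
        }
        if (want_nonblocking && !components[i].supports_nonblocking) {
            continue;  /* caller needs a nonblocking creation path */
        }
        if (best < 0 || components[i].priority > components[best].priority) {
            best = (int)i;
        }
    }
    return best;
}
```

With a single entry point like this, the existing blocking callers would pass false and only the nonblocking window creation path would pass true, so no separate iselect function would be needed.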

lcebaman commented 5 years ago

I am concerned about calling this function after MPI_Comm_idup:

    /* find rdma capable endpoints */
    ret = ompi_osc_rdma_query_btls (module->comm, &module->selected_btl);

It is likely that we need to wait until idup is complete to call this function. Are those endpoints identical in oldCOMM and newCOMM? If so, we could then query btls with the old communicator.

dholmes-epcc-ed-ac-uk commented 5 years ago

I think that call is attempting to select one of the many BTL modules that provides RDMA functionality. I think the code comment should say "modules" rather than "endpoints".

You cannot use any communicator before it is fully created. The MPI_COMM_IDUP for comm must be complete before you can use comm in any other calls.

Thus, you are right that this call to find a BTL module must be delayed until the communicator duplication has completed.
