EPiGRAM-HS / ompi

Open MPI main development repository
https://www.open-mpi.org

Need nonblocking window creation mechanism for persistent collective implementation #2

Open dholmes-epcc-ed-ac-uk opened 5 years ago

dholmes-epcc-ed-ac-uk commented 5 years ago

Goal: layering persistent collectives (with a nonblocking initialisation function) on top of OSC functionality.

Problem: all window creation functions are blocking.

Suggestion: the input and output buffers for persistent collective operations are supplied to the initialisation function by the user - this suggests that we will only need dynamic RMA windows, i.e. we would like MPI_WIN_ICREATE_DYNAMIC to exist.

Detail: The top-level MPI_Win_create_dynamic (the current blocking function) queries for the best OSC component, based on the given input parameters, and selects it. Selecting the OSC component calls ompi_base_select->osc_select->whatever_function_the_component_chooses. All components duplicate the input communicator - we know that step can be made nonblocking because MPI_Comm_idup exists.

Notes:

dholmes-epcc-ed-ac-uk commented 5 years ago

This is not strictly necessary because the definition of the persistent collective initialisation functions allows them to be blocking and/or synchronising - so calling the existing MPI_Win_create_dynamic function inside the MPI_coll_INIT function implementation is permitted.

The definition of the persistent collective initialisation functions is likely to change in future (to "must be local"), which is why this issue should be investigated and, hopefully, solved.

dholmes-epcc-ed-ac-uk commented 5 years ago

@lcebaman ignore this issue for now.

lcebaman commented 5 years ago

Ideally we would have a parameter here that indicates if the component is nonblocking or not:

int ompi_osc_base_select(ompi_win_t *win,
                         void **base,
                         size_t size,
                         int disp_unit,
                         ompi_communicator_t *comm,
                         opal_info_t *info,
                         int flavor,
                         int *model);

Refactoring this function could be risky (it is called by many other functions). Should we create an ompi_osc_base_iselect instead?

lcebaman commented 5 years ago

I retract my statement, ompi_osc_base_select is only called from win.c so I think it is worth choosing blocking/nonblocking inside it.
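
To make the idea concrete, here is a minimal, MPI-free sketch of selection with a blocking/nonblocking flag. Everything in it is hypothetical: the sketch_ types, the supports_nonblocking capability flag, and the assumption that selection is priority-based are illustrations only, not the actual ompi_osc_base_select code or its signature.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an OSC component; not an Open MPI type. */
typedef struct {
    const char *name;
    int priority;               /* assumed: higher wins, negative means "cannot serve" */
    bool supports_nonblocking;  /* assumed capability flag, not in the real code */
} sketch_osc_component_t;

/*
 * Pick the highest-priority usable component. When need_nonblocking is
 * true, components that can only initialise in a blocking way are skipped.
 * Returns NULL if no component qualifies.
 */
static const sketch_osc_component_t *
sketch_osc_select(const sketch_osc_component_t *components, size_t count,
                  bool need_nonblocking)
{
    const sketch_osc_component_t *best = NULL;

    for (size_t i = 0; i < count; i++) {
        const sketch_osc_component_t *c = &components[i];

        if (c->priority < 0) {
            continue;  /* component declined this window */
        }
        if (need_nonblocking && !c->supports_nonblocking) {
            continue;  /* caller needs a nonblocking-capable component */
        }
        if (best == NULL || c->priority > best->priority) {
            best = c;
        }
    }
    return best;
}
```

With a single flag like this, the existing blocking path would pass false and keep its current behaviour, while a nonblocking window creation would pass true. A NULL result gives the caller the option of falling back to the blocking path.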

lcebaman commented 5 years ago

I am concerned about calling this function after MPI_Comm_idup:

  /* find rdma capable endpoints */
  ret = ompi_osc_rdma_query_btls (module->comm, &module->selected_btl);

We probably need to wait until the idup completes before calling this function. Are those endpoints identical in oldCOMM and newCOMM? If so, we could query the BTLs with the old communicator instead.

dholmes-epcc-ed-ac-uk commented 5 years ago

I think that call is attempting to select one of the many BTL modules that provides RDMA functionality. I think the code comment should say "modules" rather than "endpoints".

You cannot use any communicator before it is fully created. The MPI_COMM_IDUP for comm must be complete before you can use comm in any other calls.

Thus, you are right that this call to find a BTL module must be delayed until the communicator duplication has completed.
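
The ordering constraint can be sketched as a small state machine. This is a toy, MPI-free model: the sketch_ types and functions are hypothetical stand-ins (sketch_request_test plays the role of MPI_Test on the idup request, sketch_query_btls the role of the BTL query), and the polls_needed counter only simulates how long the duplication takes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these are Open MPI types or functions. */
typedef enum {
    SKETCH_WIN_DUP_PENDING,  /* communicator duplication started, not complete */
    SKETCH_WIN_READY         /* duplication complete and BTL module selected */
} sketch_win_state_t;

typedef struct {
    int polls_left;          /* simulated number of test calls until completion */
} sketch_request_t;

typedef struct {
    sketch_win_state_t state;
    sketch_request_t dup_req;
    bool comm_usable;        /* false until the duplication has completed */
    int selected_btl;        /* -1 until a module has been selected */
} sketch_win_t;

/* Models MPI_Test on the idup request: returns true once it is complete. */
static bool sketch_request_test(sketch_request_t *req)
{
    if (req->polls_left > 0) {
        req->polls_left--;
    }
    return req->polls_left == 0;
}

/* Models the BTL query; it must never see an incomplete communicator. */
static int sketch_query_btls(const sketch_win_t *win)
{
    assert(win->comm_usable);
    return 0;  /* pretend module 0 provides RDMA */
}

/* Start the nonblocking creation: only kicks off the duplication. */
static void sketch_win_icreate_start(sketch_win_t *win, int polls_needed)
{
    win->state = SKETCH_WIN_DUP_PENDING;
    win->dup_req.polls_left = polls_needed;
    win->comm_usable = false;
    win->selected_btl = -1;
}

/* Advance the creation; returns true once the window is ready. */
static bool sketch_win_icreate_progress(sketch_win_t *win)
{
    if (win->state == SKETCH_WIN_READY) {
        return true;
    }
    if (!sketch_request_test(&win->dup_req)) {
        return false;  /* duplication still pending: do not touch comm yet */
    }
    win->comm_usable = true;
    win->selected_btl = sketch_query_btls(win);  /* deferred until now */
    win->state = SKETCH_WIN_READY;
    return true;
}
```

The point of the sketch is that the BTL query lives in the progress step, not the start step, so it cannot run before the duplicated communicator is usable.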