Open dholmes-epcc-ed-ac-uk opened 5 years ago
This is not strictly necessary because the definition of the persistent collective initialisation functions allows them to be blocking and/or synchronising - so calling the existing MPI_Win_create_dynamic function inside the MPI_coll_INIT function implementation is permitted.
The definition of the persistent collective initialisation functions is likely to change in future (to "must be local"), which is why this issue should be investigated and, hopefully, solved.
@lcebaman ignore this issue for now.
Ideally we would have a parameter here that indicates if the component is nonblocking or not:
    int ompi_osc_base_select(ompi_win_t *win,
                             void **base,
                             size_t size,
                             int disp_unit,
                             ompi_communicator_t *comm,
                             opal_info_t *info,
                             int flavor,
                             int *model);
Refactoring this function could be risky (it is called by many other functions). Should we create an ompi_osc_base_iselect instead?
I retract my statement: ompi_osc_base_select is only called from win.c, so I think it is worth choosing blocking/nonblocking inside it.
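One way to picture choosing blocking versus nonblocking inside the selection function is sketched below. This is not Open MPI code: `sketch_osc_select`, `sketch_request_t`, the component hooks, and the `nonblocking` flag are all hypothetical stand-ins, and the real ompi_osc_base_select takes the Open MPI types shown above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a request handle; not an Open MPI type. */
typedef struct {
    bool active;    /* true while window setup is still in flight */
} sketch_request_t;

/* Hypothetical blocking component hook: setup is finished on return. */
static int sketch_component_select(int flavor)
{
    (void)flavor;
    return 0;
}

/* Hypothetical nonblocking component hook: only starts the setup. */
static int sketch_component_iselect(int flavor, sketch_request_t *req)
{
    (void)flavor;
    assert(req != NULL);
    req->active = true;   /* the caller must complete the request later */
    return 0;
}

/*
 * Single entry point with a flag, instead of a separate "iselect" function.
 * Returns 0 on success and -1 if a nonblocking call has no request to fill.
 */
int sketch_osc_select(int flavor, bool nonblocking, sketch_request_t *req)
{
    if (!nonblocking) {
        if (req != NULL) {
            req->active = false;   /* nothing left to wait for */
        }
        return sketch_component_select(flavor);
    }
    if (req == NULL) {
        return -1;
    }
    return sketch_component_iselect(flavor, req);
}
```

With a single caller in win.c, the flag keeps both paths in one place; a separate function would only pay off if many call sites had to stay untouched.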
I am concerned about this function call, which comes after the call to MPI_Comm_idup:
/* find rdma capable endpoints */
ret = ompi_osc_rdma_query_btls (module->comm, &module->selected_btl);
It is likely that we need to wait until idup is complete to call this function. Are those endpoints identical in oldCOMM and newCOMM? If so, we could then query btls with the old communicator.
I think that call is attempting to select one of the many BTL modules that provides RDMA functionality. I think the code comment should say "modules" rather than "endpoints".
You cannot use any communicator before it is fully created. The MPI_COMM_IDUP for comm must be complete before you can use comm in any other calls.
Thus, you are right that this call to find a BTL module must be delayed until the communicator duplication has completed.
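The ordering constraint can be sketched as a small progress-driven state machine. This is a mock in plain C, not Open MPI code: every name here (`win_init_t`, `win_init_start`, `win_init_progress`, `mock_dup_test`, `mock_query_btls`) is hypothetical, and the poll counter merely stands in for testing the MPI_Comm_idup request for completion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names are hypothetical; they model the ordering only. */
typedef enum {
    WIN_INIT_DUP_PENDING,   /* communicator duplication started, not complete */
    WIN_INIT_READY,         /* duplication complete and BTL module selected */
    WIN_INIT_FAILED         /* BTL query reported an error */
} win_init_state_t;

typedef struct {
    win_init_state_t state;
    int dup_polls_left;     /* mock: polls remaining until the dup completes */
    int btl_queries;        /* how many times the BTL query has run */
} win_init_t;

/* Mock completion test, playing the role of testing the idup request. */
static bool mock_dup_test(win_init_t *op)
{
    if (op->dup_polls_left > 0) {
        op->dup_polls_left--;
    }
    return op->dup_polls_left == 0;
}

/* Mock BTL query; the real one needs a fully created communicator. */
static int mock_query_btls(win_init_t *op)
{
    op->btl_queries++;
    return 0;
}

/* Start: only kicks off the duplication, never touches the new communicator. */
void win_init_start(win_init_t *op, int polls_until_dup_done)
{
    assert(op != NULL);
    op->state = WIN_INIT_DUP_PENDING;
    op->dup_polls_left = polls_until_dup_done;
    op->btl_queries = 0;
}

/* Progress: called repeatedly; runs the BTL query once, after the dup is done. */
win_init_state_t win_init_progress(win_init_t *op)
{
    assert(op != NULL);
    if (op->state == WIN_INIT_DUP_PENDING && mock_dup_test(op)) {
        op->state = (mock_query_btls(op) == 0) ? WIN_INIT_READY
                                               : WIN_INIT_FAILED;
    }
    return op->state;
}
```

The start function stays local because it defers everything that needs the duplicated communicator to the progress function, which is the property a "must be local" initialisation function would require.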
Goal: layering persistent collectives (with a nonblocking initialisation function) on top of OSC functionality.
Problem: all window creation functions are blocking.
Suggestion: the input and output buffers for persistent collective operations are supplied to the initialisation function by the user - this suggests that we will only need dynamic RMA windows, i.e. we would like MPI_WIN_ICREATE_DYNAMIC to exist.
Detail: The top-level MPI_Win_create_dynamic (the current blocking function) queries for the best OSC component, based on the given input parameters, and selects it. Selecting the OSC component calls ompi_base_select->osc_select->whatever_function_the_component_chooses. All components duplicate the input communicator; we know that step can be made nonblocking because MPI_Comm_idup exists.
Notes: