Closed lcebaman closed 4 years ago
My debug steps (in rough order) would be:
1. Check that every process in `comm` actually gets to this line of code.
2. Check the value of `send_first` on each process. If they all send first (or last) then maybe there's a deadlock?
3. Check the argument values supplied to the `ompi_comm_nextcid_nb` call at each process. See if they make sense?
4. Replace `MPI_Allreduce_init` with `MPI_Win_create_dynamic` in your test code. I assume that would "work": it would create a window, go back to the user code in the test program, and then break for obvious other reasons. If so, what argument values were supplied to the internal `ompi_comm_nextcid_nb` call at each process when it is called in the call-tree from `MPI_Win_create_dynamic`? If there is a difference, why?

Does that help?
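To make the comparison in the last two steps less error-prone, the logged values can be checked mechanically. The following is only a sketch: the `nextcid_args` record, its three fields, and the helper names are hypothetical and chosen for illustration. They are not taken from the Open MPI sources, so log whichever arguments your build actually exposes.

```c
#include <stdbool.h>

/* Hypothetical record of what one rank logged just before it called
 * ompi_comm_nextcid_nb. The choice of fields is an assumption made for
 * this sketch, not a description of the real argument list. */
typedef struct {
    int  parent_cid;  /* context id of the communicator being duplicated */
    bool send_first;  /* the send_first flag mentioned in the steps above */
    int  mode;        /* whatever mode value was passed to the call */
} nextcid_args;

/* Return true when two records hold identical values. */
static bool same_args(const nextcid_args *a, const nextcid_args *b)
{
    return a->parent_cid == b->parent_cid &&
           a->send_first == b->send_first &&
           a->mode == b->mode;
}

/* Compare two logs indexed by rank: `direct` from the test that calls
 * MPI_Win_create_dynamic itself, `nested` from the run that reaches it
 * through MPI_Allreduce_init. Returns the lowest rank whose arguments
 * differ, or -1 when every rank matches. */
int first_differing_rank(const nextcid_args *direct,
                         const nextcid_args *nested,
                         int nranks)
{
    for (int rank = 0; rank < nranks; ++rank) {
        if (!same_args(&direct[rank], &nested[rank])) {
            return rank;
        }
    }
    return -1;
}
```

A difference is not automatically a bug, but the first rank reported is a reasonable place to start asking "why?".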
I can't see any obvious problems calling `MPI_Win_create_dynamic` inside and outside our code; however, I suspect that some sort of communicator recursion is happening and causing the hang.
It is clear that the code hangs only inside `libpnbc_osc`, but so far I cannot see how to stop this behaviour.
I don't understand why the code jumps back to `ompi_coll_libpnbc_osc_iallreduce`.
Processes wait forever for window creation in `ompi_win_create_dynamic`. My debugging has taken me to:

`ompi_win_create_dynamic -> ompi_osc_base_select -> best_component -> osc_select -> ompi_osc_rdma_component_select -> ompi_comm_dup -> ompi_comm_nextcid_nb`

Any idea of what can be causing this? Note that we are not using the non-blocking version of `create_dynamic`, i.e. this routine has not been modified by us.
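
The call chain above ends in `ompi_comm_dup`. `MPI_Comm_dup` is a collective operation, so assuming the internal routine behaves the same way, it can only finish once every rank in the communicator has entered it. One way to look for a rank that took a different path is to have each rank log the collectives it enters, in order, and then compare the logs. The sketch below does that comparison; the function name, the integer ids, and the table layout are all made up for illustration and are not defined by Open MPI.

```c
/* Each rank logs an id for every collective it enters on the
 * communicator, in order. `traces` is a row-major nranks x len table,
 * so traces[rank * len + step] is what `rank` entered at `step`.
 * The ids are hypothetical labels picked by whoever adds the logging. */

/* Return the first step at which some rank entered a different
 * collective than rank 0 did, or -1 when all traces agree. */
int first_divergent_step(const int *traces, int nranks, int len)
{
    for (int step = 0; step < len; ++step) {
        int expected = traces[step];  /* rank 0's entry for this step */
        for (int rank = 1; rank < nranks; ++rank) {
            if (traces[rank * len + step] != expected) {
                return step;
            }
        }
    }
    return -1;
}
```

If the traces diverge at the step where some ranks enter the communicator duplication and others are still inside `libpnbc_osc`, that would be consistent with the hang described here, though it would not by itself prove the cause.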