EPiGRAM-HS / ompi

Open MPI main development repository
https://www.open-mpi.org

ompi_win_create_dynamic hang #17

Closed lcebaman closed 4 years ago

lcebaman commented 4 years ago

Processes hang forever waiting for ompi_win_create_dynamic to complete. My debugging has led me to this call chain:

ompi_win_create_dynamic -> ompi_osc_base_select -> best_component -> osc_select -> ompi_osc_rdma_component_select -> ompi_comm_dup -> ompi_comm_nextcid_nb

rc = ompi_comm_nextcid_nb (newcomm, comm, bridgecomm, arg0, arg1, send_first, mode, &req);
if (OMPI_SUCCESS != rc) {
    return rc;
}

ompi_request_wait_completion (req);

Any idea what could be causing this? Note that we are not using the non-blocking version of create_dynamic, i.e. we have not modified this routine.

dholmes-epcc-ed-ac-uk commented 4 years ago

My debug steps (in rough order) would be:

  1. Make sure all processes in comm actually get to this line of code.
  2. What is the value of send_first on each process? If they all send first (or last) then maybe there's a deadlock?
  3. Print out all the other argument values for the ompi_comm_nextcid_nb call at each process. Do they make sense?
  4. Replace the top-level (user-level) function, i.e. replace MPI_Allreduce_init with MPI_Win_create_dynamic in your test code. I assume that would "work": it would create a window, return to the user code in the test program, and then break for other, obvious reasons. If so, what argument values were supplied to the internal ompi_comm_nextcid_nb call at each process when it is reached through the call tree from MPI_Win_create_dynamic? If they differ from the hanging case, why?

Does that help?
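
Steps 1 to 3 could be covered by one log line per process, emitted just before the ompi_comm_nextcid_nb call. The following is a minimal sketch, not Open MPI code: `format_nextcid_args` and `log_nextcid_args` are hypothetical helper names, and the sketch assumes that arg0 and arg1 are opaque pointers for which only "NULL or not" is worth printing. The rank would have to be obtained from the communicator at the call site.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not part of Open MPI): formats one line describing
 * the arguments a process is about to pass to ompi_comm_nextcid_nb.
 * Returns the value of snprintf, i.e. the length of the formatted line. */
int format_nextcid_args(char *buf, size_t len, int rank,
                        int send_first, int mode,
                        const void *arg0, const void *arg1)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len,
                    "[rank %d] nextcid_nb: send_first=%d mode=%d arg0=%s arg1=%s",
                    rank, send_first, mode,
                    arg0 ? "set" : "NULL",
                    arg1 ? "set" : "NULL");
}

/* Writes the line to stderr and flushes immediately, so the output is not
 * lost in a buffer if the process hangs right afterwards. */
void log_nextcid_args(int rank, int send_first, int mode,
                      const void *arg0, const void *arg1)
{
    char line[160];
    format_nextcid_args(line, sizeof line, rank, send_first, mode, arg0, arg1);
    fprintf(stderr, "%s\n", line);
    fflush(stderr);
}
```

If a rank never prints its line, it never reached the call (step 1). If every rank prints the same send_first value, that would fit the deadlock suspected in step 2.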

lcebaman commented 4 years ago

I can't see any obvious problems when calling MPI_Win_create_dynamic, either inside or outside our code. However, I suspect that some sort of recursion on the communicator is happening and causing the hang. It is clear that the code hangs only inside libpnbc_osc, but so far I cannot see how to stop this behaviour.
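
One way to confirm or rule out the recursion suspicion would be a depth counter around the suspect entry point. This is a minimal sketch under stated assumptions: `reentry_enter` and `reentry_leave` are hypothetical names, not an Open MPI facility, and the plain static counter assumes a single thread per process.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical per-process depth counter; assumes single-threaded use. */
static int call_depth = 0;

/* Call on entry to the suspect function. Reports to stderr if the function
 * is already active, and returns the depth observed before this entry. */
int reentry_enter(const char *func)
{
    assert(func != NULL);
    if (call_depth > 0) {
        fprintf(stderr, "re-entered %s at depth %d\n", func, call_depth);
        fflush(stderr);
    }
    return call_depth++;
}

/* Call on every exit path of the suspect function. */
void reentry_leave(void)
{
    assert(call_depth > 0);
    call_depth--;
}
```

Placing `reentry_enter(__func__)` at the top of the libpnbc_osc routine and `reentry_leave()` before each return would print a message whenever the routine is entered again before its previous invocation has finished.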

lcebaman commented 4 years ago

I don't understand why the code jumps back to ompi_coll_libpnbc_osc_iallreduce recur