EPiGRAM-HS / ompi

Open MPI main development repository
https://www.open-mpi.org

rdma_find_region_containing #18

Open lcebaman opened 4 years ago

lcebaman commented 4 years ago

Thank you for taking the time to submit an issue!

Background information

When calling (pnbc_osc.c):

res = handle->win->w_osc_module->osc_get(buf1, trygetargs.origin_count,
          trygetargs.origin_datatype,
          trygetargs.target, 0, trygetargs.target_count,
          trygetargs.target_datatype, handle->win);

MPI cannot find a region (osc_rdma_dynamic.c)

 *region = ompi_osc_rdma_find_region_containing (regions, 0, region_count - 1, (intptr_t) base, bound, module->region_size, NULL);

Details of the problem

Error calling MPI_Get inside the Allreduce algorithm.

dholmes-epcc-ed-ac-uk commented 4 years ago

That message suggests MPI cannot find a region in the target window with offset==0 and size == (extent of target_datatype) * (target_count).

  1. Is the offset correct? Has any memory been attached by the target process? Make sure the target process calls MPI_WIN_ATTACH with some allocated buffer space.

  2. Is the size correct? How big is the piece of memory attached to the window by the target process? Print out the size used in the MPI_WIN_ATTACH call and check it against the size needed (use MPI_Type_get_extent or similar to find the extent of target_datatype for the calculation of the needed size); a sketch of this check is below.

If the target process attaches memory of a suitable size, then we need a deeper investigation!
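For reference, a minimal sketch of that size check using the public MPI API (attached_size, target_count and target_datatype are placeholders here, not names from the actual code):

/* Compare the space the Get needs on the target with the space the
 * target actually attached to the dynamic window. */
MPI_Aint lb, extent, needed;
MPI_Type_get_extent(target_datatype, &lb, &extent);
needed = extent * (MPI_Aint) target_count;

if (needed > attached_size) {
    /* attached_size is whatever was passed to MPI_WIN_ATTACH on the target */
    fprintf(stderr, "attached %ld bytes, but the Get needs %ld bytes\n",
            (long) attached_size, (long) needed);
}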

lcebaman commented 4 years ago

It turns out that handle->win->w_osc_module->osc_get(buf1, trygetargs.origin_count, trygetargs.origin_datatype, trygetargs.target, 0, trygetargs.target_count, trygetargs.target_datatype, handle->win) passes a hard-coded displacement of 0. This is wrong, as a real displacement is required.

dholmes-epcc-ed-ac-uk commented 4 years ago

Ah, we're using a dynamic window, right? That means the "base" address is MPI_BOTTOM (typically address 0, in kernel space!) and all "displacements" are actually memory addresses - that is, displacements relative to MPI_BOTTOM.

The internal computation will be something like: address_of_region = base_of_window + disp_given_by_user

For a non-dynamic window the "base" for the window is the memory address of the first allocated byte in the window, so "disp" is really a displacement within the window, i.e. disp=0 means "the first byte in the window".

For a dynamic window the "base" for the window is always 0, so a "disp" of 0 is always well outside the window region.
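In terms of the public API, the difference looks roughly like this (a sketch, not the pnbc_osc.c code; remote_base is a placeholder for the address the target obtained with MPI_Get_address on its attached buffer and sent to this process):

/* Dynamic window: the target displacement is an absolute address,
 * i.e. a displacement relative to MPI_BOTTOM. */
MPI_Aint remote_base;   /* address of the target's attached buffer */

MPI_Get(buf1, origin_count, origin_datatype,
        target_rank,
        remote_base,    /* not 0: a disp of 0 would point at MPI_BOTTOM itself */
        target_count, target_datatype, win);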

--

Fix: you will need to know the remote address for the "base" of the window, i.e. the memory address of the chunk of memory that the remote process gave to MPI_WIN_ATTACH.

You could do an MPI_ALLGATHER to get all of the addresses of all chunks from all processes, but that is massively overkill. You only need the addresses from the processes that this process will MPI_GET data from. For our tree algorithm that means each process only needs at most 3 remote addresses (from the 2 children and the 1 parent).

Obtaining those base addresses should be done as part of the initialisation of the persistent collective operation. I would do it via point-to-point, I think. Use nonblocking point-to-point so the sends/recvs can be added to the schedule. Add those steps to the schedule before the restart point in the schedule, so they only get done once when the operation initialises, not every time the operation is (re)started.
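A sketch of that neighbour-only exchange using plain nonblocking point-to-point (tree_parent, tree_children, num_children and TAG_BASE are hypothetical names; in the real code the sends/recvs would be added to the schedule rather than waited on immediately):

/* Each process sends the address of its attached buffer to its parent
 * and children, and receives theirs in return. */
MPI_Aint my_base, parent_base, child_base[2];
MPI_Request reqs[6];
int nreq = 0;

MPI_Get_address(tmpsbuf, &my_base);

if (tree_parent != MPI_PROC_NULL) {
    MPI_Isend(&my_base, 1, MPI_AINT, tree_parent, TAG_BASE, comm, &reqs[nreq++]);
    MPI_Irecv(&parent_base, 1, MPI_AINT, tree_parent, TAG_BASE, comm, &reqs[nreq++]);
}
for (int c = 0; c < num_children; ++c) {
    MPI_Isend(&my_base, 1, MPI_AINT, tree_children[c], TAG_BASE, comm, &reqs[nreq++]);
    MPI_Irecv(&child_base[c], 1, MPI_AINT, tree_children[c], TAG_BASE, comm, &reqs[nreq++]);
}
MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);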

lcebaman commented 4 years ago

Currently, this is what we do:

 /* Create dynamic window */
 res = ompi_win_create_dynamic(&info->super, comm, &win);
 ...
 /* Attach the window to the temporary send buffer */
 res = win->w_osc_module->osc_win_attach(win, tmpsbuf, count*size);
 ...
 /* Get the absolute address of the attached buffer */
 MPI_Get_address(tmpsbuf, &disp);
 ...
 /* Create an array holding one displacement per rank in the communicator */
 disp_a = (MPI_Aint*) malloc(ompi_comm_size(comm) * sizeof(MPI_Aint));
 ...
 /* Gather the displacements from all ranks (one MPI_Aint from each) */
 res = comm->c_coll->coll_allgather(&disp, 1, MPI_AINT, disp_a, 1, MPI_AINT, comm,
                                    comm->c_coll->coll_allgather_module);

I think this is right, but disp_a is never used later on!

dholmes-epcc-ed-ac-uk commented 4 years ago

The allgather is overkill: you don't need all those values, and the memory required to store them scales linearly with the number of processes, rather than being fixed by the number of direct neighbours in the protocol pattern (i.e. 3 in our case). Also, this looks like a blocking allgather, which should ideally (eventually) be replaced with something nonblocking.

However, yes - this disp_a will contain the values you need to fix the "region not found" error.

lcebaman commented 4 years ago

I agree, we do not need all the displacements, but which ones do we need?

dholmes-epcc-ed-ac-uk commented 4 years ago

The ones for the parent process and the two children processes. Deal with that later :)

First fix the region not found by using disp_a[trygetargs.target_rank] (hopefully trygetargs.target_rank is already set correctly somewhere early on, e.g. during initialisation of the operation?)
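Concretely, the fix to the call quoted at the top of the issue would be roughly as follows (a sketch; whether the rank field is trygetargs.target, as in the original call, or trygetargs.target_rank depends on the actual struct):

res = handle->win->w_osc_module->osc_get(buf1, trygetargs.origin_count,
          trygetargs.origin_datatype,
          trygetargs.target,
          disp_a[trygetargs.target],   /* remote base address, not 0 */
          trygetargs.target_count,
          trygetargs.target_datatype, handle->win);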