ecmwf / fckit

A Fortran toolkit for interoperating Fortran with C/C++
https://confluence.ecmwf.int/display/fckit
Apache License 2.0
29 stars 13 forks source link

Reuse of comms not working #8

Closed mmiesch closed 4 years ago

mmiesch commented 5 years ago

[not related to #7 - from @danholdaway]

Line 1106 of fckit_mpi.py is commented out and this seems to be causing an issue with reusing comms. The following pseudo code does not work for me:

type(fckit_mpi_comm) :: f_comm_new

call mpi_comm_split(f_comm%communicator(), geom%ntile, f_comm%rank(), commnew, ierr)
f_comm_new = fckit_mpi_comm(commnew)

call f_comm_new%delete()
call MPI_Comm_free(commtile, ierr)

call mpi_comm_split(f_comm%communicator(), geom%ntile, f_comm%rank(), commnew, ierr)
f_comm_new = fckit_mpi_comm(commnew)
print*, f_comm_new%rank()

This is because the second f_comm_new = fckit_mpi_comm(commnew) does not return a valid pointer. My guess is this because nothing really happens in %delete so the second fckit_mpi_comm(commnew) call does not set a pointer to the comm created in the second mpi_comm_split since it thinks it already exists but call MPI_Comm_free(commtile, ierr) removed the place it needs to point to. Interestingly only GCC compilers pick up on this issue.

If I uncomment line 1106 then I get an immediate seg fault in applications

wdeconinck commented 5 years ago

@mmiesch, @danholdaway In your example, is commtile equal to commnew ? I assume it is. Let's say I run following working code based on your example with 4 MPI tasks:

 1:  use fckit_module
 2:  implicit none
 3:  #include "mpif.h"
 4:  type(fckit_mpi_comm) :: f_comm, f_comm_new
 5:  integer :: commnew, ierr, ntile
 6:
 7:  f_comm = fckit_mpi_comm("world")
 8:  ntile = modulo( f_comm%rank() , 2 )
 9:
10:  call mpi_comm_split(f_comm%communicator(), ntile, f_comm%rank(), commnew, ierr)
11:  print*, "A ", f_comm%rank(), " commnew = ",commnew 
12:
13:  f_comm_new = fckit_mpi_comm(commnew)
14:
15:  call f_comm_new%delete()
16:  call MPI_Comm_free(commnew, ierr)
17:
18:  call mpi_comm_split(f_comm%communicator(), ntile, f_comm%rank(), commnew, ierr)
10:  print*, "B ", f_comm%rank(), " commnew = ",commnew
20:
21:  f_comm_new = fckit_mpi_comm(commnew)
22:  print*, f_comm_new%rank()

What happens is that in line 10, the new communicator is created. For me line 11 then prints for each rank

A <rank> commnew = 3

As you correctly state, line 15 (the delete) will do nothing and we can safely remove this line. Line 16 will free the communicator, but fckit/eckit will have no knowledge of this deletion. So whereas commnew has been set to MPI_COMM_NULL in this code, fckit/eckit still think there exists a communicator with value 3. In line 18 the communicator is recreated, but if the value commnew is again 3 (as it is in my case when I run it), f_comm_new in line 21 won't actually be the newly created one but the communicator that existed before. It is dependent likely on the MPI implementation that this reuses the same memory or not. It is definitely undefined behaviour.

To really fix this, we need to implement and expose the MPI_Comm_free method somehow. This requires modifications to eckit as well. I can follow this up.

Side note, in eckit we already have an implementation of the split functionality, which is just not exposed to the Fortran in fckit. The idea would be not to need mpi itself.

danholdaway commented 5 years ago

Thanks for the follow up @wdeconinck. Forgot to change one commtile so indeed they should all be commnew. We're building a list of things that exist in eckit mpi that we want to add to fckit mpi so will add split to that list and would be happy to do free once you have in eckit.

wdeconinck commented 5 years ago

This has been implemented in fckit version 0.6.3, in combination with eckit 1.0.0 (or latest develop)