Mantevo / miniFE

MiniFE Finite Element Mini-Application
http://www.mantevo.org
GNU Lesser General Public License v3.0

MPI_Irecv and MPI_Send use the same buffer at the same time #18

Open mawi2017 opened 1 year ago

mawi2017 commented 1 year ago

Hi,

I ran miniFE's ref version with Intel MPI under the message checker from ITAC (Intel Trace Analyzer and Collector). The message checker reported LOCAL:MEMORY:OVERLAP and also LOCAL:MEMORY:ILLEGAL_MODIFICATION issues in ref/src/make_local_matrix.hpp, where the same buffers are used for sending and receiving at the same time. From what I saw, all other miniFE versions should also be affected if they execute the corresponding code.

The affected code in ref/src/make_local_matrix.hpp starts at line 257:

  std::vector<MPI_Request> request(num_send_neighbors);
  for(int i=0; i<num_send_neighbors; ++i) {
    MPI_Irecv(&tmp_buffer[i], 1, mpi_dtype, MPI_ANY_SOURCE, MPI_MY_TAG,
              MPI_COMM_WORLD, &request[i]);
  }

  // send messages

  for(int i=0; i<num_recv_neighbors; ++i) {
    MPI_Send(&tmp_buffer[i], 1, mpi_dtype, recv_list[i], MPI_MY_TAG,
             MPI_COMM_WORLD);
  }

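If both loops have a trip count > 0, then every element tmp_buffer[i] with i below both num_send_neighbors and num_recv_neighbors is passed to MPI_Send while the MPI_Irecv posted on that same element is still pending, so the buffer is used for sending and receiving at the same time.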
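To make the overlap concrete without needing an MPI installation, here is a small standalone model of the two loops. It is only a sketch: `count_overlaps`, `overlaps_shared`, and `overlaps_separate` are hypothetical names that exist neither in miniFE nor in ITAC, the pointer lists merely stand in for the MPI_Irecv and MPI_Send calls, and the separate-buffer variant is one possible remedy rather than a reviewed fix.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of miniFE or ITAC: counts the send buffers
// whose address equals that of a receive buffer that is still posted
// (handed to MPI_Irecv, not yet completed by a wait).
std::size_t count_overlaps(const std::vector<const int*>& posted_recvs,
                           const std::vector<const int*>& sends) {
  std::size_t overlaps = 0;
  for (const int* send_buf : sends) {
    if (std::find(posted_recvs.begin(), posted_recvs.end(), send_buf) !=
        posted_recvs.end()) {
      ++overlaps;
    }
  }
  return overlaps;
}

// Pattern from the issue: both loops use element i of one shared array.
std::size_t overlaps_shared(std::size_t num_send_neighbors,
                            std::size_t num_recv_neighbors) {
  std::vector<int> tmp_buffer(
      std::max(num_send_neighbors, num_recv_neighbors), 0);
  std::vector<const int*> posted_recvs, sends;
  for (std::size_t i = 0; i < num_send_neighbors; ++i) {
    posted_recvs.push_back(tmp_buffer.data() + i);  // stands in for MPI_Irecv
  }
  for (std::size_t i = 0; i < num_recv_neighbors; ++i) {
    sends.push_back(tmp_buffer.data() + i);  // stands in for MPI_Send
  }
  return count_overlaps(posted_recvs, sends);
}

// Possible remedy (an assumption, not the maintainers' fix): one array per
// direction, so no address is shared between a send and a posted receive.
std::size_t overlaps_separate(std::size_t num_send_neighbors,
                              std::size_t num_recv_neighbors) {
  std::vector<int> recv_buffer(num_send_neighbors, 0);
  std::vector<int> send_buffer(num_recv_neighbors, 0);
  std::vector<const int*> posted_recvs, sends;
  for (std::size_t i = 0; i < num_send_neighbors; ++i) {
    posted_recvs.push_back(recv_buffer.data() + i);
  }
  for (std::size_t i = 0; i < num_recv_neighbors; ++i) {
    sends.push_back(send_buffer.data() + i);
  }
  return count_overlaps(posted_recvs, sends);
}
```

In this model, `overlaps_shared(3, 2)` yields 2 (the smaller of the two trip counts), while `overlaps_separate(3, 2)` yields 0. In the real code the corresponding change would be to pass MPI_Send a buffer that is distinct from the one given to MPI_Irecv.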

The complete output and commands for reproducing:

$ git clone https://github.com/Mantevo/miniFE.git
$ cd miniFE/ref/src
$ # loaded module for intelmpi and itac
$ make
$ mpiexec -check-mpi -n 2 ./miniFE.x
...
      creating/filling mesh...0.000828028s, total time: 0.000828981
generating matrix structure...0.00868297s, total time: 0.00951195
         assembling FE data...0.00850797s, total time: 0.0180199
      imposing Dirichlet BC...0.00221992s, total time: 0.0202398
      imposing Dirichlet BC...0.00244904s, total time: 0.0226889
making matrix indices local...
[0] WARNING: LOCAL:MEMORY:OVERLAP: warning
[0] WARNING:    New send buffer overlaps with currently active receive buffer at address 0x17f0730.
[0] WARNING:    Control over active buffer was transferred to MPI at:
[0] WARNING:       MPI_Irecv(*buf=0x17f0730, count=1, datatype=MPI_INT, source=MPI_ANY_SOURCE, tag=99, comm=MPI_COMM_WORLD, *request=0x1c04470)
[0] WARNING:       _ZN6miniFE17make_local_matrixINS_9CSRMatrixIdiiEEEEvRT_ (/home/xyz/projects/miniFE/ref/src/./make_local_matrix.hpp:259)
[0] WARNING:       _ZN6miniFE6driverIdiiEEiRK3BoxRS1_RNS_10ParametersER8YAML_Doc (/home/xyz/projects/miniFE/ref/src/./driver.hpp:228)
[0] WARNING:       main (/home/xyz/projects/miniFE/ref/src/main.cpp:154)
[0] WARNING:       __libc_start_main (/usr/lib64/libc-2.28.so)
[0] WARNING:       _start (/home/xyz/projects/miniFE/ref/src/miniFE.x)
[0] WARNING:    Control over new buffer is about to be transferred to MPI at:
[0] WARNING:       MPI_Send(*buf=0x17f0730, count=1, datatype=MPI_INT, dest=1, tag=99, comm=MPI_COMM_WORLD)
[0] WARNING:       _ZN6miniFE17make_local_matrixINS_9CSRMatrixIdiiEEEEvRT_ (/home/xyz/projects/miniFE/ref/src/./make_local_matrix.hpp:266)
[0] WARNING:       _ZN6miniFE6driverIdiiEEiRK3BoxRS1_RNS_10ParametersER8YAML_Doc (/home/xyz/projects/miniFE/ref/src/./driver.hpp:228)
[0] WARNING:       main (/home/xyz/projects/miniFE/ref/src/main.cpp:154)
[0] WARNING:       __libc_start_main (/usr/lib64/libc-2.28.so)
[0] WARNING:       _start (/home/xyz/projects/miniFE/ref/src/miniFE.x)

[1] WARNING: LOCAL:MEMORY:OVERLAP: warning
[1] WARNING:    New send buffer overlaps with currently active receive buffer at address 0x11d48a0.
[1] WARNING:    Control over active buffer was transferred to MPI at:
[1] WARNING:       MPI_Irecv(*buf=0x11d48a0, count=1, datatype=MPI_INT, source=MPI_ANY_SOURCE, tag=99, comm=MPI_COMM_WORLD, *request=0x1219dc0)
[1] WARNING:       _ZN6miniFE17make_local_matrixINS_9CSRMatrixIdiiEEEEvRT_ (/home/xyz/projects/miniFE/ref/src/./make_local_matrix.hpp:259)
[1] WARNING:       _ZN6miniFE6driverIdiiEEiRK3BoxRS1_RNS_10ParametersER8YAML_Doc (/home/xyz/projects/miniFE/ref/src/./driver.hpp:228)
[1] WARNING:       main (/home/xyz/projects/miniFE/ref/src/main.cpp:154)
[1] WARNING:       __libc_start_main (/usr/lib64/libc-2.28.so)
[1] WARNING:       _start (/home/xyz/projects/miniFE/ref/src/miniFE.x)
[1] WARNING:    Control over new buffer is about to be transferred to MPI at:
[1] WARNING:       MPI_Send(*buf=0x11d48a0, count=1, datatype=MPI_INT, dest=0, tag=99, comm=MPI_COMM_WORLD)
[1] WARNING:       _ZN6miniFE17make_local_matrixINS_9CSRMatrixIdiiEEEEvRT_ (/home/xyz/projects/miniFE/ref/src/./make_local_matrix.hpp:266)
[1] WARNING:       _ZN6miniFE6driverIdiiEEiRK3BoxRS1_RNS_10ParametersER8YAML_Doc (/home/xyz/projects/miniFE/ref/src/./driver.hpp:228)
[1] WARNING:       main (/home/xyz/projects/miniFE/ref/src/main.cpp:154)
[1] WARNING:       __libc_start_main (/usr/lib64/libc-2.28.so)
[1] WARNING:       _start (/home/xyz/projects/miniFE/ref/src/miniFE.x)
1.09176s, total time: 1.11445
Starting CG solver ...
Initial Residual = 11.0289
Iteration = 20   Residual = 1.23424e-08
Final Resid Norm: 2.06977e-16

[0] INFO: LOCAL:MEMORY:OVERLAP: found 2 times (0 errors + 2 warnings), 0 reports were suppressed
[0] INFO: Found 2 problems (0 errors + 2 warnings), 0 reports were suppressed.

If I use more than 2 processes, e.g. 72, then some OVERLAP warnings turn into ILLEGAL_MODIFICATION errors:

[54] ERROR: LOCAL:MEMORY:ILLEGAL_MODIFICATION: error
[54] ERROR:    Read-only buffer was modified while owned by MPI.
[54] ERROR:    Control over buffer was transferred to MPI at:
[54] ERROR:       MPI_Send(*buf=0x9693c4, count=1, datatype=MPI_INT, dest=22, tag=99, comm=MPI_COMM_WORLD)
[54] ERROR:       _ZN6miniFE17make_local_matrixINS_9CSRMatrixIdiiEEEEvRT_ (/home/xyz/projects/miniFE/ref/src/./make_local_matrix.hpp:266)
[54] ERROR:       _ZN6miniFE6driverIdiiEEiRK3BoxRS1_RNS_10ParametersER8YAML_Doc (/home/xyz/projects/miniFE/ref/src/./driver.hpp:228)
[54] ERROR:       main (/home/xyz/projects/miniFE/ref/src/main.cpp:154)
[54] ERROR:       __libc_start_main (/usr/lib64/libc-2.28.so)
[54] ERROR:       _start (/home/xyz/projects/miniFE/ref/src/miniFE.x)
[54] ERROR:    Modified buffer detected at:
[54] ERROR:       MPI_Send(*buf=0x9693c4, count=1, datatype=MPI_INT, dest=22, tag=99, comm=MPI_COMM_WORLD)
[54] ERROR:       _ZN6miniFE17make_local_matrixINS_9CSRMatrixIdiiEEEEvRT_ (/home/xyz/projects/miniFE/ref/src/./make_local_matrix.hpp:266)
[54] ERROR:       _ZN6miniFE6driverIdiiEEiRK3BoxRS1_RNS_10ParametersER8YAML_Doc (/home/xyz/projects/miniFE/ref/src/./driver.hpp:228)
[54] ERROR:       main (/home/xyz/projects/miniFE/ref/src/main.cpp:154)
[54] ERROR:       __libc_start_main (/usr/lib64/libc-2.28.so)
[54] ERROR:       _start (/home/xyz/projects/miniFE/ref/src/miniFE.x)

maherou commented 1 year ago

@mawi2017 Thank you for reporting this issue. If you have a proposed fix, please feel free to submit a pull request for review. We would appreciate your help with this.

Thank you.

Mike