ORNL-CEES / DataTransferKit

A library for multiphysics solution transfer. ARCHIVED
https://datatransferkit.readthedocs.io/en/dtk-3.0/
BSD 3-Clause "New" or "Revised" License

Moving least square operator problem... #580

Closed. mattbement closed this issue 9 months ago.

mattbement commented 3 years ago

For the simple test program below, I get an error in MPI_Waitall. If I switch to the nearest neighbor operator, everything works as expected.

--------------------------------------------------------------------------
[mtb20:88890] *** An error occurred in MPI_Waitall
[mtb20:88890] *** reported by process [343932929,1]
[mtb20:88890] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[mtb20:88890] *** MPI_ERR_TRUNCATE: message truncated
[mtb20:88890] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[mtb20:88890] ***    and potentially your MPI job)
[mtb20:88885] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[mtb20:88885] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

#include <cstdio>
#include <cstdlib>

#include <mpi.h>
#include <Kokkos_Core.hpp>
#include <DTK_NearestNeighborOperator.hpp>
#include <DTK_MovingLeastSquaresOperator.hpp>

using Space=Kokkos::HostSpace;

int main(int argc, char* argv[]) {

  MPI_Comm transfer_comm;

  MPI_Init(&argc, &argv); 
  Kokkos::ScopeGuard s(argc,argv);
  int nc = 27;

  MPI_Comm_dup(MPI_COMM_WORLD, &transfer_comm);
  if (transfer_comm == MPI_COMM_NULL) return 1;
  int this_rank;
  MPI_Comm_rank(transfer_comm,&this_rank);

  Kokkos::View<double*,Space> temp0;
  Kokkos::View<double*,Space> temp1;
  Kokkos::View<double**,Kokkos::LayoutRight,Space> coord0;
  Kokkos::View<double**,Kokkos::LayoutRight,Space> coord1;

  if (this_rank==0) {
    temp0 = Kokkos::View<double*,Space>("temp0",nc);
    coord0 = Kokkos::View<double**,Kokkos::LayoutRight,Space>("coord0",nc,3);
    //coord1 = Kokkos::View<double**,Kokkos::LayoutRight,Space>("coord1",0,3);
    for (int i = 0; i < nc; ++i) temp0(i) = 1.0;
    int cnum = 0;
    for (int i = 0; i < 3; ++i){
      for (int j = 0; j < 3; ++j) {
        for (int k = 0; k < 3; ++k) {
          coord0(cnum,0)=i*.1;
          coord0(cnum,1)=j*.1;
          coord0(cnum,2)=k*.1;
          cnum++;
        }
      }
    }
  }

  if (this_rank==1) {
    temp1 = Kokkos::View<double*,Space>("temp1",nc);
    coord1 = Kokkos::View<double**,Kokkos::LayoutRight,Space>("coord1",nc,3);

    //coord0 = Kokkos::View<double**,Kokkos::LayoutRight,Space>("coord0",0,3);
    int cnum = 0;
    for (int i = 0; i < 3; ++i){
      for (int j = 0; j < 3; ++j) {
        for (int k = 0; k < 3; ++k) {
          coord1(cnum,0)=i*.1;
          coord1(cnum,1)=j*.1;
          coord1(cnum,2)=k*.1;
          cnum++;
        }
      }
    }
  }

  //DataTransferKit::NearestNeighborOperator<Kokkos::HostSpace::device_type> myop(transfer_comm, coord0, coord1);
  DataTransferKit::MovingLeastSquaresOperator<Kokkos::HostSpace::device_type> myop(transfer_comm,coord0,coord1);

  myop.apply(temp0,temp1);

  if (this_rank==1) {
    printf("temp1(0) = %f\n", temp1(0));
  }

  MPI_Comm_free(&transfer_comm);
  MPI_Finalize();
  return 0;
}

sslattery commented 3 years ago

Can you try an MPI barrier on transfer_comm right before you free that communicator at the end of main? I want to make sure there's not some kind of barrier missing here.

sslattery commented 3 years ago

> Can you try an MPI barrier on transfer_comm right before you free that communicator at the end of main? I want to make sure there's not some kind of barrier missing here.

OK, maybe that's not it. I'll cross-reference this with our unit tests.

sslattery commented 3 years ago

@Rombur @masterleinad Matt's problem has sources only on rank 0 and targets only on rank 1. I don't think we have a good test for this one. Matt's test is pretty simple and should be easy to add to the unit tests.

mattbement commented 3 years ago

At Stuart's suggestion, if I uncomment the initialization of coord0 and coord1 to zero-length views, things seem to work fine. Note that the nearest neighbor operator worked fine without having to initialize these to zero length.
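
One plausible reading of why the zero-length initialization matters (an assumption, not something confirmed in this thread): a default-constructed Kokkos::View<double**> reports 0 for every runtime extent, including the spatial dimension, whereas a view allocated as (0, 3) has no points but still reports a dimension of 3. The sketch below uses a hypothetical `Points` struct as a stand-in for the Kokkos view so that it compiles without Kokkos or MPI. `Points`, `empty_points`, and `same_spatial_dim` are illustrative names and are not part of DTK or Kokkos.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Kokkos::View<double**, Kokkos::LayoutRight, Space>.
// As with a Kokkos::View, default construction leaves every runtime extent at 0.
struct Points {
    std::size_t n = 0;        // number of points (plays the role of extent(0))
    std::size_t dim = 0;      // spatial dimension (plays the role of extent(1))
    std::vector<double> data; // row-major storage, n * dim entries

    Points() = default;
    Points(std::size_t n_, std::size_t dim_)
        : n(n_), dim(dim_), data(n_ * dim_, 0.0) {}
};

// What a rank that owns no points would pass in the workaround:
// zero rows, but the same spatial dimension as every other rank.
inline Points empty_points(std::size_t dim) {
    assert(dim > 0);
    return Points(0, dim);
}

// Hypothetical precondition (not DTK's actual assertion): the source and
// target point sets must agree on the spatial dimension.
inline bool same_spatial_dim(const Points& a, const Points& b) {
    return a.dim == b.dim;
}
```

With these definitions, `same_spatial_dim(Points(27, 3), Points())` is false, because the default-constructed stand-in has a dimension of 0, while `same_spatial_dim(Points(27, 3), empty_points(3))` is true. That mirrors the difference between leaving coord1 uninitialized on rank 0 and allocating it with extents (0, 3).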

sslattery commented 3 years ago

@Rombur @masterleinad Should we support uninitialized views and assume them to be size 0?

masterleinad commented 3 years ago

This would have failed with DBC in place (and I think that should be enabled in Debug mode by default). In general, I would consider any use of uninitialized Kokkos::View objects a user error.

dalg24 commented 3 years ago

> This would have failed with DBC in place (and I think that should be enabled in Debug mode by default). In general, I would consider any use of uninitialized Kokkos::View objects a user error.

Would you please post a reference to the assertion that you expect would fail?

masterleinad commented 3 years ago

> Would you please post a reference to the assertion that you expect would fail?

https://github.com/ORNL-CEES/DataTransferKit/blob/d7a54fbfa64a6960467ce0bf556956f25c35fe58/packages/Meshfree/src/DTK_MovingLeastSquaresOperator_def.hpp#L38-L41

masterleinad commented 9 months ago

DataTransferKit is going to be archived.