dash-project / dash

DASH, the C++ Template Library for Distributed Data Structures with Support for Hierarchical Locality for HPC and Data-Driven Science
http://www.dash-project.org/

Invalid datatype error on HPE Cray #708

Open bertwesarg opened 4 years ago

bertwesarg commented 4 years ago

Got this error from a SPEC reporter:

MPI VERSION    : CRAY MPICH version 8.0.11.5 (ANL base 3.3)
MPI BUILD INFO : Wed May 20  2:29 2020 (git hash fc79972) (CH4)
MPICH ERROR [Rank 51] [job id 4345.7] [Sun Jun  7 10:15:11 2020] [unknown] [nid000371] - Abort(134904579) (rank 51 in comm 0): Fatal error in PMPI_Type_free: Invalid datatype, error stack:
PMPI_Type_free(153): MPI_Type_free(datatype_p=0x26addc) failed
PMPI_Type_free(90).: Invalid datatype

Any ideas where this comes from, or what I could request from the reporter?

devreal commented 4 years ago

It would be interesting to know whether this occurs during the run or at the end. Ideally, a stack trace would help.

I see two possible places:

1) The large transfer types (vector types used to transfer >2G data)
2) The strided and indexed types used to handle the halo transfers (which seems more likely here)
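
For context on the first candidate, here is a minimal sketch of the arithmetic behind such large transfers, assuming a transfer is split into full chunks of `MAX_CONTIG_ELEMENTS` elements plus a remainder. `MAX_CONTIG_ELEMENTS` appears in the DART headers; `chunk_plan_t` and `plan_chunks` are hypothetical names used only for illustration and are not DART API.

```c
#include <limits.h>
#include <stdint.h>

/* Same definition as in dart_communication_priv.h. */
#define MAX_CONTIG_ELEMENTS (INT_MAX)

/* Hypothetical helper type, not part of DART. */
typedef struct {
  uint64_t nchunks;    /* number of full MAX_CONTIG_ELEMENTS-sized chunks */
  uint64_t remainder;  /* elements left over after the full chunks */
} chunk_plan_t;

/* Split a transfer of nelem elements into full chunks plus a remainder. */
static chunk_plan_t plan_chunks(uint64_t nelem)
{
  chunk_plan_t plan;
  plan.nchunks   = nelem / (uint64_t)MAX_CONTIG_ELEMENTS;
  plan.remainder = nelem % (uint64_t)MAX_CONTIG_ELEMENTS;
  return plan;
}
```

Under this assumption, only transfers of at least `INT_MAX` elements would need the derived "max" type at all; smaller transfers fall entirely into the remainder.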

dhinf commented 4 years ago

1. Unlikely.
2. Not possible, because strided and indexed types aren't used for halo transfers anymore.

bertwesarg commented 4 years ago

I see two instances of invalidating the contiguous.max_type:

here MPI_DATATYPE_NULL is used.

But here and here DART_MPI_TYPE_UNDEFINED is used.

This seems erroneous, as max_type is of type MPI_Datatype. Though, I do not think that this is the problem here.
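
To illustrate why mixing two "unset" markers can matter, here is a minimal sketch that does not use MPI at all. It assumes a handle modelled as a plain `int`; `handle_t`, `HANDLE_NULL`, `HANDLE_UNDEFINED`, and `would_free` are hypothetical stand-ins, and the numeric values are arbitrary rather than those of any MPI implementation.

```c
#include <stdbool.h>

/* Stand-in for an opaque handle type such as MPI_Datatype
 * (assumption: modelled as a plain int for illustration only). */
typedef int handle_t;

/* Two distinct "unset" markers; the numeric values are arbitrary. */
#define HANDLE_NULL      ((handle_t)0)
#define HANDLE_UNDEFINED ((handle_t)-1)

/* Returns true if a guard of the form `if (h != sentinel) free(h)`
 * would go on to free the handle. */
static bool would_free(handle_t h, handle_t sentinel)
{
  return h != sentinel;
}
```

If one code path resets a handle to `HANDLE_NULL` while another guards its free with `HANDLE_UNDEFINED`, the guard lets the null handle through: `would_free(HANDLE_NULL, HANDLE_UNDEFINED)` is true, whereas `would_free(HANDLE_NULL, HANDLE_NULL)` is false.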

bertwesarg commented 4 years ago

I do not think this will fix it, but here is some cleanup and a check before calling MPI_Type_free:

diff --git i/dart-impl/mpi/include/dash/dart/mpi/dart_communication_priv.h w/dart-impl/mpi/include/dash/dart/mpi/dart_communication_priv.h
index d146865c9..1a66ea3f4 100644
--- i/dart-impl/mpi/include/dash/dart/mpi/dart_communication_priv.h
+++ w/dart-impl/mpi/include/dash/dart/mpi/dart_communication_priv.h
@@ -75,8 +75,6 @@ dart_ret_t dart__mpi__op_fini();
  */
 #define MAX_CONTIG_ELEMENTS (INT_MAX)

-#define DART_MPI_TYPE_UNDEFINED (MPI_Datatype)MPI_UNDEFINED
-
 typedef enum {
   DART_KIND_BASIC = 0,
   DART_KIND_STRIDED,
@@ -190,7 +188,7 @@ MPI_Datatype dart__mpi__datatype_maxtype(dart_datatype_t dart_type) {
   dart_datatype_struct_t *dts = dart__mpi__datatype_struct(dart_type);
   MPI_Datatype res;
   if (dart__mpi__datatype_iscontiguous(dart_type)) {
-    if (dts->contiguous.max_type == DART_MPI_TYPE_UNDEFINED) {
+    if (dts->contiguous.max_type == MPI_DATATYPE_NULL) {
       dts->contiguous.max_type = dart__mpi__datatype_create_max_datatype(
                                   dts->contiguous.mpi_type);
     }
diff --git i/dart-impl/mpi/src/dart_communication.c w/dart-impl/mpi/src/dart_communication.c
index b4da40d73..7f3330360 100644
--- i/dart-impl/mpi/src/dart_communication.c
+++ w/dart-impl/mpi/src/dart_communication.c
@@ -391,11 +391,11 @@ dart__mpi__put_basic(
     CHECK_MPI_RET(
         dart__mpi__put(src_ptr,
           nchunks,
-          dart__mpi__datatype_struct(dtype)->contiguous.max_type,
+          dart__mpi__datatype_maxtype(dtype),
           team_unit_id.id,
           offset,
           nchunks,
-          dart__mpi__datatype_struct(dtype)->contiguous.max_type,
+          dart__mpi__datatype_maxtype(dtype),
           win,
           reqs, num_reqs),
         "MPI_Put");
diff --git i/dart-impl/mpi/src/dart_mpi_types.c w/dart-impl/mpi/src/dart_mpi_types.c
index e90c380f3..2ab8a21c9 100644
--- i/dart-impl/mpi/src/dart_mpi_types.c
+++ w/dart-impl/mpi/src/dart_mpi_types.c
@@ -312,7 +312,7 @@ dart_type_create_custom(
   new_struct->contiguous.size     = num_bytes;
   new_struct->contiguous.mpi_type = new_mpi_dtype;
   // max_type will be created on-demand for custom types
-  new_struct->contiguous.max_type = DART_MPI_TYPE_UNDEFINED;
+  new_struct->contiguous.max_type = MPI_DATATYPE_NULL;

   *newtype = (dart_datatype_t)new_struct;
   DART_LOG_TRACE("Created new custom data type %p with %zu bytes`",
@@ -343,7 +343,7 @@ dart_type_destroy(dart_datatype_t *dart_type_ptr)
     MPI_Type_free(&dart_type->indexed.mpi_type);
   } else if (dart_type->kind == DART_KIND_CUSTOM) {
     MPI_Type_free(&dart_type->contiguous.mpi_type);
-    if (dart_type->contiguous.max_type != DART_MPI_TYPE_UNDEFINED) {
+    if (dart_type->contiguous.max_type != MPI_DATATYPE_NULL) {
       MPI_Type_free(&dart_type->contiguous.max_type);
     }
   }
@@ -357,7 +357,8 @@ dart_type_destroy(dart_datatype_t *dart_type_ptr)
 static void destroy_basic_type(dart_datatype_t dart_type_id)
 {
   dart_datatype_struct_t *dart_type = dart__mpi__datatype_struct(dart_type_id);
-  MPI_Type_free(&dart_type->contiguous.max_type);
+  if (dart_type->contiguous.max_type != MPI_DATATYPE_NULL)
+    MPI_Type_free(&dart_type->contiguous.max_type);
   dart_type->contiguous.max_type = MPI_DATATYPE_NULL;
 }
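
The guard added in `destroy_basic_type` follows a common pattern: release a handle only if it is set, then reset it so that a second call is harmless. Below is a minimal sketch of that pattern without MPI. `handle_t`, `HANDLE_NULL`, `release_handle`, `release_calls`, and `destroy_guarded` are hypothetical names; `release_handle` merely stands in for a routine such as `MPI_Type_free`, which resets the handle it frees to `MPI_DATATYPE_NULL`.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for an opaque handle type (assumption: a plain int). */
typedef int handle_t;
#define HANDLE_NULL ((handle_t)0)

/* Counts how many release requests actually reached the "library". */
static int release_calls = 0;

/* Hypothetical stand-in for a release routine: records the call and
 * resets the handle to the null marker. */
static void release_handle(handle_t *h)
{
  ++release_calls;
  *h = HANDLE_NULL;
}

/* Release the handle only if it is set, then leave it in the null state. */
static void destroy_guarded(handle_t *h)
{
  assert(h != NULL);
  if (*h != HANDLE_NULL) {
    release_handle(h);
  }
  *h = HANDLE_NULL;
}
```

With this shape, calling `destroy_guarded` twice on the same handle issues only one release request, and a handle that was never set is never passed to the release routine.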

devreal commented 4 years ago

Can you please post a PR for this? :+1:

bertwesarg commented 4 years ago

Sure, do you think it fixes anything related to my problem here?

devreal commented 4 years ago

Hard to say. The confusion between MPI_DATATYPE_NULL and DART_MPI_TYPE_UNDEFINED may be a cause, but I cannot tell for sure.

devreal commented 4 years ago

Actually, the last two lines of the patch might be the culprit (https://github.com/dash-project/dash/pull/709/files#diff-f99e41ced414d50f5b467c4e54685f19R360).

bertwesarg commented 4 years ago

> Actually, the last two lines of the patch might be the culprit (https://github.com/dash-project/dash/pull/709/files#diff-f99e41ced414d50f5b467c4e54685f19R360).

But this should be != MPI_DATATYPE_NULL for all basic types, thus it should not matter.

bertwesarg commented 4 years ago

I will push the changes to SPEC and kindly ask whether this fixes the problem on HPE Cray.