StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

realm: hdf5 write to disk fails on rectangular region #1169

Open rohany opened 2 years ago

rohany commented 2 years ago

This small program:

#include "legion.h"
#include <hdf5.h>

using namespace Legion;

enum TaskIDs {
  TID_TOP_LEVEL,
};

enum FieldIDs {
  FID_VAL,
};

void top_level_task(const Task* task, const std::vector<PhysicalRegion>& regions, Context ctx, Runtime* runtime) {
  // Program succeeds if dimx == dimy.
  auto dimx = 6;
  auto dimy = 10;
  auto ispace = runtime->create_index_space(ctx, Rect<2>({0, 0}, {dimx - 1, dimy - 1}));
  auto fspace = runtime->create_field_space(ctx);
  {
    auto alloc = runtime->create_field_allocator(ctx, fspace);
    alloc.allocate_field(sizeof(double), FID_VAL);
  }
  auto reg = runtime->create_logical_region(ctx, ispace, fspace);
  runtime->fill_field(ctx, reg, reg, FID_VAL, double(1));

  // Write this region to an HDF5 file.
  std::string filename = "dummy.hdf5";
  hid_t fileID = H5Fcreate(filename.c_str(), H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
  hsize_t dims[2] = {hsize_t(dimx), hsize_t(dimy)};
  hid_t dataspace = H5Screate_simple(2, dims, NULL);
  hid_t dataset = H5Dcreate2(fileID, "vals", H5T_IEEE_F64LE, dataspace, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
  // Close up everything.
  H5Dclose(dataset);
  H5Sclose(dataspace);
  H5Fclose(fileID);

  // Attach the region to the HDF5 file and write it out.
  auto regCopy = runtime->create_logical_region(ctx, reg.get_index_space(), reg.get_field_space());
  AttachLauncher al(LEGION_EXTERNAL_HDF5_FILE, regCopy, regCopy);
  al.attach_hdf5(filename.c_str(), {{FID_VAL, "vals"}}, LEGION_FILE_READ_WRITE);
  auto pr = runtime->attach_external_resource(ctx, al);
  CopyLauncher cl;
  cl.add_copy_requirements(RegionRequirement(reg, READ_ONLY, EXCLUSIVE, reg),
                           RegionRequirement(regCopy, WRITE_DISCARD, EXCLUSIVE, regCopy));
  cl.add_src_field(0, FID_VAL); cl.add_dst_field(0, FID_VAL);
  runtime->issue_copy_operation(ctx, cl);
  runtime->detach_external_resource(ctx, pr).wait();
}

int main(int argc, char** argv) {
  Runtime::set_top_level_task_id(TID_TOP_LEVEL);
  {
    TaskVariantRegistrar registrar(TID_TOP_LEVEL, "top_level");
    registrar.add_constraint(ProcessorConstraint(Processor::LOC_PROC));
    registrar.set_replicable();
    Runtime::preregister_task_variant<top_level_task>(registrar, "top_level");
  }
  return Runtime::start(argc, argv);
}

fails with an HDF5 error inside a Realm transfer operation:

HDF5-DIAG: Error detected in HDF5 (1.12.1) thread 0:
  #000: H5Dio.c line 291 in H5Dwrite(): can't write data
    major: Dataset
    minor: Write failed
  #001: H5VLcallback.c line 2113 in H5VL_dataset_write(): dataset write failed
    major: Virtual Object Layer
    minor: Write failed
  #002: H5VLcallback.c line 2080 in H5VL__dataset_write(): dataset write failed
    major: Virtual Object Layer
    minor: Write failed
  #003: H5VLnative_dataset.c line 200 in H5VL__native_dataset_write(): could not get a validated dataspace from file_space_id
    major: Invalid arguments to routine
    minor: Bad value
  #004: H5S.c line 266 in H5S_get_validated_dataspace(): selection + offset not within extent
    major: Dataspace
    minor: Out of range
HDF5 error on H5Dwrite(dset->dset_id, dset->dtype_id, mem_space_id, file_space_id, H5P_DEFAULT, fill_data):
HDF5-DIAG: Error detected in HDF5 (1.12.1) thread 0:
  #000: H5Dio.c line 291 in H5Dwrite(): can't write data
    major: Dataset
    minor: Write failed
  #001: H5VLcallback.c line 2113 in H5VL_dataset_write(): dataset write failed
    major: Virtual Object Layer
    minor: Write failed
  #002: H5VLcallback.c line 2080 in H5VL__dataset_write(): dataset write failed
    major: Virtual Object Layer
    minor: Write failed
  #003: H5VLnative_dataset.c line 200 in H5VL__native_dataset_write(): could not get a validated dataspace from file_space_id
    major: Invalid arguments to routine
    minor: Bad value
  #004: H5S.c line 266 in H5S_get_validated_dataspace(): selection + offset not within extent
    major: Dataspace
    minor: Out of range
Assertion failed: (0), function progress_xd, file /Users/rohany/Documents/research/taco/legion/legion/runtime/realm/hdf5/hdf5_internal.cc, line 505.

The program succeeds if dimx and dimy above are set to the same value, so the failure appears to be specific to non-square regions. This is probably a small Realm fix, @streichler.
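One hypothesis consistent with "selection + offset not within extent" on a 6x10 region but not on a square one is that the dimension order of the selection and the dataset extent disagree somewhere. I have not confirmed that this is what Realm does. The sketch below only illustrates the arithmetic: selection_fits and reversed are hypothetical helpers written for this comment, not Realm or HDF5 functions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of Realm or HDF5): returns true when the
// hyperslab given by `start` and `count` lies inside `extent` in every
// dimension, which is the kind of bounds check the HDF5 error message describes.
template <std::size_t N>
bool selection_fits(const std::array<std::uint64_t, N>& extent,
                    const std::array<std::uint64_t, N>& start,
                    const std::array<std::uint64_t, N>& count) {
  for (std::size_t d = 0; d < N; ++d) {
    if (start[d] + count[d] > extent[d]) return false;
  }
  return true;
}

// Hypothetical helper: reverses the dimension order (swaps rows and
// columns in 2D), modelling an assumed ordering mismatch.
template <std::size_t N>
std::array<std::uint64_t, N> reversed(const std::array<std::uint64_t, N>& a) {
  std::array<std::uint64_t, N> r{};
  for (std::size_t d = 0; d < N; ++d) r[d] = a[N - 1 - d];
  return r;
}
```

With an extent of {6, 10} and a start of {0, 0}, a count of {6, 10} fits but the reversed count {10, 6} does not, while for {6, 6} the reversal makes no difference. That matches the square-only success, but it is only an illustration of the symptom, not a diagnosis.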

rohany commented 2 years ago

Pinging here: is there a chance that this is a quick fix? It's blocking some experiments I want to run.

elliottslaughter commented 5 months ago

@rohany Still relevant?

rohany commented 5 months ago

I haven't run the code that needs it in a long time, so it no longer affects me.

elliottslaughter commented 5 months ago

It's not urgent, but if you could rerun the reproducer on master at some point and confirm whether or not it still reproduces, that would help us keep these issues fresh. We don't want to keep issues around forever if we can't reproduce them.