HDFGroup / vol-cache

HDF5 Cache VOL connector for caching data on fast storage layers and moving data asynchronously to the parallel file system to hide I/O overhead.
https://vol-cache.readthedocs.io
BSD 3-Clause "New" or "Revised" License

H5Fget_access_plist does not return a valid faplid #15

Open yzanhua opened 1 year ago

yzanhua commented 1 year ago

Summary

When using Cache VOL together with Async VOL, H5Fget_access_plist does not appear to return a valid faplid. The returned ID is non-negative, but it does not seem to be a property list.

Error Details

% echo $HDF5_VOL_CONNECTOR 
cache_ext config=cache_1.cfg;under_vol=512;under_info={under_vol=0;under_info={}}

% mpirun -n 1 ./test
HDF5-DIAG: Error detected in HDF5 (1.13.3-1) MPI-process 0:
  #000: ../../hdf5-dev/src/H5Pfapl.c line 1487 in H5Pget_driver(): can't get driver
    major: Property lists
    minor: Can't get value
  #001: ../../hdf5-dev/src/H5Pfapl.c line 1444 in H5P_peek_driver(): not a file access property list
    major: Property lists
    minor: Inappropriate type
  #002: ../../hdf5-dev/src/H5Pint.c line 4067 in H5P_isa_class(): not a property list
    major: Invalid arguments to routine
    minor: Inappropriate type
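
The HDF5_VOL_CONNECTOR value in the error details describes a stack of connectors: each `under_vol=` entry names the connector ID sitting beneath the previous one. The sketch below pulls those IDs out in nesting order. The helper name `list_under_vols` is hypothetical and not part of HDF5. Reading 512 as Async VOL is an assumption based on the later comments in this thread; 0 is the native HDF5 connector.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: collect every "under_vol=<id>" value found in an
 * HDF5_VOL_CONNECTOR string, outermost layer first.
 * Returns the number of IDs written to ids (at most max_ids). */
static int list_under_vols(const char *connector, int *ids, int max_ids)
{
    const char *key = "under_vol=";
    const char *p = connector;
    int n = 0;

    while (n < max_ids && (p = strstr(p, key)) != NULL) {
        p += strlen(key);                     /* skip past "under_vol=" */
        ids[n++] = (int)strtol(p, NULL, 10);  /* parse the numeric ID   */
    }
    return n;
}
```

Called on the string shown above, the helper yields 512 followed by 0, i.e. cache_ext on top of connector 512 on top of the native connector.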

Test Program

```c++
#include <stdio.h>
#include <mpi.h>
#include <hdf5.h>

#define N 10
#define CHECK_ERR(A) {if (A < 0) { printf("Error at line %d: code %d\n", __LINE__, A); }}

int main(int argc, char **argv) {
    herr_t err = 0;
    int mpi_required;
    const char *file_name = "test.h5";

    hid_t fid = -1;     // File ID
    hid_t faplid = -1;  // File Access Property List
    hid_t plist_id = -1;
    hid_t faplid2 = -1;

    // init MPI
    err = MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &mpi_required);
    CHECK_ERR(err);

    // create file
    faplid = H5Pcreate(H5P_FILE_ACCESS);
    CHECK_ERR(faplid);
    H5Pset_fapl_mpio(faplid, MPI_COMM_WORLD, MPI_INFO_NULL);
    fid = H5Fcreate(file_name, H5F_ACC_TRUNC, H5P_DEFAULT, faplid);
    CHECK_ERR(fid);

    // get faplid
    faplid2 = H5Fget_access_plist (fid);
    CHECK_ERR (faplid2);
    plist_id = H5Pget_driver (faplid2);  // Error occurs here

    if (fid >= 0) H5Fclose(fid);
    if (faplid >= 0) H5Pclose(faplid);
    MPI_Finalize();
    return 0;
}
```

Libraries Versions (commit number)

1. HDF5 develop branch: HDFGroup/hdf5@b5598575bb8a2495d6f306233b00d612258ad718
2. Argobots main branch: pmodels/argobots@dce6e727ffc4ca5b3ffc04cb9517c6689be51ec5
3. AsyncVol develop branch: hpc-io/vol-async@0a92d232ed01ecbb6ab59fbfa4807458c88922a7
4. Cache Vol develop branch: hpc-io/vol-cache@f453900b64cfbc5d3197acb5292e6e379ce2ac20

wkliao commented 1 year ago

Will this issue be addressed soon?

zhenghh04 commented 1 year ago

Hi @wkliao @yzanhua, this is an issue in the HDF5 library. I encountered it when running E3SM-IO, and I had to comment out H5Fget_access_plist in the code to get it to run. I mentioned it to Neil before. Maybe report this to HDF5?

wkliao commented 1 year ago

I am not sure whether this is HDF5's issue. @yzanhua tested the small program he provided in this issue with the following configurations. It failed only when using Cache+Async VOLs.

- Cache+Async VOL: fail
- Cache VOL only: success
- Passthrough VOL only: success
- Log VOL only: success

using:

- HDF5: 1.13.3
- Cache VOL: master branch
- Async VOL: v1.4

yzanhua commented 1 year ago

It also fails when using Async only. It seems like Async VOL (instead of Cache VOL) is not handling faplid correctly.

zhenghh04 commented 1 year ago

Yes, it is with Async + HDF5. @houjun, did you encounter this issue before?

houjun commented 1 year ago

Yes, I remember it is related to the future ID when async is used. I'll take another look and check with the HDF people.

yzanhua commented 1 year ago

The provided test program fails in H5Pget_driver (the line where the invalid faplid2 is first used). I also tried replacing H5Pget_driver with other H5Pget_xxx calls to see whether the program still fails. The results might help with debugging.

H5Pget_driver_info, H5Pget_fapl_mpio, and H5Pget_fapl_core fail with the same error messages, complaining about "not a property list".

However, H5Pget_fclose_degree and H5Pget_evict_on_close run without a problem.