darshan-hpc / darshan

Darshan I/O characterization tool

H5Pset_fapl_mpio fails with darshan #960

Closed wangvsa closed 11 months ago

wangvsa commented 11 months ago

Hi,

I was trying to trace some HDF5 applications and stumbled upon this issue.

HDF5 version: 1.8.20 (parallel HDF5 enabled)
Darshan version: darshan-3.4.4

Darshan was configured using:

./configure --prefix=/xxx/darshan-3.4.4/install --with-log-path=/xxx/darshan-logs --with-jobid-env=SLURM_JOB_ID --with-hdf5=$HDF5_1_8_HOME --enable-hdf5-mod CC=mpicc

The issue can be reproduced using the following code:

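#include <stdio.h>
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    herr_t res;
    /* Create a file access property list. */
    hid_t plist_id = H5Pcreate(H5P_FILE_ACCESS);
    if (plist_id == H5I_INVALID_HID)
        printf("H5Pcreate failed\n");

    /* Select the MPI-IO driver; this is the call that fails under Darshan. */
    res = H5Pset_fapl_mpio(plist_id, MPI_COMM_WORLD, MPI_INFO_NULL);
    if (res < 0)
        printf("H5Pset_fapl_mpio failed\n");

    H5Pclose(plist_id);
    MPI_Finalize();
    return 0;
}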

Compile & run:

mpicc test_phdf5.c -o test_phdf5 -I$HDF5_1_8_HOME/include -L$HDF5_1_8_HOME/lib -lhdf5
srun -n1 --overlap --export=ALL,LD_PRELOAD=$libdarshan ./test_phdf5

Without Darshan, the code finishes without any error. With Darshan preloaded, the H5Pset_fapl_mpio call fails.

shanedsnyder commented 11 months ago

Thanks for the report!

I was able to reproduce the issue. From some quick testing, I think this may be related to this "workaround" commit we merged in our last release: #833

Basically, some change in HDF5 headers in version 1.13+ was causing Darshan to print out a symbol error when it was LD_PRELOADed. We found what we thought was a workaround, but it seems like it might be leading to this particular error now. I need to dig more to confirm that and see if there's an actual way to resolve both issues.

In the meantime, I think you could roll back to Darshan 3.4.3 and avoid the issue. You aren't missing anything with 3.4.4 -- that was the only HDF5-related change, and it was only intended to avoid an issue with much newer HDF5 versions than 1.8.