darshan-hpc / darshan

Darshan I/O characterization tool

Incorrect Timing of File Close when Using NetCDF4 #906

Closed yzanhua closed 1 year ago

yzanhua commented 1 year ago

Summary

NetCDF4 can perform parallel I/O using parallel HDF5. When using Darshan to capture a NetCDF4 application's I/O behavior, I observe that the actual file close is delayed from the nc_close call until MPI_Finalize. This incorrect timing affects the correctness of the Log VOL, which needs to use and release HDF5 resources at file close time, some of which are no longer available at MPI_Finalize (e.g. H5T_STD_B8LE).

Reproduce

Test program

test.c is a simple NetCDF4 program that creates a NetCDF4 file and closes it immediately. It also prints the strings application: nc_close start and application: nc_close end before and after nc_close.

test.c:

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>

/* On error: print the NetCDF error string and jump to cleanup. */
#define FATAL_ERR {if(err!=NC_NOERR) {printf("Error at line=%d: %s Aborting ...\n", __LINE__, nc_strerror(err)); goto fn_exit;}}
/* On error: print the NetCDF error string and continue. */
#define ERR {if(err!=NC_NOERR)printf("Error at line=%d: %s\n", __LINE__, nc_strerror(err));}

int main(int argc, char** argv) {
    const char* filename="testfile";
    int err;
    int ncid, cmode;

    MPI_Init(&argc, &argv);

    /* create a new file for writing ----------------------------------------*/
    cmode = NC_NETCDF4 | NC_CLOBBER | NC_MPIIO;
    err = nc_create_par(filename, cmode, MPI_COMM_WORLD, MPI_INFO_NULL, &ncid); FATAL_ERR

    /* exit define mode */
    err = nc_enddef(ncid); ERR

    /* close the file */
    printf("========= application: nc_close start\n");
    err = nc_close(ncid); ERR
    printf("========= application: nc_close end\n");

fn_exit:
    MPI_Finalize();
    return 0;
}
```

Compile and Run

A Makefile is provided below. Run make to compile the program; make withdarshan and make nodarshan run it with and without Darshan, respectively. Note that the Passthrough VOL is enabled so that a message is printed when the actual file close happens. The Passthrough VOL ships with the HDF5 installation, but HDF5 must be built with CFLAGS="-DENABLE_PASSTHRU_LOGGING" to enable printing. The program runs with 1 MPI process.

Makefile:

```makefile
DARSHAN_DIR=${LOCAL_HOME}/Darshan/3.4.2/lib/libdarshan.so
HDF5_DIR=${LOCAL_HOME}/HDF5/1.14.0
NETCDF_DIR=${LOCAL_HOME}/NetCDF/install

all:
	mpicc test.c -g -o test \
	-I${NETCDF_DIR}/include \
	-L${NETCDF_DIR}/lib -lnetcdf

withdarshan:
	HDF5_PLUGIN_PATH=${HDF5}/lib \
	LD_LIBRARY_PATH=${NETCDF_DIR}/lib:${HDF5_DIR}/lib \
	HDF5_VOL_CONNECTOR="pass_through under_vol=0;under_info={}" \
	mpirun -n 1 -env LD_PRELOAD="${DARSHAN_DIR}" ./test

nodarshan:
	HDF5_PLUGIN_PATH=${HDF5}/lib \
	LD_LIBRARY_PATH=${NETCDF_DIR}/lib:${HDF5_DIR}/lib \
	HDF5_VOL_CONNECTOR="pass_through under_vol=0;under_info={}" \
	mpirun -n 1 ./test

clean:
	rm -rf testfile core.* test
```

Outputs

The outputs with and without Darshan are shown below. The file close is expected to happen at the same point in both runs, but it does not. Without Darshan, PASS THROUGH VOL FILE Close occurs between application: nc_close start and application: nc_close end. With Darshan enabled, PASS THROUGH VOL FILE Close occurs after application: nc_close end.

No-darshan (expected) output:

```txt
HDF5_PLUGIN_PATH=/lib \
LD_LIBRARY_PATH=/files2/scratch/zhd1108/NetCDF/install/lib:/files2/scratch/zhd1108/HDF5/1.14.0/lib \
HDF5_VOL_CONNECTOR="pass_through under_vol=0;under_info={}" \
mpirun -n 1 ./test
------- PASS THROUGH VOL INIT
------- PASS THROUGH VOL INFO String To Info
------- PASS THROUGH VOL INFO Copy
------- PASS THROUGH VOL INFO Copy
------- PASS THROUGH VOL FILE Create
------- PASS THROUGH VOL INFO Copy
------- PASS THROUGH VOL INFO Copy
------- PASS THROUGH VOL INFO Free
------- PASS THROUGH VOL INFO Copy
------- PASS THROUGH VOL INFO Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL INTROSPECT OptQuery
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL File Optional
------- PASS THROUGH VOL WRAP Object
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL GROUP Open
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL INFO Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL FILE Get
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL ATTRIBUTE Specific
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL ATTRIBUTE Create
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL ATTRIBUTE Write
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL ATTRIBUTE Close
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL FILE Specific
------- PASS THROUGH VOL WRAP CTX Free
========= application: nc_close start
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL ATTRIBUTE Specific
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL FILE Specific
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL H5Gclose
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL FILE Close
------- PASS THROUGH VOL INFO Free
------- PASS THROUGH VOL UNWRAP Object
------- PASS THROUGH VOL WRAP CTX Free
========= application: nc_close end
------- PASS THROUGH VOL INFO Free
------- PASS THROUGH VOL INFO Free
------- PASS THROUGH VOL TERM
```

Output with Darshan enabled:

```txt
HDF5_PLUGIN_PATH=/lib \
LD_LIBRARY_PATH=/files2/scratch/zhd1108/NetCDF/install/lib:/files2/scratch/zhd1108/HDF5/1.14.0/lib \
HDF5_VOL_CONNECTOR="pass_through under_vol=0;under_info={}" \
mpirun -n 1 -env LD_PRELOAD="/files2/scratch/zhd1108/Darshan/3.4.2/lib/libdarshan.so" ./test
------- PASS THROUGH VOL INIT
------- PASS THROUGH VOL INFO String To Info
------- PASS THROUGH VOL INFO Copy
------- PASS THROUGH VOL INFO Copy
------- PASS THROUGH VOL FILE Create
------- PASS THROUGH VOL INFO Copy
------- PASS THROUGH VOL INFO Copy
------- PASS THROUGH VOL INFO Free
------- PASS THROUGH VOL INFO Copy
------- PASS THROUGH VOL INFO Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL INTROSPECT OptQuery
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL File Optional
------- PASS THROUGH VOL WRAP Object
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL GROUP Open
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL INFO Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL FILE Get
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL ATTRIBUTE Specific
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL ATTRIBUTE Create
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL ATTRIBUTE Write
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL ATTRIBUTE Close
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL FILE Specific
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL OBJECT Get
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL Get object
========= application: nc_close start
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL ATTRIBUTE Specific
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL FILE Specific
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL OBJECT Get
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL Get object
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL H5Gclose
------- PASS THROUGH VOL WRAP CTX Free
========= application: nc_close end
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL FILE Close
------- PASS THROUGH VOL INFO Free
------- PASS THROUGH VOL UNWRAP Object
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL INFO Free
------- PASS THROUGH VOL INFO Free
------- PASS THROUGH VOL TERM
```
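
The ordering check described in this section (does FILE Close land between the two nc_close markers, or after them?) could be automated with a small helper. The following is only a sketch: `close_inside_nc_close` is a hypothetical name, not part of Darshan, HDF5, or NetCDF, and it assumes the whole captured log is available as one in-memory string containing the marker text exactly as printed above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Marker strings as printed by test.c and by the Passthrough VOL logging. */
static const char *START_MARK = "application: nc_close start";
static const char *END_MARK   = "application: nc_close end";
static const char *CLOSE_MARK = "PASS THROUGH VOL FILE Close";

/*
 * Hypothetical helper: classify where the file close appears in a log.
 * Returns  1 if the file-close message lies between the start and end markers,
 *          0 if it first appears after the end marker (delayed close),
 *         -1 if a marker is missing or no file close follows the start marker.
 */
int close_inside_nc_close(const char *log)
{
    assert(log != NULL);

    const char *start = strstr(log, START_MARK);
    if (start == NULL) return -1;

    const char *end = strstr(start, END_MARK);
    if (end == NULL) return -1;

    /* Search from the start marker so earlier messages are ignored. */
    const char *file_close = strstr(start, CLOSE_MARK);
    if (file_close == NULL) return -1;

    return (file_close < end) ? 1 : 0;
}
```

Under these assumptions, the no-darshan log above would be classified as 1 and the Darshan-enabled log as 0.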

Library Version

  1. HDF5 1.14.0, configured with --enable-parallel, --enable-build-mode=debug, and CFLAGS="-DENABLE_PASSTHRU_LOGGING"
  2. NetCDF 4.9.1, configured with --disable-dap --disable-mmap --disable-nczarr --disable-byterange (some of these configure options are necessary to avoid known compilation issues with HDF5 1.14.0)
  3. Darshan 3.4.2

(I verified that the problem also reproduces with HDF5 1.13.2 and NetCDF 4.9.0.)

Other findings

The problem can be reproduced without the Passthrough VOL. We can add a print statement for info->count in the HDF5 source code here; it shows the reference count of an object. If Darshan is not enabled, the reference count is 1 at the time nc_close calls H5Fclose. If Darshan is enabled, the reference count is 3, so HDF5 assumes something else is still accessing the file and delays the actual close until the very end. I am not sure whether Darshan holds an extra reference to the file or whether this is an issue more related to NetCDF4. Using HDF5 directly (no NetCDF4 involved) does not trigger the issue.
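
To make the suspected mechanism concrete, here is a toy model of a close that is deferred by reference counting. It is not HDF5's actual implementation: `toy_file`, `toy_open`, `toy_retain`, and `toy_release` are hypothetical names invented for this sketch, and the only assumption carried over from the observation above is that the real close runs when the count drops to zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a shared file object; NOT HDF5's real data structure. */
typedef struct {
    int  refcount;  /* number of outstanding references to the file */
    bool closed;    /* true once the underlying close has actually run */
} toy_file;

/* Opening the file creates the first reference. */
void toy_open(toy_file *f)
{
    f->refcount = 1;
    f->closed = false;
}

/* Another component (hypothetically, an interposed library) takes a reference. */
void toy_retain(toy_file *f)
{
    assert(!f->closed);
    f->refcount++;
}

/*
 * Drop one reference. The underlying close runs only when the count
 * reaches zero. Returns true if this call performed the actual close.
 */
bool toy_release(toy_file *f)
{
    assert(f->refcount > 0);
    f->refcount--;
    if (f->refcount == 0) {
        f->closed = true;
        return true;
    }
    return false;
}
```

In this model, a file with a count of 1 is closed by the first `toy_release`, which matches the no-Darshan case. With a count of 3, the first `toy_release` (standing in for the H5Fclose issued by nc_close) leaves the file open, and the actual close waits until the remaining two references are dropped.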