LLNL / UnifyFS

UnifyFS: A file system for burst buffers

Potential client crash at MPI_Finalize() time #777

Closed wangvsa closed 1 year ago

wangvsa commented 1 year ago

When an MPI attribute is created on MPI_COMM_SELF with a delete callback (MPI_Comm_delete_attr_function), MPI_Finalize() invokes that callback so the user can clean up resources. HDF5 uses such a callback to write cached data to the file at MPI_Finalize() time.

Example found from HDF5 source code:

MPI_Comm_create_keyval(MPI_NULL_COPY_FN, (MPI_Comm_delete_attr_function *)H5_mpi_delete_cb, &key_val, NULL);
MPI_Comm_set_attr(MPI_COMM_SELF, key_val, NULL);
MPI_Comm_free_keyval(&key_val);

However, our MPI_Finalize() wrapper unmounts the client before calling PMPI_Finalize(). The HDF5 callback therefore runs after the unmount and writes to files that no longer exist, which may crash the client. I noticed this issue when running qmcpack (it uses HDF5).

Fix: call PMPI_Finalize() before unmounting the client, provided unifyfs_unmount() does not use MPI.
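
To illustrate why the order matters, here is a minimal model of the two orderings in plain C, with no MPI dependency. Every name in it (`model_flush_cb`, `model_pmpi_finalize`, `model_unmount`, and the two `finalize_*` functions) is a hypothetical stand-in and not a real MPI, HDF5, or UnifyFS call. It assumes, as the fix does, that unmounting does not need MPI.

```c
#include <stdbool.h>

/* Hypothetical stand-ins only: nothing here is a real MPI, HDF5, or UnifyFS API. */
static bool client_mounted = true; /* models whether the client is still mounted */
static int  failed_writes  = 0;    /* counts writes attempted after the unmount */

/* Models the HDF5 delete callback that flushes cached data to a file. */
static void model_flush_cb(void)
{
    if (!client_mounted) {
        failed_writes++; /* in the real client this write may crash */
    }
}

/* Models PMPI_Finalize(): it runs the delete callbacks before returning. */
static void model_pmpi_finalize(void)
{
    model_flush_cb();
}

/* Models unifyfs_unmount(), assumed here not to use MPI. */
static void model_unmount(void)
{
    client_mounted = false;
}

/* Restores the initial state so each ordering is measured independently. */
static void model_reset(void)
{
    client_mounted = true;
    failed_writes  = 0;
}

/* Problematic order: unmount first, then finalize. Returns failed writes. */
int finalize_unmount_first(void)
{
    model_reset();
    model_unmount();
    model_pmpi_finalize();
    return failed_writes;
}

/* Proposed order: finalize first, then unmount. Returns failed writes. */
int finalize_mpi_first(void)
{
    model_reset();
    model_pmpi_finalize();
    model_unmount();
    return failed_writes;
}
```

In this model, `finalize_unmount_first()` reports one write after the unmount, while `finalize_mpi_first()` reports none. In the real wrapper, the second ordering corresponds to calling PMPI_Finalize() first and unifyfs_unmount() afterwards.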

adammoody commented 1 year ago

Great debugging, @wangvsa! Your proposed fix also sounds good to me.