LLNL / UnifyFS

UnifyFS: A file system for burst buffers

PnetCDF unlink race condition in nc_test #744

Open adammoody opened 1 year ago

adammoody commented 1 year ago

In PnetCDF, nc_test runs a number of quick-running tests, each of which creates a scratch file, executes I/O operations on that file, and then deletes the scratch file. The consecutive tests shown here complete quickly, and all of them use the same filename for the scratch file.

https://github.com/Parallel-NetCDF/PnetCDF/blob/6c71a30cd95f575c01025c0c926fc06dc9157774/test/nc_test/nc_test.c#L414-L420

The MPI_File_open() call of one of these tests fails when it attempts to sync extents with the server during an internal call to close(). ROMIO's MPI_File_open() calls both open() and close() on the file. The close() then tries to sync extents with the server because the file had been opened for writing.

https://github.com/Parallel-NetCDF/PnetCDF/blob/6c71a30cd95f575c01025c0c926fc06dc9157774/test/nc_test/test_write.m4#L394

The extent sync fails because the file has been deleted in the meantime, so meta->fid (-1) no longer matches fid (2) at this check:

https://github.com/LLNL/UnifyFS/blob/a13edaf779c755b3314f1bd7ce7f798d532d9951/client/src/unifyfs_fid.c#L1086-L1096

This situation happens because the prior test deleted the scratch file via MPI_File_delete() -> unlink().

https://github.com/Parallel-NetCDF/PnetCDF/blob/6c71a30cd95f575c01025c0c926fc06dc9157774/test/nc_test/test_write.m4#L437

That unlink() sends an unlink RPC from the client to the server, which later triggers a server-to-client unlink callback RPC. The callback arrives at the client asynchronously, and in this case it fires between the open() and close() calls inside MPI_File_open(). The open() successfully recreates the file, the unlink callback then deletes it, and the close() fails to sync extents, which causes MPI_File_open() to return an error:

https://github.com/LLNL/UnifyFS/blob/a13edaf779c755b3314f1bd7ce7f798d532d9951/client/src/unifyfs-sysio.c#L2322-L2330