LLNL / UnifyFS

UnifyFS: A file system for burst buffers

NC_CLOBBER leads to delayed unlink problem #784

Open adammoody opened 1 year ago

adammoody commented 1 year ago

When creating a file in PnetCDF, one can use the NC_CLOBBER flag, which indicates that any existing file should first be unlinked or truncated to 0 bytes.

For regular files, the implementation calls unlink or MPI_File_delete on rank 0:

https://github.com/Parallel-NetCDF/PnetCDF/blob/e47596438326bfa7b9ed0b3857800d3a0d09ff1a/src/drivers/ncmpio/ncmpio_create.c#L132-L153

while all other ranks wait in a call to MPI_Bcast for rank 0 to signal that it has deleted the file:

https://github.com/Parallel-NetCDF/PnetCDF/blob/e47596438326bfa7b9ed0b3857800d3a0d09ff1a/src/drivers/ncmpio/ncmpio_create.c#L202-L205

When running some tests with higher rank counts, random ranks fail with a "bad file descriptor" error. For example, test/testcases/test_erange.c fails when running 6 ranks on one node with the following errors:

+ srun --overlap -n 6 -N 1 ./test_erange /unifyfs/testfile.nc
MPI error (MPI_File_write_at_all) : Other I/O error , error stack:
ADIOI_GEN_WRITECONTIG(78): Other I/O error Bad file descriptor
Error at line 107 in test_erange.c: (NC_EWRITE)
MPI error (MPI_File_read_at_all) : Other I/O error , error stack:
ADIOI_GEN_READCONTIG(75): Other I/O error Bad file descriptor
Error at line 109 in test_erange.c: (NC_EREAD)
Error at line 111: unexpected read value 3 (expecting 255)
MPI error (MPI_File_write_at_all) : Other I/O error , error stack:
ADIOI_GEN_WRITECONTIG(78): Other I/O error Bad file descriptor
Error at line 117 in test_erange.c: (NC_EWRITE)
MPI error (MPI_File_read_at_all) : Other I/O error , error stack:
ADIOI_GEN_READCONTIG(75): Other I/O error Bad file descriptor
Error at line 119 in test_erange.c: (NC_EREAD)
Error at line 121: unexpected read value 0 (expecting -128)
MPI error (MPI_File_write_at_all) : Other I/O error , error stack:
ADIOI_GEN_WRITECONTIG(78): Other I/O error Bad file descriptor
MPI error (MPI_File_write_at_all) : Other I/O error , error stack:
ADIOI_GEN_WRITECONTIG(78): Other I/O error Bad file descriptor
MPI error (MPI_File_write_at_all) : Other I/O error , error stack:
ADIOI_GEN_WRITECONTIG(78): Other I/O error Bad file descriptor
Error at line 155 in test_erange.c: (NC_EWRITE)
MPI error (MPI_File_read_at_all) : Other I/O error , error stack:
ADIOI_GEN_READCONTIG(75): Other I/O error Bad file descriptor
Error at line 157 in test_erange.c: (NC_EREAD)
Error at line 159: unexpected read value -48 (expecting -128)
MPI error (MPI_File_close) : Other I/O error , error stack:
ADIOI_GEN_CLOSE(120): Other I/O error Bad file descriptor
Error at line 191 in test_erange.c: (NC_EFILE)
*** TESTING C   test_erange for checking for NC_ERANGE             ------ fail with 15 mismatches
MPI error (MPI_File_write_at_all) : Other I/O error , error stack:
ADIOI_GEN_WRITECONTIG(78): Other I/O error Bad file descriptor
Error at line 226 in test_erange.c: (NC_EWRITE)
MPI error (MPI_File_read_at_all) : Other I/O error , error stack:
ADIOI_GEN_READCONTIG(75): Other I/O error Bad file descriptor
Error at line 229 in test_erange.c: expecting NC_ERANGE but got NC_EREAD
MPI error (MPI_File_write_at_all) : Other I/O error , error stack:
ADIOI_GEN_WRITECONTIG(78): Other I/O error Bad file descriptor
MPI error (MPI_File_write_at_all) : Other I/O error , error stack:
ADIOI_GEN_WRITECONTIG(78): Other I/O error Bad file descriptor
MPI error (MPI_File_write_at_all) : Other I/O error , error stack:
ADIOI_GEN_WRITECONTIG(78): Other I/O error Bad file descriptor
Error at line 248 in test_erange.c: (NC_EWRITE)
MPI error (MPI_File_read_at_all) : Other I/O error , error stack:
ADIOI_GEN_READCONTIG(75): Other I/O error Bad file descriptor
Error at line 250 in test_erange.c: expecting NC_ERANGE but got NC_EREAD
MPI error (MPI_File_close) : Other I/O error , error stack:
ADIOI_GEN_CLOSE(120): Other I/O error Bad file descriptor
Error at line 252 in test_erange.c: (NC_EFILE)
srun: error: quartz5: tasks 0-5: Exited with exit code 1

This test program contains a number of consecutive test cases:

https://github.com/Parallel-NetCDF/PnetCDF/blob/e47596438326bfa7b9ed0b3857800d3a0d09ff1a/test/testcases/test_erange.c#L287-L288

that each open the same file with NC_CLOBBER:

https://github.com/Parallel-NetCDF/PnetCDF/blob/e47596438326bfa7b9ed0b3857800d3a0d09ff1a/test/testcases/test_erange.c#L49-L50

I believe our delayed unlink may be the cause. I think the file (and its descriptor) gets deleted in the background on some ranks after the file has been opened and while it is in use. A future write call then fails when it detects that the file descriptor is no longer valid.
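To make the suspected ordering problem concrete, here is a minimal local sketch in plain POSIX C. It does not use UnifyFS or MPI, and the function name `stale_unlink_hits_new_file` is hypothetical. It only shows that an unlink that names its target by path, and that runs later than intended, removes whatever file currently has that name, including a file that was created after the unlink was requested. One difference to keep in mind: on a local POSIX file system the open descriptor stays usable after the unlink, whereas the hypothesis above is that in UnifyFS the descriptor becomes invalid.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical illustration, not UnifyFS code.
 * Returns 1 if a late, path-based unlink removed the newly created file,
 * 0 if the path still exists afterwards, and -1 on a setup error.
 */
int stale_unlink_hits_new_file(const char *path)
{
    struct stat st;
    int fd_old, fd_new, gone;

    /* The pre-existing file that NC_CLOBBER wants removed. */
    fd_old = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd_old < 0)
        return -1;
    close(fd_old);

    /* The unlink is requested at this point, but we pretend that it is
     * deferred and has not been applied yet. */

    /* The application goes on to create and open its new file under the
     * same name. */
    fd_new = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd_new < 0)
        return -1;

    /* The deferred unlink is finally applied. It identifies its target by
     * name only, so it removes the file that is currently open. */
    if (unlink(path) != 0) {
        close(fd_new);
        return -1;
    }

    /* The name is gone even though fd_new is still open. */
    gone = (stat(path, &st) != 0);

    close(fd_new);
    return gone;
}
```

Calling `stale_unlink_hits_new_file("/tmp/testfile.nc")` on a local file system is expected to return 1, since the path no longer exists once the late unlink has run.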

Setting UNIFYFS_CLIENT_UNLINK_USECS does not help in this case, or at least values of 1 second and 10 seconds do not. I'm not yet sure why.

PnetCDF happens to have a code path in which it truncates the file rather than deletes it. It uses this for symlinks, but it deletes regular files. Hacking this so that PnetCDF truncates regular files (rather than deleting them) allows the test case to pass. That amounts to commenting out this line:

https://github.com/Parallel-NetCDF/PnetCDF/blob/e47596438326bfa7b9ed0b3857800d3a0d09ff1a/src/drivers/ncmpio/ncmpio_create.c#L103
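The difference between the two paths can be sketched with plain POSIX calls. This is not the PnetCDF implementation: the function `clobber_existing` and its `force_truncate` parameter are hypothetical, and the real code goes through rank 0 and MPI-IO as linked above. The sketch only illustrates the decision described here, where symlinks are truncated, regular files are deleted, and the workaround forces truncation for everything.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical illustration, not PnetCDF code.
 * Clears an existing file before it is re-created.
 * Returns 0 on success (including when the file does not exist), -1 on error.
 */
int clobber_existing(const char *path, int force_truncate)
{
    struct stat st;

    /* lstat does not follow symlinks, so a link can be told apart from a
     * regular file. */
    if (lstat(path, &st) != 0)
        return (errno == ENOENT) ? 0 : -1;

    if (force_truncate || S_ISLNK(st.st_mode)) {
        /* Truncate path: the name and the file stay in place and only the
         * contents are dropped, so no unlink is ever issued. */
        int fd = open(path, O_WRONLY | O_TRUNC);
        if (fd < 0)
            return -1;
        return close(fd);
    }

    /* Delete path: the name is removed and the caller creates a new file. */
    return unlink(path);
}
```

With `force_truncate` set to 1, the file is emptied but never removed, which is why this path sidesteps the delayed unlink entirely.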

Short term, PnetCDF users who run into this bug may need the above one-line patch.

Medium term, perhaps the PnetCDF team would be open to defining a new hint to enable users to select the truncate path rather than the delete path.

Long term, we should fix our unlink implementation. https://github.com/LLNL/UnifyFS/issues/744