In testing PnetCDF, some tests fail when creating a file after deleting a file by the same name (see https://github.com/LLNL/UnifyFS/issues/744). As a workaround, this adds an optional sleep immediately after a client calls the client-to-server unlink RPC, giving the unlink operation more time to complete before the client returns from its call to unlink().
To enable this option, set the new config parameter, which gives the sleep time in microseconds:
export UNIFYFS_CLIENT_UNLINK_USECS=1000000
For the first test case that was failing, which was a serial program (single-process MPI job), a value of 1000000 (1 second) was sufficient. Higher sleep times may be required for parallel jobs.
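As a rough illustration of the workaround, the sketch below reads UNIFYFS_CLIENT_UNLINK_USECS from the environment and sleeps after the unlink RPC returns. It is not the code from this PR: the helper names (unlink_delay_usecs, fake_unlink_rpc, unlink_with_delay) are hypothetical, the RPC is a stub, and the actual change reads the value through the UnifyFS config machinery rather than calling getenv() directly.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for nanosleep() and setenv() */
#endif

#include <errno.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: parse the delay from the environment.
 * Returns 0 (no delay) if the variable is unset, empty,
 * negative, or not a plain decimal integer. */
static long unlink_delay_usecs(void)
{
    const char* val = getenv("UNIFYFS_CLIENT_UNLINK_USECS");
    if (val == NULL || *val == '\0') {
        return 0;
    }

    char* end = NULL;
    errno = 0;
    long usecs = strtol(val, &end, 10);
    if (errno != 0 || *end != '\0' || usecs < 0) {
        return 0;
    }
    return usecs;
}

/* Hypothetical stand-in for the client-to-server unlink RPC.
 * A real client would send the request to its server here. */
static int fake_unlink_rpc(const char* path)
{
    (void) path;
    return 0;
}

/* Sketch of the workaround: issue the RPC, then optionally sleep
 * so the unlink has more time to complete before returning. */
static int unlink_with_delay(const char* path)
{
    int rc = fake_unlink_rpc(path);

    long usecs = unlink_delay_usecs();
    if (usecs > 0) {
        struct timespec ts;
        ts.tv_sec  = usecs / 1000000L;
        ts.tv_nsec = (usecs % 1000000L) * 1000L;
        nanosleep(&ts, NULL);
    }
    return rc;
}
```

With the variable unset, the wrapper returns right after the RPC, so the default behavior is unchanged.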
This is a hack, but it helps for now.
A better fix would be to implement a mode where the unlink() wrapper blocks at the calling client until all servers have indicated that the unlink operation has completed. That may require a round trip between each server and each of its clients, since each client has to do some work to support unlink. That change will be a more substantial effort, so it is saved for future work. Once it is added, this workaround could be removed.
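The completion condition for such a blocking mode could be tracked along the following lines. This is only a sketch of the bookkeeping, not a proposed design: the type and function names and the MAX_SERVERS limit are all hypothetical, and a real implementation would still need the RPCs that deliver the acknowledgements and a way to block (for example, a progress loop or condition variable) until tracker_complete() returns true.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_SERVERS 64 /* arbitrary limit for this sketch */

/* Hypothetical record of which servers have confirmed an unlink. */
typedef struct {
    size_t num_servers;        /* servers that must acknowledge */
    size_t num_acked;          /* distinct servers acknowledged so far */
    bool   acked[MAX_SERVERS]; /* per-server acknowledgement flags */
} unlink_ack_tracker;

/* Reset the tracker for a new unlink operation.
 * Returns 0 on success, -1 if num_servers is out of range. */
static int tracker_init(unlink_ack_tracker* t, size_t num_servers)
{
    if (num_servers == 0 || num_servers > MAX_SERVERS) {
        return -1;
    }
    t->num_servers = num_servers;
    t->num_acked = 0;
    for (size_t i = 0; i < num_servers; i++) {
        t->acked[i] = false;
    }
    return 0;
}

/* Record an acknowledgement from one server. Duplicate or
 * out-of-range acknowledgements are ignored. */
static void tracker_record_ack(unlink_ack_tracker* t, size_t server_rank)
{
    if (server_rank < t->num_servers && !t->acked[server_rank]) {
        t->acked[server_rank] = true;
        t->num_acked++;
    }
}

/* True once every server has acknowledged; only then would the
 * blocking unlink() wrapper return to the caller. */
static bool tracker_complete(const unlink_ack_tracker* t)
{
    return t->num_acked == t->num_servers;
}
```

Counting distinct servers rather than raw messages keeps a retransmitted acknowledgement from releasing the caller early.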
Description
Motivation and Context
How Has This Been Tested?
Types of changes
[ ] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[ ] Performance enhancement (non-breaking change which improves efficiency)
[ ] Code cleanup (non-breaking change which makes code smaller or more readable)
[ ] Breaking change (fix or feature that would cause existing functionality to change)
[ ] Testing (addition of new tests or update to current tests)
[x] Documentation (a change to man pages or other documentation)
Checklist:
[x] My code follows the UnifyFS code style requirements.