Describe the problem you're observing

Attempting to transfer a file in parallel with either the `unifyfs-stage` or the `transfer-static` application fails.

Found while adding a parallel transfer test in #685. The test is skipped in that PR. Once this issue is resolved, we'll want to enable that test.
Describe how to reproduce the problem
```sh
# start unifyfs server

# create a file to transfer; it needs to be large enough to avoid the
# logic that defaults to serial transfer for small files
export source_file=${MY_JOBDIR}/src_transfer_in.file
dd if=/dev/urandom bs=1M count=100 of=$source_file

# do a parallel transfer
export unifyfs_file=/unifyfs/im_transfer_parallel.file
jsrun -n 4 -a 2 -c 18 -r 1 -e individual -o $UNIFYFS_LOG_DIR/transfer-static-in-parallel.out -k $UNIFYFS_LOG_DIR/transfer-static-in-parallel.err $UNIFYFS_EXAMPLES/transfer-static $source_file $unifyfs_file -p
```
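On a system without `jsrun`, an equivalent launch with a generic MPI launcher (assuming the examples were built with MPI support) would be something like `mpirun -np 8 $UNIFYFS_EXAMPLES/transfer-static $source_file $unifyfs_file -p`.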
Include any warnings, errors, or relevant debugging data
There appear to be several points in the logic where multiple processes fail to coordinate during the transfer.

The first issue is that only rank 0 creates the destination file: https://github.com/LLNL/UnifyFS/blob/3da95deb6254609932511ecdcdd484a51313623e/client/src/posix_client.c#L875-L877

This causes the non-zero ranks to fail when attempting to open the file, as they can't find a local or global reference to it: https://github.com/LLNL/UnifyFS/blob/3da95deb6254609932511ecdcdd484a51313623e/client/src/posix_client.c#L750-L756 resulting in:
```
2021-09-03T15:21:26 tid=40279 @ invoke_client_metaget_rpc() [margo_client.c:487] Got response ret=2
2021-09-03T15:21:26 tid=40279 @ transfer_file_parallel() [posix_client.c:753] failed to open() destination file /unifyfs/im_stage_parallel.file
[1] failed to transfer file (err=2)
failed to transfer file (src=/p/gpfs1/stanavig/jobs/unify/lassen/2776705/src_transfer_in.file, dst=/unifyfs/im_stage_parallel.file): Unknown error -2
data transfer failed (No such file or directory)
```
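For illustration, here is a minimal sketch of coordinating the create so that non-zero ranks can open the destination. This is a hypothetical helper, not the code in the repo, and it assumes the transfer path has access to an MPI communicator:

```c
/* Minimal sketch, assuming MPI is available in the transfer path.
 * open_destination() is a hypothetical helper, not the repo's code.
 * Rank 0 creates the destination exactly once; all other ranks wait
 * for the create to complete before opening, so every rank ends up
 * with a valid fd. */
#include <fcntl.h>
#include <mpi.h>
#include <stdio.h>

static int open_destination(const char* dst_path, int rank, MPI_Comm comm)
{
    int fd = -1;
    if (rank == 0) {
        /* create the destination file exactly once */
        fd = open(dst_path, O_CREAT | O_WRONLY, 0644);
    }

    /* don't let non-zero ranks open until the create has completed */
    MPI_Barrier(comm);

    if (rank != 0) {
        fd = open(dst_path, O_WRONLY);
    }
    if (fd < 0) {
        perror("open() destination file");
    }
    return fd;
}
```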
Upon changing the logic so that all ranks open the file, the parallel transfer then becomes a race condition. It appears that once one of the clients finishes writing, syncs, and laminates, its server broadcasts the laminate and the other servers fail with:
```
2021-09-07T14:06:33 tid=61200 @ process_service_requests() [unifyfs_service_manager.c:1299] processing 1 service requests
2021-09-07T14:06:33 tid=61200 @ process_laminate_bcast_rpc() [unifyfs_service_manager.c:1157] gfid=43719796 num_extents=2
2021-09-07T14:06:33 tid=61200 @ unifyfs_inode_add_extents() [unifyfs_inode.c:363] trying to add extents to a laminated file (gfid=43719796)
2021-09-07T14:06:33 tid=61200 @ sm_add_extents() [unifyfs_service_manager.c:510] failed to add 2 extents to gfid=43719796 (rc=22, is_owner=0)
2021-09-07T14:06:33 tid=61200 @ process_laminate_bcast_rpc() [unifyfs_service_manager.c:1179] extent add during laminate(gfid=43719796) failed - rc=22
```
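A sketch of the coordination that seems to be missing, under the same assumptions (hypothetical helper, MPI available, lamination via `chmod()` as in the UnifyFS examples): every rank syncs its extents, all ranks hit a barrier, and only then does a single rank laminate:

```c
/* Minimal sketch, same assumptions as above; error propagation across
 * ranks is elided. Every rank flushes its own extents, then all ranks
 * synchronize at a barrier before a single rank laminates, so the
 * laminate broadcast can't race with another rank's outstanding
 * extent syncs. */
#include <mpi.h>
#include <sys/stat.h>
#include <unistd.h>

static int finish_and_laminate(int fd, const char* dst_path,
                               int rank, MPI_Comm comm)
{
    int rc = fsync(fd);   /* sync this rank's writes to its server */
    close(fd);

    MPI_Barrier(comm);    /* every rank has now synced its extents */

    if (rank == 0 && rc == 0) {
        rc = chmod(dst_path, 0444);  /* laminate exactly once */
    }
    return rc;
}
```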