LLNL / UnifyFS

UnifyFS: A file system for burst buffers

Transfer file: bug(s) in parallel transfer logic #686

Closed CamStan closed 2 years ago

CamStan commented 3 years ago

Describe the problem you're observing

Attempting to parallel transfer a file with either the unifyfs-stage or the transfer-static application fails.

Found while adding a parallel transfer test to #685. The test is skipped in that PR. Once resolved, we'll want to enable that test.

Describe how to reproduce the problem

# start unifyfs server

# create a file to transfer; needs to be large enough to avoid the logic that defaults to serial transfer for small files
export source_file=${MY_JOBDIR}/src_transfer_in.file
dd if=/dev/urandom bs=1M count=100 of=$source_file

# do a parallel transfer
export unifyfs_file=/unifyfs/im_transfer_parallel.file

jsrun -n 4 -a 2 -c 18 -r 1 -e individual -o $UNIFYFS_LOG_DIR/transfer-static-in-parallel.out -k $UNIFYFS_LOG_DIR/transfer-static-in-parallel.err $UNIFYFS_EXAMPLES/transfer-static $source_file $unifyfs_file -p

Include any warnings, errors, or relevant debugging data

There appear to be several points in the logic where multiple processes fail to coordinate during the transfer.

The first issue is that only rank 0 creates the destination file: https://github.com/LLNL/UnifyFS/blob/3da95deb6254609932511ecdcdd484a51313623e/client/src/posix_client.c#L875-L877

This causes the non-zero ranks to fail when attempting to open the file, since they can't find a local or global reference (https://github.com/LLNL/UnifyFS/blob/3da95deb6254609932511ecdcdd484a51313623e/client/src/posix_client.c#L750-L756), resulting in:

2021-09-03T15:21:26 tid=40279 @ invoke_client_metaget_rpc() [margo_client.c:487] Got response ret=2
2021-09-03T15:21:26 tid=40279 @ transfer_file_parallel() [posix_client.c:753] failed to open() destination file /unifyfs/im_stage_parallel.file
[1] failed to transfer file (err=2)
failed to transfer file (src=/p/gpfs1/stanavig/jobs/unify/lassen/2776705/src_transfer_in.file, dst=/unifyfs/im_stage_parallel.file): Unknown error -2
data transfer failed (No such file or directory)

After changing the logic so that all ranks open the file, the parallel transfer then hits a race condition.

It appears that once one of the clients finishes writing, syncs, and laminates, that first server broadcasts the laminate and the other servers fail with:

2021-09-07T14:06:33 tid=61200 @ process_service_requests() [unifyfs_service_manager.c:1299] processing 1 service requests
2021-09-07T14:06:33 tid=61200 @ process_laminate_bcast_rpc() [unifyfs_service_manager.c:1157] gfid=43719796 num_extents=2
2021-09-07T14:06:33 tid=61200 @ unifyfs_inode_add_extents() [unifyfs_inode.c:363] trying to add extents to a laminated file (gfid=43719796)
2021-09-07T14:06:33 tid=61200 @ sm_add_extents() [unifyfs_service_manager.c:510] failed to add 2 extents to gfid=43719796 (rc=22, is_owner=0)
2021-09-07T14:06:33 tid=61200 @ process_laminate_bcast_rpc() [unifyfs_service_manager.c:1179] extent add during laminate(gfid=43719796) failed - rc=22
CamStan commented 2 years ago

Parallel transfer tests all ran in our CI and passed. So calling this good.