LLNL / UnifyFS

H5Bench small read benchmark failing with Mercury timeout #790

Closed: CamStan closed this issue 3 months ago

CamStan commented 1 year ago

Describe the problem you're observing

Running the H5Bench sync-write-read-contig-1d-small benchmark at some scales results in a Mercury timeout:

margo_forward_timed() failed - HG_TIMEOUT

Increasing UNIFYFS_MARGO_CLIENT_TIMEOUT fixed two other tests that hit this issue, but it appears to have no effect on this one.

This error occurs at 1N/1n, 2N/2n, and 8N/32n (nodes/processes). However, this same test passes at 4N/8n and 16N/128n without needing to increase UNIFYFS_MARGO_CLIENT_TIMEOUT at all.
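
For concreteness, one of the failing scales expressed as the shell variables used in the srun commands below (a sketch; N = nodes, n = total MPI processes):

# 8N/32n, one of the failing combinations
nnodes=8
nprocs=32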

The error occurs on Frontier; I haven't attempted to reproduce it on other systems yet.

Describe how to reproduce the problem

The same h5bench config works for both the write and read benchmarks:

$ cat h5bench.cfg
MEM_PATTERN=CONTIG
FILE_PATTERN=CONTIG
TIMESTEPS=5
DELAYED_CLOSE_TIMESTEPS=2
COLLECTIVE_DATA=YES
COLLECTIVE_METADATA=YES
EMULATED_COMPUTE_TIME_PER_TIMESTEP=5 s
NUM_DIMS=1
DIM_1=16777216
DIM_2=1
DIM_3=1
CSV_FILE=/unifyfs/sync-write-read-contig-1d-small/output.csv
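
For scale, a back-of-the-envelope sketch of the data volume this config implies. The 8-variable, 4-byte-per-element layout is an assumption about h5bench's VPIC-derived writer, not something stated in this report:

# 16777216 elements x 8 variables x 4 bytes per rank per timestep (assumed layout)
echo $(( 16777216 * 8 * 4 / 1024 / 1024 )) MiB       # 512 MiB per rank per timestep
echo $(( 16777216 * 8 * 4 * 5 / 1024 / 1024 )) MiB   # 2560 MiB per rank over 5 timesteps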

Set up environment and UnifyFS:

ml gcc/12.2.0
ml PrgEnv-gnu
ml cray-hdf5-parallel/1.12.2.1
module unload darshan-runtime
module use /sw/frontier/unifyfs/modulefiles
module load unifyfs/1.1/gcc-12.2.0

export UNIFYFS_LOG_VERBOSITY=5

Start UnifyFS and then run h5bench_write followed by h5bench_read:
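
The server-launch command itself isn't shown here; a minimal sketch using the unifyfs utility (the share directory is a placeholder and should be on a path visible to all nodes):

# Start the unifyfsd servers across the job allocation
unifyfs start --share-dir=/path/to/share-dir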

srun -N $nnodes -n $nprocs -e sync-write-read-contig-1d-small-%n_write.err -o sync-write-read-contig-1d-small_write.out --export=ALL,LD_PRELOAD=/path/to/lib/libunifyfs_mpi_gotcha.so /path/to/h5bench_write /path/to/h5bench.cfg /unifyfs/test.h5

sleep 10

srun -N $nnodes -n $nprocs -e sync-write-read-contig-1d-small-%n_read.err -o sync-write-read-contig-1d-small_read.out --export=ALL,LD_PRELOAD=/path/to/lib/libunifyfs_mpi_gotcha.so /path/to/h5bench_read /path/to/h5bench.cfg /unifyfs/test.h5
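
To clean up afterwards, the servers can be stopped with the same utility (a sketch; not part of the original reproduction steps):

# Stop the unifyfsd servers started above
unifyfs terminate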

Include any warnings, errors, or relevant debugging data

I increased UNIFYFS_MARGO_CLIENT_TIMEOUT multiple times (up to 10 minutes, as the timestamps below show), but still get the same error.
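
For reference, the setting being increased, at its largest tried value; UnifyFS interprets UNIFYFS_MARGO_CLIENT_TIMEOUT in milliseconds (if your version differs, treat the unit as an assumption):

# 10 minutes; matches the 14:27:21 -> 14:37:21 gap in the client log below
export UNIFYFS_MARGO_CLIENT_TIMEOUT=600000

Client log: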

2023-08-01T14:27:21 tid=43979 @ unifyfs_initialize() [unifyfs_api.c:256] calling attach rpc
2023-08-01T14:27:21 tid=43979 @ invoke_client_attach_rpc() [margo_client.c:343] invoking the attach rpc function in client
2023-08-01T14:37:21 tid=43979 @ forward_to_server() [margo_client.c:234] margo_forward_timed() failed - HG_TIMEOUT
2023-08-01T14:37:21 tid=43979 @ invoke_client_attach_rpc() [margo_client.c:347] forward of attach rpc to server failed
2023-08-01T14:37:21 tid=43979 @ unifyfs_initialize() [unifyfs_api.c:260] failed to attach to server
2023-08-01T14:37:21 tid=43979 @ get_page_size() [../../common/src/unifyfs_logio.c:83] returning page size 4096 B
2023-08-01T14:37:21 tid=43979 @ posix_client_init() [posix_client.c:460] client initialization failed - "Mercury/Argobots operation error"
UNIFYFS ERROR: unifyfs_mount(/unifyfs) failed with '"Mercury/Argobots operation error"'

Server log from around the same time as the HG_TIMEOUT. The mount succeeds and a request manager thread is created for the new client, but the server's heartbeat RPC back to the client fails immediately (ret=22), so the server treats the client as failed and cleans it up, which would explain why the client's attach RPC never receives a response:

2023-08-01T14:27:21 tid=43934 @ unifyfs_mount_rpc() [unifyfs_client_rpc.c:148] using existing app_config for app_id=1511587981
2023-08-01T14:27:21 tid=43934 @ unifyfs_mount_rpc() [unifyfs_client_rpc.c:152] creating new app client for na+sm://43979-0
2023-08-01T14:27:21 tid=43934 @ unifyfs_mount_rpc() [unifyfs_client_rpc.c:162] created new application client 1511587981:2
2023-08-01T14:27:21 tid=43982 @ request_manager_thread() [unifyfs_request_manager.c:1738] I am request manager [app=1511587981:client=2] thread!
2023-08-01T14:27:21 tid=43982 @ rm_heartbeat() [unifyfs_request_manager.c:1710] sending heartbeat rpc
2023-08-01T14:27:21 tid=43982 @ invoke_client_heartbeat_rpc() [margo_server.c:632] invoking the heartbeat rpc function in client[1511587981:2]
2023-08-01T14:27:21 tid=43982 @ invoke_client_heartbeat_rpc() [margo_server.c:646] Got response ret=22
2023-08-01T14:27:21 tid=43982 @ rm_heartbeat() [unifyfs_request_manager.c:1716] heartbeat rpc for client[1511587981:2] failed
2023-08-01T14:27:21 tid=43982 @ request_manager_thread() [unifyfs_request_manager.c:1810] RM[1511587981:2] thread exiting
2023-08-01T14:27:22 tid=43931 @ process_client_failures() [unifyfs_server.c:259] processing 1 client failures
2023-08-01T14:27:22 tid=43931 @ cleanup_app_client() [unifyfs_server.c:1075] cleaning application client 1511587981:2
2023-08-01T14:37:31 tid=43931 @ exit_request() [unifyfs_server.c:149] exit requested
2023-08-01T14:37:31 tid=43931 @ main() [unifyfs_server.c:506] starting service shutdown
CamStan commented 1 year ago

#785 may fix this. Test again after it merges.

wangvsa commented 4 months ago

Just tried this on Corona. Didn't see any issues.

CamStan commented 3 months ago

Thanks for testing this, @wangvsa.