LLNL / UnifyFS

UnifyFS: A file system for burst buffers

PnetCDF test leads to margo error, which leads to hang in ROMIO #783

Open adammoody opened 1 year ago

adammoody commented 1 year ago

While running a particular PnetCDF test

https://github.com/Parallel-NetCDF/PnetCDF/blob/master/test/largefile/high_dim_var.c

with 4 ranks on 2 nodes, a read from rank 2 invokes a failure on the server, which generates the following logs:

2023-07-06T16:00:56 tid=872735 @ signal_new_requests() [unifyfs_request_manager.c:269] signaling new requests
2023-07-06T16:00:56 tid=873012 @ request_manager_thread() [unifyfs_request_manager.c:1802] RM[1511587981:1] got work
2023-07-06T16:00:56 tid=873012 @ rm_process_client_requests() [unifyfs_request_manager.c:1631] processing 1 client requests
2023-07-06T16:00:56 tid=873012 @ process_read_rpc() [unifyfs_request_manager.c:1324] processing mread[0] with 1 requests
2023-07-06T16:00:56 tid=873012 @ submit_read_request() [unifyfs_fops_rpc.c:252] handling read request (1 extents)
2023-07-06T16:00:56 tid=873012 @ pull_margo_bulk_buffer() [../../common/src/unifyfs_rpc_util.c:179] margo_bulk_transfer(buf_offset=0, len=1572864) failed
2023-07-06T16:00:56 tid=873012 @ pull_margo_bulk_buffer() [../../common/src/unifyfs_rpc_util.c:197] failed bulk transfer - transferred 0 of 1572864 bytes
2023-07-06T16:00:56 tid=873012 @ unifyfs_invoke_find_extents_rpc() [unifyfs_p2p_rpc.c:665] failed to get bulk chunk locations
2023-07-06T16:00:56 tid=873012 @ submit_read_request() [unifyfs_fops_rpc.c:279] failed to find extent locations
2023-07-06T16:00:56 tid=873012 @ process_read_rpc() [unifyfs_request_manager.c:1333] unifyfs_fops_read() failed
2023-07-06T16:00:56 tid=873012 @ rm_process_client_requests() [unifyfs_request_manager.c:1690] client rpc request 0 failed ("Mercury/Argobots operation error")
2023-07-06T16:00:56 tid=873012 @ request_manager_thread() [unifyfs_request_manager.c:1768] failed to process client rpc requests
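
To follow the failure chain in longer logs, it can help to split each line into its fields and keep only the failing entries. The following is a minimal sketch that assumes every line has the layout shown above (timestamp, `tid=`, function, `[file:line]`, message). The `log_entry` type and the `parse_log_line()` and `is_failure()` helpers are hypothetical names made up for this illustration; they are not part of UnifyFS.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one server log line (not a UnifyFS type). */
typedef struct {
    char timestamp[32];  /* e.g. 2023-07-06T16:00:56 */
    long tid;            /* thread id from the tid= field */
    char func[64];       /* function name without the trailing "()" */
    char file[256];      /* source file path inside the brackets */
    int  line;           /* source line number inside the brackets */
    char msg[512];       /* remainder of the line */
} log_entry;

/* Parse one line of the assumed form
 *   <timestamp> tid=<n> @ <func>() [<file>:<line>] <message>
 * Returns 1 if all six fields were found, 0 otherwise. */
static int parse_log_line(const char* text, log_entry* e)
{
    assert(text != NULL && e != NULL);
    int n = sscanf(text,
                   "%31s tid=%ld @ %63[^(]() [%255[^:]:%d] %511[^\n]",
                   e->timestamp, &e->tid, e->func, e->file,
                   &e->line, e->msg);
    return n == 6;
}

/* Heuristic only: treat any message containing "fail" as a failure. */
static int is_failure(const log_entry* e)
{
    return strstr(e->msg, "fail") != NULL;
}
```

Calling `parse_log_line()` on each line and printing `func`, `file`, and `line` for the entries where `is_failure()` is true would list the call path from `pull_margo_bulk_buffer()` up to `request_manager_thread()`. The substring match is deliberately crude and would need adjusting if other messages happen to contain "fail".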

The error code returned to the client for the read is 1004. That probably corresponds to one of these:

https://github.com/mercury-hpc/mercury/blob/55b95f72714bb0e4e0deeedf4fd78d116ea9476a/src/mercury_core_types.h#L102-L108

The read error happens during MPI_File_read_at_all, which then leads to a deadlock in ROMIO: https://github.com/pmodels/mpich/issues/6585