clmendes opened this issue 4 years ago (status: Open)
@MichaelBrim , would you be able to try the code above on two nodes of Summit? I had run it on LLNL's Quartz, and am curious to see if a multi-node execution on Summit would also take nearly two minutes for those 10 remote Read operations! I ran this on Quartz using a Unify version based on the new margotree branch.
@clmendes I'll see if I can get results before this afternoon's meeting, but no guarantee.
@clmendes, do you know if Lassen behaves the same?
@adammoody: I don't really know the status on Lassen; I've never used it. I'm supposed to have an account and could try it, but I'd have to build Unify first, even for this simple non-HDF5 test.
Seems like they switched to using OMPIO rather than ROMIO in Summit's Spectrum MPI. I'm not sure if our UnifyFS support for MPI-IO on Summit is still functional.
I can't remember if I've tried this with Spectrum, but with Open MPI I know one can switch from OMPIO to ROMIO via an MCA parameter, e.g., `mpirun --mca io romio314`.
@adammoody On Summit, the ROMIO MCA library is missing from the install, so I don't think I have a choice.
As planned with @adammoody and @kathrynmohror, I am creating this issue to document a problem that I am facing in one of the Parallel-HDF5 tests, fltread (parallel read of a dataset written serially with filters). That test itself is now passing, as long as I reduce the dataset to a size small enough that the execution finishes in a reasonable time. Even with that small size, the execution on two nodes takes more than 4 minutes, whereas the execution on a single node (with 4 processors in both cases) finishes in only 2 seconds!
I reproduced the essence of the issue with the test program below, which is independent of HDF5 and based only on MPI-IO. The program writes data to a file from rank=0 only, and then all ranks read from the file. The real HDF5 test repeats this several times, but the test program below is sufficient to reproduce the problematic behavior.
The test program, when run with 4 processors on the same node, finishes in less than a second. However, when run with 4 processors across 2 nodes (i.e., 2 processors per node), it takes nearly two minutes! These are the first and last records of the client log from that multi-node execution:
```
2020-11-01T17:15:22 tid=65212 @ rpc_lookup_local_server_addr()
. . .
2020-11-01T17:17:13 tid=65213 @ __wrap_write() [unifyfs-sysio.c:1422]
```
Thus, 111 seconds elapsed between those first and last records! Note that each rank performs 10 Read operations in this program. The Reads by ranks 0 and 1 are fast, and their timestamps are identical:
For Rank=0:
```
2020-11-01T17:15:22 tid=10428 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:15:22 tid=10428 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:15:22 tid=10428 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:15:22 tid=10428 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:15:22 tid=10428 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:15:22 tid=10428 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:15:22 tid=10428 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:15:22 tid=10428 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:15:22 tid=10428 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:15:22 tid=10428 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
```
For Rank=1:
```
2020-11-01T17:15:22 tid=10429 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:15:22 tid=10429 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:15:22 tid=10429 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:15:22 tid=10429 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:15:22 tid=10429 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:15:22 tid=10429 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:15:22 tid=10429 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:15:22 tid=10429 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:15:22 tid=10429 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:15:22 tid=10429 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
```
However, this is very different for Ranks 2 and 3!
For Rank=2:
```
2020-11-01T17:15:23 tid=65212 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:15:29 tid=65212 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:15:33 tid=65212 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:15:35 tid=65212 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:16:08 tid=65212 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:16:26 tid=65212 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:17:12 tid=65212 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:17:12 tid=65212 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:17:13 tid=65212 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:17:13 tid=65212 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
```
For Rank=3:
```
2020-11-01T17:15:23 tid=65213 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:15:29 tid=65213 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:15:33 tid=65213 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:15:35 tid=65213 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:16:08 tid=65213 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:16:26 tid=65213 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:17:12 tid=65213 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:17:12 tid=65213 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:17:13 tid=65213 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
2020-11-01T17:17:13 tid=65213 @ process_read_data() [unifyfs.c:753] copied data to application buffer (20 bytes)
```
Hence, as shown above, 33 seconds elapse between the fourth and fifth Reads on both ranks, which were running on the remote node (the file had been written by rank=0 only). The log for Rank=3 contains the following records during that interval:
```
2020-11-01T17:15:35 tid=65213 @ invoke_client_read_rpc() [margo_client.c:595] invoking the read rpc function in client
2020-11-01T17:15:35 tid=65213 @ invoke_client_read_rpc() [margo_client.c:604] Got response ret=0
2020-11-01T17:16:08 tid=65213 @ process_read_data() [unifyfs.c:669] processing data response from server: [0] (gfid=1030688322, offset=880, length=20, errcode=0)
```
Thus, the 33-second delay seems to be caused by the long wait for the server response. The server on that second node also shows a jump in timestamps in its log around that time:
```
2020-11-01T17:15:35 tid=65345 @ rm_request_remote_chunks() [unifyfs_request_manager.c:623] [0 of 1] sending 1 chunk requests to server 0
2020-11-01T17:15:35 tid=65345 @ invoke_chunk_read_request_rpc() [unifyfs_request_manager.c:1189] invoking the chunk-read-request rpc function
2020-11-01T17:16:08 tid=65344 @ invoke_chunk_read_request_rpc() [unifyfs_request_manager.c:1199] Got request rpc response from 0 - ret=0
2020-11-01T17:16:08 tid=65345 @ invoke_chunk_read_request_rpc() [unifyfs_request_manager.c:1199] Got request rpc response from 0 - ret=0
2020-11-01T17:16:08 tid=65190 @ chunk_read_response_rpc() [unifyfs_request_manager.c:1235] received chunk read response from server 0 (1 chunks)
```
Meanwhile, on the server of the first node, there is the following in the log:
```
2020-11-01T17:15:35 tid=10368 @ sm_issue_chunk_reads() [unifyfs_service_manager.c:155] issuing 1 requests, total data size = 20
2020-11-01T17:15:35 tid=10368 @ sm_issue_chunk_reads() [unifyfs_service_manager.c:180] reading chunk(offset=880, size=20)
2020-11-01T17:15:35 tid=10368 @ sm_issue_chunk_reads() [unifyfs_service_manager.c:218] adding to svcmgr chunk_reads
2020-11-01T17:15:35 tid=10368 @ sm_issue_chunk_reads() [unifyfs_service_manager.c:221] locking service manager state
2020-11-01T17:16:08 tid=10365 @ sm_issue_chunk_reads() [unifyfs_service_manager.c:223] unlocking service manager state
2020-11-01T17:16:08 tid=10365 @ sm_issue_chunk_reads() [unifyfs_service_manager.c:227] done adding to svcmgr chunk_reads
2020-11-01T17:16:08 tid=10368 @ sm_issue_chunk_reads() [unifyfs_service_manager.c:223] unlocking service manager state
2020-11-01T17:16:08 tid=10368 @ sm_issue_chunk_reads() [unifyfs_service_manager.c:227] done adding to svcmgr chunk_reads
```
This suggests that the wait also occurs on the server side, since it shows up between the locking and unlocking events in the server's log. Lines 221/222/223 in unifyfs_service_manager.c bracket the SM_LOCK() and arraylist_add() calls between the "locking service manager state" and "unlocking service manager state" log messages.
Function arraylist_add (under common/src/arraylist.c) is simple, so the 33 seconds of wait can only be coming from the SM_LOCK() operation.
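For illustration, here is a minimal, self-contained sketch of that pattern, using a pthread mutex and a toy list as stand-ins (the names sm_mutex, add_chunk_read, and the array are hypothetical, not the UnifyFS source). The point is that when another thread holds the lock, the gap between the "locking" and "unlocking" messages is spent inside SM_LOCK():

```c
/* Minimal sketch (hypothetical names) of the pattern implied by the log:
 * log "locking", take the service-manager lock, append the chunk-read
 * request to a list, log "unlocking", release the lock. If another thread
 * holds the lock, the 33-second gap lands inside pthread_mutex_lock(). */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t sm_mutex = PTHREAD_MUTEX_INITIALIZER;

/* stand-ins for the SM_LOCK()/SM_UNLOCK() macros */
#define SM_LOCK()   pthread_mutex_lock(&sm_mutex)
#define SM_UNLOCK() pthread_mutex_unlock(&sm_mutex)

/* toy list standing in for the svcmgr chunk_reads arraylist */
static void* chunk_reads[128];
static int   num_chunk_reads = 0;

static void add_chunk_read(void* req)
{
    printf("locking service manager state\n");
    SM_LOCK();                            /* a contended lock stalls right here */
    chunk_reads[num_chunk_reads++] = req; /* cheap, like arraylist_add() */
    printf("unlocking service manager state\n");
    SM_UNLOCK();
}

int main(void)
{
    int dummy = 0;
    add_chunk_read(&dummy);
    return 0;
}
```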
The test code is shown below. It must be run on FOUR processors in TWO nodes.
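A minimal sketch of the reproducer described above, based on the behavior in the logs (10 reads of 20 bytes per rank); the file name, the per-rank record layout, and the use of MPI_File_* calls are assumptions for illustration, not the exact program:

```c
/* Sketch of the reproducer: rank 0 writes a small file, then all ranks
 * each perform NREADS reads of RECSIZE bytes from it via MPI-IO.
 * File name and data layout are assumptions. Run with 4 processes
 * spread across 2 nodes (2 per node). */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define NREADS  10   /* reads per rank, as in the logs above */
#define RECSIZE 20   /* bytes per read, as in the logs above */

int main(int argc, char** argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* assumed UnifyFS mount point and file name */
    char fname[] = "/unifyfs/testfile";
    MPI_File fh;

    if (rank == 0) {
        /* rank 0 writes enough records for every rank to read NREADS of them */
        char rec[RECSIZE];
        memset(rec, 'x', RECSIZE);
        MPI_File_open(MPI_COMM_SELF, fname,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        for (int i = 0; i < nprocs * NREADS; i++) {
            MPI_File_write_at(fh, (MPI_Offset)i * RECSIZE, rec, RECSIZE,
                              MPI_CHAR, MPI_STATUS_IGNORE);
        }
        MPI_File_close(&fh);
    }

    /* all ranks wait until the file has been written and closed */
    MPI_Barrier(MPI_COMM_WORLD);

    /* every rank reads its own NREADS records of RECSIZE bytes */
    MPI_File_open(MPI_COMM_WORLD, fname, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    char buf[RECSIZE];
    double t0 = MPI_Wtime();
    for (int i = 0; i < NREADS; i++) {
        MPI_Offset off = (MPI_Offset)(rank * NREADS + i) * RECSIZE;
        MPI_File_read_at(fh, off, buf, RECSIZE, MPI_CHAR, MPI_STATUS_IGNORE);
    }
    double t1 = MPI_Wtime();
    MPI_File_close(&fh);

    printf("rank %d: %d reads of %d bytes took %.3f s\n",
           rank, NREADS, RECSIZE, t1 - t0);

    MPI_Finalize();
    return 0;
}
```

Launched with 2 processes per node across 2 nodes, ranks 2 and 3 read data that was written on the other node, which is the remote-read path where the delays above show up.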