LLNL / UnifyFS

UnifyFS: A file system for burst buffers

Occasional rpc initialization failure in client #496

Closed: sandrain closed this issue 4 years ago

sandrain commented 4 years ago

System information

Type              Version/Name
Operating System  RHEL74
OS Version        3.10.0
Architecture      x86_64
UnifyFS Version   dev

Describe the problem you're observing

RPC initialization occasionally fails. This seems to happen when client applications are launched multiple times against a single server instance. The failure looks like:

# NA -- Error -- /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/na/na_sm.c:1066
 # na_sm_create_sock(): bind() socket (Address already in use)

UnifyFS does not appear to check for this failure; it just continues its execution.
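
For reference, a minimal sketch of how the client could fail fast when the margo/Mercury initialization does not succeed, instead of continuing with an unusable RPC class. This is not the actual unifyfs_client_rpc_init() code; the helper name and flow are illustrative only.

/* Hypothetical helper mirroring the client log above: initialize margo in
 * client mode for the given protocol, then look up the published server
 * address. Return an error as soon as either step fails. */
#include <stdio.h>
#include <margo.h>

static int client_rpc_init(const char* proto, const char* svr_addr_str,
                           margo_instance_id* mid_out, hg_addr_t* svr_addr_out)
{
    margo_instance_id mid = margo_init(proto, MARGO_CLIENT_MODE, 1, 1);
    if (mid == MARGO_INSTANCE_NULL) {
        /* this is the case currently ignored: the na+sm bind() failed,
         * so there is no usable Mercury class to issue RPCs with */
        fprintf(stderr, "margo_init(%s) failed; aborting mount\n", proto);
        return -1;
    }

    hg_return_t hret = margo_addr_lookup(mid, svr_addr_str, svr_addr_out);
    if (hret != HG_SUCCESS) {
        fprintf(stderr, "margo_addr_lookup(%s) failed: %d\n",
                svr_addr_str, (int)hret);
        margo_finalize(mid);
        return -1;
    }

    *mid_out = mid;
    return 0;
}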

Describe how to reproduce the problem

Launch a server, then run a client application (from examples) multiple times. I haven't figured out how to reproduce this deterministically.

Include any warnings, errors, or relevant debugging data

The output from the client, running sysio-write-static:

2020-05-03T14:03:37 tid=3616 @ __wrap_stat() [client/src/unifyfs-sysio.c:417] stat was called for /var/tmp/kvstore
2020-05-03T14:03:37 tid=3616 @ __wrap___xstat() [client/src/unifyfs-sysio.c:457] xstat was called for /var/tmp/kvstore
2020-05-03T14:03:37 tid=3616 @ rpc_lookup_local_server_addr() [common/src/unifyfs_rpc_util.c:85] found local server rpc address 'na+sm://3457/0'
2020-05-03T14:03:37 tid=3616 @ unifyfs_client_rpc_init() [client/src/margo_client.c:100] svr_addr:'na+sm://3457/0' proto:'na+sm'
# NA -- Error -- /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/na/na_sm.c:1066
 # na_sm_create_sock(): bind() socket (Address already in use)
# NA -- Error -- /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/na/na_sm.c:1374
 # na_sm_setup_shm(): Could not create sock
# NA -- Error -- /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/na/na_sm.c:2514
 # na_sm_initialize(): Could not setup shm
# NA -- Error -- /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/na/na.c:356
 # NA_Initialize_opt(): Could not initialize plugin
# HG -- Error -- /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/mercury_core.c:1135
 # hg_core_init(): Could not initialize NA class
# HG -- Error -- /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/mercury_core.c:3597
 # HG_Core_init_opt(): Cannot initialize HG core layer
# HG -- Error -- /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/mercury.c:1092
 # HG_Init_opt(): Could not create HG core class
# HG -- Error -- /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/mercury_core.c:3675
 # HG_Core_context_create_id(): NULL HG core class
# HG -- Error -- /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/mercury.c:1188
 # HG_Context_create_id(): Could not create context for ID 0
# HG -- Error -- /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/mercury_core.c:4448
 # HG_Core_create(): NULL HG core context
# HG -- Error -- /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/mercury.c:1742
 # HG_Create(): Cannot create HG handle with ID 0
2020-05-03T14:03:37 tid=30278 @ unifyfs_mount() [client/src/unifyfs.c:2482] calling mount rpc
2020-05-03T14:03:37 tid=30278 @ invoke_client_mount_rpc() [client/src/margo_client.c:267] invoking the mount rpc function in client
2020-05-03T14:03:37 tid=29708 @ unifyfs_mount() [client/src/unifyfs.c:2482] calling mount rpc
2020-05-03T14:03:37 tid=29708 @ invoke_client_mount_rpc() [client/src/margo_client.c:267] invoking the mount rpc function in client
2020-05-03T14:03:37 tid=29454 @ unifyfs_mount() [client/src/unifyfs.c:2482] calling mount rpc
2020-05-03T14:03:37 tid=29454 @ invoke_client_mount_rpc() [client/src/margo_client.c:267] invoking the mount rpc function in client
2020-05-03T14:03:37 tid=29536 @ unifyfs_mount() [client/src/unifyfs.c:2482] calling mount rpc
2020-05-03T14:03:37 tid=29536 @ invoke_client_mount_rpc() [client/src/margo_client.c:267] invoking the mount rpc function in client
2020-05-03T14:03:37 tid=12141 @ unifyfs_mount() [client/src/unifyfs.c:2482] calling mount rpc
2020-05-03T14:03:37 tid=12141 @ invoke_client_mount_rpc() [client/src/margo_client.c:267] invoking the mount rpc function in client
2020-05-03T14:03:37 tid=3620 @ unifyfs_mount() [client/src/unifyfs.c:2482] calling mount rpc
2020-05-03T14:03:37 tid=3620 @ invoke_client_mount_rpc() [client/src/margo_client.c:267] invoking the mount rpc function in client
2020-05-03T14:03:38 tid=30278 @ invoke_client_mount_rpc() [client/src/margo_client.c:280] Got response ret=0
2020-05-03T14:03:38 tid=30278 @ invoke_client_mount_rpc() [client/src/margo_client.c:284] set unifyfs_key_slice_range=1048576
2020-05-03T14:03:38 tid=30278 @ unifyfs_init() [client/src/unifyfs.c:2242] FD limit for system = 1024
sandrain commented 4 years ago

Additionally, when many clients try to connect to the server (e.g., -ppn=32), the server fails with a segmentation fault. For instance, I launched server daemons on 6 nodes, waited for their initialization, and then launched sysio-write-static with mpirun -n 192 -ppn 32 ... (each node has 32 cores). Most of the time the server is killed by a segmentation fault:

(unifyfs) [root@rage17 UnifyFS]$ !gdb
gdb --pid=$(pidof unifyfsd | awk '{print $1}')
Attaching to process 9015
[New LWP 9016]
[New LWP 9017]
[New LWP 9018]
[New LWP 9019]
[New LWP 9020]
[New LWP 9021]
[New LWP 9022]
[New LWP 9023]
[New LWP 9024]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fbea674384d in nanosleep () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_7.2.x86_64 leveldb-1.12.0-11.el7.x86_64 libcom_err-1.42.9-16.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libgfortran-4.8.5-39.el7.x86_64 libquadmath-4.8.5-39.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 libstdc++-4.8.5
(gdb) c
Continuing.
[New Thread 0x7fbe917fb700 (LWP 9145)]
[New Thread 0x7fbe90ffa700 (LWP 9146)]

Thread 2 "unifyfsd" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fbea3788700 (LWP 9016)]
0x00007fbea78ff4ec in na_sm_addr_free (na_class=0x25f9170, addr=0x7fbe9c0013d0)
    at /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/na/na_sm.c:2843
2843            HG_QUEUE_REMOVE(&NA_SM_CLASS(na_class)->accepted_addr_queue, na_sm_addr,
(gdb) bt
#0  0x00007fbea78ff4ec in na_sm_addr_free (na_class=0x25f9170, addr=0x7fbe9c0013d0)
    at /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/na/na_sm.c:2843
#1  0x00007fbea78fce86 in na_sm_progress_error (na_class=0x25f9170, poll_addr=0x7fbe9c0013d0, error=16)
    at /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/na/na_sm.c:1852
#2  0x00007fbea78fcc45 in na_sm_progress_cb (arg=0x7fbe9c001460, timeout=100, error=16, 
    progressed=0x7fbea288808f "")
    at /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/na/na_sm.c:1796
#3  0x00007fbea7b0c14f in hg_poll_wait (poll_set=0x25f92e0, timeout=100, progressed=0x7fbea28880ff "")
    at /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/util/mercury_poll.c:465
#4  0x00007fbea790194a in na_sm_progress (na_class=0x25f9170, context=0x25fbc40, timeout=100)
    at /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/na/na_sm.c:3943
#5  0x00007fbea78f63e7 in NA_Progress (na_class=0x25f9170, context=0x25fbc40, timeout=100)
    at /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/na/na.c:1413
#6  0x00007fbea7f27d7c in hg_core_progress_na_cb (arg=0x25f9a70, timeout=100, error=0, 
    progressed=0x7fbea288856f "")
    at /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/mercury_core.c:3016
#7  0x00007fbea7b0c14f in hg_poll_wait (poll_set=0x25f91f0, timeout=100, progressed=0x7fbea28885cf "")
    at /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/util/mercury_poll.c:465
#8  0x00007fbea7f281e3 in hg_core_progress_poll (context=0x25f9a70, timeout=100)
    at /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/mercury_core.c:3258
#9  0x00007fbea7f2aef8 in HG_Core_progress (context=0x25f9a70, timeout=100)
    at /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/mercury_core.c:4824
#10 0x00007fbea7f1f007 in HG_Progress (context=0x25f91a0, timeout=100)
    at /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/mercury.c:2173
#11 0x00007fbea833c95b in hg_progress_fn (foo=0x27b6820) at src/margo.c:1281
#12 0x00007fbea76e170b in ABTD_thread_func_wrapper_thread ()
   from /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/prefix/lib/libabt.so.0
#13 0x00007fbea76e1db1 in make_fcontext ()
   from /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/prefix/lib/libabt.so.0
#14 0x0000000000000000 in ?? ()
(gdb) 
MichaelBrim commented 4 years ago

@sandrain It appears this stack trace is entirely within one of the margo threads (i.e., no Unify code involved). Do you know which version of mercury this is using? Maybe we could create a standalone margo test that reproduces the same behavior and report back.
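
For what it's worth, a minimal sketch of what such a standalone margo test could look like (no UnifyFS code involved; the program layout and argument handling are assumptions): run one instance as the server, then repeatedly launch many instances as clients against the printed na+sm address.

#include <stdio.h>
#include <string.h>
#include <margo.h>

int main(int argc, char** argv)
{
    if (argc >= 2 && strcmp(argv[1], "server") == 0) {
        /* server: listen on na+sm and print the address clients should use */
        margo_instance_id mid = margo_init("na+sm", MARGO_SERVER_MODE, 1, 1);
        if (mid == MARGO_INSTANCE_NULL)
            return 1;

        hg_addr_t self;
        char addr_str[128];
        hg_size_t len = sizeof(addr_str);
        margo_addr_self(mid, &self);
        margo_addr_to_string(mid, addr_str, &len, self);
        margo_addr_free(mid, self);
        printf("server listening on %s\n", addr_str);

        margo_wait_for_finalize(mid);   /* run until killed */
    } else if (argc >= 3 && strcmp(argv[1], "client") == 0) {
        /* client: argv[2] is the server address printed above,
         * e.g. na+sm://3457/0 */
        margo_instance_id mid = margo_init("na+sm", MARGO_CLIENT_MODE, 1, 1);
        if (mid == MARGO_INSTANCE_NULL)
            return 1;

        hg_addr_t svr;
        if (margo_addr_lookup(mid, argv[2], &svr) == HG_SUCCESS)
            margo_addr_free(mid, svr);
        margo_finalize(mid);
    }
    return 0;
}

Launching the client side many times in parallel (e.g., with mpirun -n 192 -ppn 32 as above) against a single server instance might be enough to trigger the na_sm progress crash if it really is a Mercury-level problem.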

sandrain commented 4 years ago

These are the versions that I am using:

I will see if I can test with a separate example.

MichaelBrim commented 4 years ago

@sandrain It would be good to see if you can reproduce this failure using write or writeread. The sysio examples are close to being deprecated at this point.

Another thing I wanted to mention: the "Address already in use" error typically occurs because a previous server run failed and left behind the mercury domain socket. On Summitdev/Summit, we run clean_job_nodes.bash before starting the servers to remove any cruft left over from a previous bad run. You may want to do something similar in your TechInt testbed runs.
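
To illustrate the mechanism (this is a generic standalone example, not Mercury code, and the socket path is made up): an AF_UNIX socket cannot bind() to a path whose file already exists, which is exactly what a crashed server leaves behind, so the stale file has to be removed before the next run.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(void)
{
    const char* path = "/tmp/example_sock";   /* hypothetical socket path */

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    struct sockaddr_un sa;
    memset(&sa, 0, sizeof(sa));
    sa.sun_family = AF_UNIX;
    strncpy(sa.sun_path, path, sizeof(sa.sun_path) - 1);

    /* Without this unlink(), a second run fails with EADDRINUSE because
     * the socket file from the previous (possibly crashed) run is still
     * there; removing such leftovers beforehand is the point of running
     * a cleanup script like clean_job_nodes.bash. */
    unlink(path);

    if (bind(fd, (struct sockaddr*)&sa, sizeof(sa)) < 0)
        perror("bind");
    else
        printf("bound %s\n", path);

    close(fd);
    return 0;
}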

sandrain commented 4 years ago

This doesn't seem to be a problem in the UnifyFS code but rather in the RPC library, since the problem cannot be reproduced in another environment.