Closed: @sandrain closed this issue 4 years ago.
Additionally, when many clients try to connect to the server (e.g., -ppn=32), the server fails with a segmentation fault. For instance, I launched server daemons on 6 servers, waited for their initialization, and then launched sysio-write-static with mpirun -n 192 -ppn 32 ... (each machine has 32 cores). Most of the time, the server is killed by a segmentation fault:
(unifyfs) [root@rage17 UnifyFS]$ !gdb
gdb --pid=$(pidof unifyfsd | awk '{print $1}')
Attaching to process 9015
[New LWP 9016]
[New LWP 9017]
[New LWP 9018]
[New LWP 9019]
[New LWP 9020]
[New LWP 9021]
[New LWP 9022]
[New LWP 9023]
[New LWP 9024]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fbea674384d in nanosleep () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_7.2.x86_64 leveldb-1.12.0-11.el7.x86_64 libcom_err-1.42.9-16.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libgfortran-4.8.5-39.el7.x86_64 libquadmath-4.8.5-39.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 libstdc++-4.8.5
(gdb) c
Continuing.
[New Thread 0x7fbe917fb700 (LWP 9145)]
[New Thread 0x7fbe90ffa700 (LWP 9146)]
Thread 2 "unifyfsd" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fbea3788700 (LWP 9016)]
0x00007fbea78ff4ec in na_sm_addr_free (na_class=0x25f9170, addr=0x7fbe9c0013d0)
at /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/na/na_sm.c:2843
2843 HG_QUEUE_REMOVE(&NA_SM_CLASS(na_class)->accepted_addr_queue, na_sm_addr,
(gdb) bt
#0 0x00007fbea78ff4ec in na_sm_addr_free (na_class=0x25f9170, addr=0x7fbe9c0013d0)
at /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/na/na_sm.c:2843
#1 0x00007fbea78fce86 in na_sm_progress_error (na_class=0x25f9170, poll_addr=0x7fbe9c0013d0, error=16)
at /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/na/na_sm.c:1852
#2 0x00007fbea78fcc45 in na_sm_progress_cb (arg=0x7fbe9c001460, timeout=100, error=16,
progressed=0x7fbea288808f "")
at /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/na/na_sm.c:1796
#3 0x00007fbea7b0c14f in hg_poll_wait (poll_set=0x25f92e0, timeout=100, progressed=0x7fbea28880ff "")
at /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/util/mercury_poll.c:465
#4 0x00007fbea790194a in na_sm_progress (na_class=0x25f9170, context=0x25fbc40, timeout=100)
at /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/na/na_sm.c:3943
#5 0x00007fbea78f63e7 in NA_Progress (na_class=0x25f9170, context=0x25fbc40, timeout=100)
at /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/na/na.c:1413
#6 0x00007fbea7f27d7c in hg_core_progress_na_cb (arg=0x25f9a70, timeout=100, error=0,
progressed=0x7fbea288856f "")
at /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/mercury_core.c:3016
#7 0x00007fbea7b0c14f in hg_poll_wait (poll_set=0x25f91f0, timeout=100, progressed=0x7fbea28885cf "")
at /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/util/mercury_poll.c:465
#8 0x00007fbea7f281e3 in hg_core_progress_poll (context=0x25f9a70, timeout=100)
at /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/mercury_core.c:3258
#9 0x00007fbea7f2aef8 in HG_Core_progress (context=0x25f9a70, timeout=100)
at /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/mercury_core.c:4824
#10 0x00007fbea7f1f007 in HG_Progress (context=0x25f91a0, timeout=100)
at /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/deps/mercury/src/mercury.c:2173
#11 0x00007fbea833c95b in hg_progress_fn (foo=0x27b6820) at src/margo.c:1281
#12 0x00007fbea76e170b in ABTD_thread_func_wrapper_thread ()
from /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/prefix/lib/libabt.so.0
#13 0x00007fbea76e1db1 in make_fcontext ()
from /ccs/techint/home/hs2/projects/UnifyFS/__run/rage/prefix/lib/libabt.so.0
#14 0x0000000000000000 in ?? ()
(gdb)
@sandrain It appears this stack trace is entirely within one of the margo threads (i.e., no Unify code involved). Do you know which version of mercury this is using? Maybe we could create a standalone margo test that reproduces the same behavior and report it upstream.
These are the versions that I use:
I will see if I can test with a separate example.
@sandrain It would be good to see if you can reproduce this failure using write or writeread. The sysio examples are close to being deprecated at this point.
Another thing I wanted to mention: the "Address already in use" error is typically caused by a previous server run that failed and left the mercury domain socket behind. On Summitdev/Summit, we run clean_job_nodes.bash before starting the servers to remove any cruft left over from a previous bad run. You may want to do something similar in your TechInt testbed runs.
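For a testbed run, a pre-launch cleanup step along these lines may help. This is only a sketch: the /tmp/na_sm location is an assumption about where mercury's na_sm plugin keeps its shared-memory sockets, and clean_job_nodes.bash remains the authoritative version.

```shell
# Hypothetical cleanup sketch: remove stale mercury shared-memory state
# left behind by a crashed unifyfsd before starting a new server.
# SM_DIR is an assumption; adjust to wherever na_sm puts its sockets.
SM_DIR="${SM_DIR:-/tmp/na_sm}"
if [ -d "$SM_DIR" ]; then
    rm -rf "$SM_DIR"
    echo "removed stale $SM_DIR"
else
    echo "no stale state at $SM_DIR"
fi
```

Running this on every node before launching unifyfsd should avoid the "Address already in use" failure after a bad run.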
This doesn't seem to be a problem in the UnifyFS code but rather in the RPC library, since the problem cannot be reproduced in another environment.
System information
Describe the problem you're observing
RPC initialization occasionally fails. This seems to happen when client applications are launched multiple times against a single server instance. It occasionally fails with:
UnifyFS doesn't seem to check for this failure and continues its execution.
Describe how to reproduce the problem
Launch a server, then run one of the client applications (in examples) multiple times. I haven't figured out how to reproduce this deterministically.
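One way to hunt for a deterministic trigger is to loop client runs against a single long-lived server and stop at the first failure. This is a sketch, not the project's test harness: CLIENT_CMD is a placeholder for the actual mpirun invocation of the example program.

```shell
# Repeatedly run the client against one server instance and stop on the
# first failing run. CLIENT_CMD defaults to a no-op so the sketch runs
# standalone; substitute the real launch, e.g. an mpirun command, in practice.
CLIENT_CMD="${CLIENT_CMD:-true}"
i=1
while [ "$i" -le 20 ]; do
    if ! $CLIENT_CMD; then
        echo "client run $i failed" >&2
        break
    fi
    i=$((i + 1))
done
[ "$i" -gt 20 ] && echo "all 20 client runs succeeded"
```

If the failure is load-dependent, varying the client rank count between iterations may narrow down the trigger faster than repeating one configuration.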
Include any warnings, errors, or relevant debugging data
The output from the client, with sysio-write-static: