CESNET / netopeer2

NETCONF toolset
BSD 3-Clause "New" or "Revised" License
291 stars 186 forks source link

Netopeer2 multi-thread dead lock with "[ERR]: SR: Locking a mutex failed (sr_subscr_del_id: Connection timed out)." #1491

Open starx1000 opened 8 months ago

starx1000 commented 8 months ago

hi, netopeer2-version Version 2.1.71 ( latest release) , sysrepo version 2.2.105. Tried with devel branch today, the issue is still there.

when multiple client(e.g 8) connects to netopeer2-server and use established-sub to subscribe for events (multiple streams e.g 10).

After all client connected and subscription established, kill 7 clients. then netopeer2-server will get dead with log prints like "[ERR]: SR: Locking a mutex failed (sr_subscr_del_id: Connection timed out)."

This issue will happen when the clients >= 2, it is 100% rate happen when have more clients connected.

stack is like below:

Thread 8 (Thread 0x7fe4939ea6c0 (LWP 24) "netopeer2-serve"):
#0  0x00007fe4974efa17 in select () from /lib64/libc.so.6
#1  0x00007fe49761ee05 in sr_shmsub_listen_thread (arg=0x7fe48802b400) at /usr/src/debug/sysrepo-2.2.105-1.fc38.x86_64/src/shm_sub.c:4753
#2  0x00007fe4974748c6 in start_thread () from /lib64/libc.so.6
#3  0x00007fe4974fa710 in clone3 () from /lib64/libc.so.6

Thread 7 (Thread 0x7fe4941ee6c0 (LWP 23) "netopeer2-serve"):
#0  0x00007fe49747116b in __futex_abstimed_wait_common () from /lib64/libc.so.6
#1  0x00007fe49747adc2 in pthread_rwlock_wrlock@GLIBC_2.2.5 () from /lib64/libc.so.6
#2  0x0000558b24bc0c7a in sub_ntf_terminate_sub (sub=sub@entry=0x7fe4880370a0, ncs=0x7fe484075c80) at /usr/src/debug/netopeer2-2.1.71-1.fc38.x86_64/src/netconf_subscribed_notifications.c:620
#3  0x0000558b24bc45ef in np2srv_rpc_establish_sub_ntf_cb (UNUSED_session=<error reading variable: value has been optimized out>, sub_id=76, sub_id@entry=<error reading variable: value has been optimized out>, notif_type=<error reading variable: value has been optimized out>, notif=<error reading variable: value has been optimized out>, timestamp=<error reading variable: value has been optimized out>, private_data=0x7fe488010f50, private_data@entry=<error reading variable: value has been optimized out>) at /usr/src/debug/netopeer2-2.1.71-1.fc38.x86_64/src/subscribed_notifications.c:130
#4  0x00007fe4975fe40b in sr_notif_call_callback (ev_sess=0x7fe4840f62b0, cb=0x0, tree_cb=<optimized out>, private_data=0x7fe488010f50, notif_type=notif_type@entry=SR_EV_NOTIF_TERMINATED, sub_id=76, notif_op=0x0, notif_ts=0x7fe4941ed910) at /usr/src/debug/sysrepo-2.2.105-1.fc38.x86_64/src/subscr.c:1839
#5  0x00007fe4975fe78f in sr_subscr_notif_sub_del (subscr=0x7fe48802b400, sub_id=76, notif_ev=SR_EV_NOTIF_TERMINATED) at /usr/src/debug/sysrepo-2.2.105-1.fc38.x86_64/src/subscr.c:570
#6  0x00007fe497625dbf in sr_notif_sub_del.constprop.0 (subscr=0x7fe48802b400, idx1=0, idx2=1, notif_ev=SR_EV_NOTIF_TERMINATED, has_subs_lock=SR_LOCK_READ) at /usr/src/debug/sysrepo-2.2.105-1.fc38.x86_64/src/subscr.c:1280
#7  0x00007fe4975e04c1 in sr_subscr_del_id (sub_id=76, subscr=0x7fe48802b400) at /usr/src/debug/sysrepo-2.2.105-1.fc38.x86_64/src/subscr.c:1628
#8  sr_unsubscribe_sub (sub_id=76, subscription=0x7fe48802b400) at /usr/src/debug/sysrepo-2.2.105-1.fc38.x86_64/src/sysrepo.c:4752
#9  sr_unsubscribe_sub (subscription=0x7fe48802b400, sub_id=76) at /usr/src/debug/sysrepo-2.2.105-1.fc38.x86_64/src/sysrepo.c:4739
#10 0x0000558b24bc0d6c in sub_ntf_terminate_sub (sub=0x7fe4880370a0, ncs=ncs@entry=0x7fe484075c80) at /usr/src/debug/netopeer2-2.1.71-1.fc38.x86_64/src/netconf_subscribed_notifications.c:575
#11 0x0000558b24bc0fe2 in np2srv_sub_ntf_session_destroy (ncs=0x7fe484075c80) at /usr/src/debug/netopeer2-2.1.71-1.fc38.x86_64/src/netconf_subscribed_notifications.c:233
#12 0x0000558b24bb8413 in np2srv_del_session_cb (session=0x7fe484075c80) at /usr/src/debug/netopeer2-2.1.71-1.fc38.x86_64/src/main.c:113
#13 0x0000558b24bb89ee in worker_thread (arg=0x558b2594ecb0) at /usr/src/debug/netopeer2-2.1.71-1.fc38.x86_64/src/main.c:1062
#14 0x00007fe4974748c6 in start_thread () from /lib64/libc.so.6
#15 0x00007fe4974fa710 in clone3 () from /lib64/libc.so.6v

Thread 6 (Thread 0x7fe4949ef6c0 (LWP 22) "netopeer2-serve"):
#0  0x00007fe49747116b in __futex_abstimed_wait_common () from /lib64/libc.so.6
#1  0x00007fe497478276 in __pthread_mutex_clocklock_common () from /lib64/libc.so.6
#2  0x00007fe4975f8dcd in sr_sub_rwlock (rwlock=0x7fe48802b420, timeout_abs=0x7fe4949eea30, mode=SR_LOCK_READ, cid=4, func=0x7fe49762e040 <__func__.1.lto_priv.2> "sr_subscr_del_id", cb=0x0, cb_data=0x0, has_mutex=0) at /usr/src/debug/sysrepo-2.2.105-1.fc38.x86_64/src/common.c:2496
#3  0x00007fe4975e01af in sr_rwlock (cb_data=0x0, cb=0x0, func=0x7fe49762e040 <__func__.1.lto_priv.2> "sr_subscr_del_id", cid=<optimized out>, mode=SR_LOCK_READ, timeout_ms=30000, rwlock=0x7fe48802b420) at /usr/src/debug/sysrepo-2.2.105-1.fc38.x86_64/src/common.c:2679
--Type <RET> for more, q to quit, c to continue without paging--
#4  sr_subscr_del_id (sub_id=88, subscr=0x7fe48802b400) at /usr/src/debug/sysrepo-2.2.105-1.fc38.x86_64/src/subscr.c:1567
#5  sr_unsubscribe_sub (sub_id=88, subscription=0x7fe48802b400) at /usr/src/debug/sysrepo-2.2.105-1.fc38.x86_64/src/sysrepo.c:4752
#6  sr_unsubscribe_sub (subscription=0x7fe48802b400, sub_id=88) at /usr/src/debug/sysrepo-2.2.105-1.fc38.x86_64/src/sysrepo.c:4739
#7  0x0000558b24bc0d6c in sub_ntf_terminate_sub (sub=0x7fe488037460, ncs=ncs@entry=0x558b2590f270) at /usr/src/debug/netopeer2-2.1.71-1.fc38.x86_64/src/netconf_subscribed_notifications.c:575
#8  0x0000558b24bc0fe2 in np2srv_sub_ntf_session_destroy (ncs=0x558b2590f270) at /usr/src/debug/netopeer2-2.1.71-1.fc38.x86_64/src/netconf_subscribed_notifications.c:233
#9  0x0000558b24bb8413 in np2srv_del_session_cb (session=0x558b2590f270) at /usr/src/debug/netopeer2-2.1.71-1.fc38.x86_64/src/main.c:113
#10 0x0000558b24bb89ee in worker_thread (arg=0x558b259583a0) at /usr/src/debug/netopeer2-2.1.71-1.fc38.x86_64/src/main.c:1062
#11 0x00007fe4974748c6 in start_thread () from /lib64/libc.so.6
#12 0x00007fe4974fa710 in clone3 () from /lib64/libc.so.6

Thread 5 (Thread 0x7fe4951f36c0 (LWP 21) "netopeer2-serve"):
#0  0x00007fe4974efa17 in select () from /lib64/libc.so.6
#1  0x00007fe49761ee05 in sr_shmsub_listen_thread (arg=0x558b25783280) at /usr/src/debug/sysrepo-2.2.105-1.fc38.x86_64/src/shm_sub.c:4753
#2  0x00007fe4974748c6 in start_thread () from /lib64/libc.so.6
#3  0x00007fe4974fa710 in clone3 () from /lib64/libc.so.6

Thread 4 (Thread 0x7fe495a016c0 (LWP 20) "netopeer2-serve"):
#0  0x00007fe4974efa17 in select () from /lib64/libc.so.6
#1  0x00007fe49761ee05 in sr_shmsub_listen_thread (arg=0x558b258f1a60) at /usr/src/debug/sysrepo-2.2.105-1.fc38.x86_64/src/shm_sub.c:4753
#2  0x00007fe4974748c6 in start_thread () from /lib64/libc.so.6
#3  0x00007fe4974fa710 in clone3 () from /lib64/libc.so.6

Thread 3 (Thread 0x7fe49620a6c0 (LWP 19) "netopeer2-serve"):
#0  0x00007fe4974efa17 in select () from /lib64/libc.so.6
#1  0x00007fe49761ee05 in sr_shmsub_listen_thread (arg=0x558b25796e20) at /usr/src/debug/sysrepo-2.2.105-1.fc38.x86_64/src/shm_sub.c:4753
#2  0x00007fe4974748c6 in start_thread () from /lib64/libc.so.6
#3  0x00007fe4974fa710 in clone3 () from /lib64/libc.so.6

Thread 2 (Thread 0x7fe496a136c0 (LWP 18) "netopeer2-serve"):
#0  0x00007fe4974efa17 in select () from /lib64/libc.so.6
#1  0x00007fe49761ee05 in sr_shmsub_listen_thread (arg=0x558b25787210) at /usr/src/debug/sysrepo-2.2.105-1.fc38.x86_64/src/shm_sub.c:4753
#2  0x00007fe4974748c6 in start_thread () from /lib64/libc.so.6
#3  0x00007fe4974fa710 in clone3 () from /lib64/libc.so.6

Thread 1 (Thread 0x7fe496a78a00 (LWP 17) "netopeer2-serve"):
#0  0x00007fe49747116b in __futex_abstimed_wait_common () from /lib64/libc.so.6
#1  0x00007fe49747adc2 in pthread_rwlock_wrlock@GLIBC_2.2.5 () from /lib64/libc.so.6
#2  0x0000558b24bc0f87 in np2srv_sub_ntf_session_destroy (ncs=0x7fe48c01c110) at /usr/src/debug/netopeer2-2.1.71-1.fc38.x86_64/src/netconf_subscribed_notifications.c:228
--Type <RET> for more, q to quit, c to continue without paging--
#3  0x0000558b24bb8413 in np2srv_del_session_cb (session=0x7fe48c01c110) at /usr/src/debug/netopeer2-2.1.71-1.fc38.x86_64/src/main.c:113
#4  0x0000558b24bb89ee in worker_thread (arg=0x558b259077c0) at /usr/src/debug/netopeer2-2.1.71-1.fc38.x86_64/src/main.c:1062
#5  0x0000558b24bb1b87 in main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/netopeer2-2.1.71-1.fc38.x86_64/src/main.c:1413

can you please help for this? thank you very much.

michalvasko commented 8 months ago

I did reproduce some problems but since I am working on a major refactoring of these subscriptions, it does not make much sense fixing this. Once I am done, I will test this use-case as well but you will have to wait for several weeks, at least, until that,

starx1000 commented 8 months ago

That is ok, thank you!
before that I plan to turn off multi-threads by "-DTHREAD_COUNT=1 ". Looks like this will workaround the issue. Do you see any side effects for "-DTHREAD_COUNT=1 "?

michalvasko commented 8 months ago

Yes, no parallel processing of RPCs/accepting new sessions. But other than that it should work fine.

starx1000 commented 5 months ago

@michalvasko I see new versions of netopeer2 is published, is this refactoring of subscriptions done in the latest published version Version 2.2.13?

thank you in advance.

michalvasko commented 5 months ago

Yes, the subscriptions were refactored so you can try it again. If you encounter any problems, please include a script or some way of reproducing the scenario.

starx1000 commented 5 months ago

OK, thank you. I will try and update you result..