Closed Krisscut closed 3 months ago
Seems like a really complex use case with a difficult reproducer, so I am not sure I can do anything about this, definitely not without more information. It would be great if you could get the backtrace or valgrind output for the crash
2023-10-10 06:05:18,399 INFO exited: netopeer2-server (terminated by SIGSEGV (core dumped); not expected)
but a core dump should be fine, too.
Yes, it's quite the edge-case scenario indeed, and it's not reproducible 100% of the time, but it still happens occasionally.
I attached it to this message.
I failed to load the core dump properly for some reason, so please provide the stack trace of all the threads at the moment of the crash, or simply the valgrind output (if you manage to reproduce it with it).
Core dumps are not portable; a core dump is just an image of the memory of the process. Without all the libs in the exact same places (or a lot of manual work), it's not usable.
Naturally, but even using the binary from the archive did not help much and provided almost no debugging information.
It might be because it wasn't compiled in debug mode or something; I will check the build process to see if I can enable debug options and get the full backtrace.
It's a dynamically linked binary and the dynamic libraries are missing, so there's going to be a ton of symbols missing. Anyway, rather than wasting the time here with sending binaries around, why don't you run the server under a debugger and print out the backtrace once it crashes?
Yes, that's the plan. As stated above, I will check the compilation mode to ensure I get the full backtrace, then I will try to reproduce either under the debugger, or at least by inspecting the core dump in the environment where the server was running. (I no longer had access to it; I only had the binary along with the core dump.)
I will get back to you when I have more in-depth traces, thanks for the feedback.
The issue was reproduced on two occasions; I shared both traces in the zip files here. (One is the backtrace of the crash, the other the log at the moment of the crash.) s183.zip s139.zip
One more piece of information about the traces: there are two pieces of software that have registered callbacks on the disconnection notification: orusw, which uses the sysrepo C API, and UserManagement, which is a Python-based service using sysrepo-python (sysrepo-python-1.5.0).
Thanks for the effort, I learned why the crashes occur.
In s139, the problem is at
2023-10-16 13:10:42,343 DEBG 'netopeer2-server' stdout output:
[ERR]: SR: Waiting on a conditional variable failed (sr_notif_sub_del: Connection timed out).
which causes a notification subscription not to be removed; it continues to work even after the NETCONF session is terminated, and on the next notification it crashes because it tries to access the session. Quite straightforward, but I have no way of knowing why the subscription could not be removed, which is the actual problem that needs to be fixed.
For s183 it seems more complicated, because there are a lot of strange errors saying an invalid argument is being used, presumably again a terminated NETCONF/sysrepo session. But there is no way of knowing how such a situation occurred. Also, there is not just a leftover notification subscription but a Call Home thread of a session as well.
Your own subscriptions to the notification should not matter, but without more information about the initial problems, I am unable to fix anything. What should be fairly simple is for you to provide logs from further in the past, before the crashes; there may be some hints as to what is happening. Other than that, I am not sure how to progress. Ideally, you could write a script or some example that reproduces the problem, but it seems too specific and too affected by data races for this to be possible without substantial effort.
Nevertheless, there are some notification subscription changes prepared, which are actually likely to fix this because the whole architecture will be split and hence simplified on the netopeer2 end.
For s183 it seems more complicated because there are a lot of strange errors saying an invalid argument is being used, presumably again a terminated NETCONF/sysrepo session
Yes, I noticed those as well, but I don't think they are related to the crash.
They start a little earlier, and I suspect they come from this code snippet, where the code tries to periodically fetch the number of subscribers to supervision:
Indeed, I guess the session became disconnected/invalid in the meantime.
Scrolling up a bit I see this error that could explain why the session is invalid:
[ERR] Recursive reader limit 255 reached, possibly because of missing unlocks.
Regarding the sessions, do you know if the session count only increments, or can it also decrement, for instance when a session is closed?
Because at the moment the code opens sessions a little bit everywhere; for instance, each time a callback is called it opens a session (and hopefully doesn't forget to close it). Due to this error, we were thinking of maybe centralizing to one session that is reused, instead of the open/close pattern. Would that be a better approach?
What should be fairly simple is if you could provide logs further in the past before the crashes, there may be some hints as to what is happening.
I attached the full log file supervisor_rusim.zip
Nevertheless, there are some notification subscription changes prepared, which are actually likely to fix this because the whole architecture will be split and hence simplified on the netopeer2 end.
Great! Is this on a specific branch? I could try to run on top of these changes to see how it goes.
[ERR] Recursive reader limit 255 reached, possibly because of missing unlocks.
This is a problem you should definitely fix, and it is caused by you not freeing subscribers in the snippet you posted.
Great! Is this on a specific branch? I could try to run on top of these changes to see how it goes.
It is not public yet and is still changing so I am actually not sure it will affect your problem in any way.
This is a problem you should definitely fix and it is caused by you not freeing subscribers in the snippet you posted.
Based on the error message, could it also be linked to a session being started but not stopped? I have set up valgrind and will try to fix the issues it reports.
Also, what do you think of the approach I mentioned previously?
Regarding the sessions, do you know if the session count is only incrementing or can it decrement as well for instance when a session is closed ? Because at the moment in the code, sessions are opened a little bit everywhere, for instance each time a callback is called it opens a session (and hopefully don't forget to close it). Due to this error, we were thinking of maybe centralizing to one session that is reused instead of the open/close pattern. Would that be a better approach ?
is it better to have centralized sessions that are reused in the callbacks, or should we open/close new ones in the callbacks?
It is not public yet and is still changing so I am actually not sure it will affect your problem in any way.
Let me know when it is public; I can rerun and see how it goes with the changes. Meanwhile I will also work on my side to try to clean up some errors.
Creating short-lived sessions repeatedly is not a problem at all, they are cheap to make, unlike connections.
Let me know when it is public, I can rerun and see how it goes with the changes.
I am making progress but at least a week or 2 are still needed.
Not sure if it's the same issue, but I encounter the same failed assertion after long runs of regression tests, albeit with a different backtrace, and maybe this could be useful for the analysis. It does not always happen on the same test, so this should be a race condition. The failures happen randomly on a relatively small set of tests, but only if I run a whole set of (successful) tests first.
I was able to run valgrind, and this is the output:
netopeer2-server: /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/sysrepo/src/common.c:2875: sr_rwrelock: Assertion `!rwlock->readers[0] && !rwlock->upgr && (rwlock->writer == cid)' failed.
==1777512==
==1777512== Process terminating with default action of signal 6 (SIGABRT)
==1777512== at 0x4D569FC: __pthread_kill_implementation (pthread_kill.c:44)
==1777512== by 0x4D569FC: pthread_kill_internal (pthread_kill.c:78)
==1777512== by 0x4D569FC: pthread_kill@@GLIBC_2.34 (pthread_kill.c:89)
==1777512== by 0x4D02475: raise (raise.c:26)
==1777512== by 0x4CE8884: abort (abort.c:100)
==1777512== by 0x4CE871A: assert_fail_base.cold (assert.c:92)
==1777512== by 0x4CF9E95: __assert_fail (assert.c:101)
==1777512== by 0x4C2B5ED: sr_rwrelock (common.c:2875)
==1777512== by 0x4C6382C: sr_shmmod_lock (shm_mod.c:1180)
==1777512== by 0x4C63FC5: sr_shmmod_modinfo_wrlock_downgrade (shm_mod.c:1399)
==1777512== by 0x4C1A80A: sr_changes_notify_store (sysrepo.c:3815)
==1777512== by 0x4C1AB5C: sr_apply_changes (sysrepo.c:3887)
==1777512== by 0x115FB6: np2srv_rpc_editconfig_cb (netconf.c:305)
==1777512== by 0x4C70AA4: sr_shmsub_rpc_listen_call_callback (shm_sub.c:4103)
I am available to test any hypothesis for a fix
The main question is what sysrepo version are you using?
latest sysrepo 2.2.150 (netopeer2 2.2.13)
I am gathering some differences in behavior between this and netopeer 2.1.71, I'll let you know
I should be fine with a way to reproduce it, even if it is a script running lots of tests. But it may be enough if you use gdb to print the cause of the assert and the backtrace of all the threads at that moment.
A straightforward way to reproduce it is rather difficult, because when re-running only the tests that had failed, they then usually pass. Furthermore, the assertion mentioned is only one of the possible outcomes. More often, netopeer2 seems to shut down or become unresponsive on some NETCONF sessions, and for a while a netopeer2-cli started alongside also remains blocked on connect. I will try to attach a debugger until I am able to get something.
So far I would say 2.1.71 runs properly unless I find more, yet I only tried once.
Netopeer2 2.2.13 fails on a set of tests that do interact with RPCs (e.g. sending from code or from a client a o-ran-operations/reset) or deal internally with the running datastore (e.g. energy saving on O-RU object has to set array carriers' active flag, which is config true, to SLEEP or back to ACTIVE).
At this very moment I am doing a run with the debugger attached; the errors as seen by the test client are a bit different, but always on the same tests. Now it seems stuck on a mutex, and a netopeer2-cli run alongside, after a few weird errors, can't connect anymore.
The debugger is stuck in sys_futex_wait:
(gdb) info threads
  Id   Target Id                                            Frame
* 1    Thread 0x7ff66b128240 (LWP 2441681) "netopeer2-serve" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  2    Thread 0x7ff66afe7640 (LWP 2441682) "netopeer2-serve" futex_wait (private=128, expected=, futex_word=0x7ff66b0b5000) at ../sysdeps/nptl/futex-internal.h:146
  3    Thread 0x7ff66a7e6640 (LWP 2441683) "netopeer2-serve" 0x00007ff66c44abcf in __GI___poll (fds=0x7ff66a7e58e8, nfds=1, timeout=10000) at ../sysdeps/unix/sysv/linux/poll.c:29
  4    Thread 0x7ff669fe5640 (LWP 2441684) "netopeer2-serve" 0x00007ff66c44abcf in __GI___poll (fds=0x7ff669fe48e8, nfds=1, timeout=10000) at ../sysdeps/unix/sysv/linux/poll.c:29
  5    Thread 0x7ff6697e4640 (LWP 2441685) "netopeer2-serve" 0x00007ff66c44abcf in __GI___poll (fds=0x7ff6697e38e8, nfds=1, timeout=10000) at ../sysdeps/unix/sysv/linux/poll.c:29
  6    Thread 0x7ff668fe3640 (LWP 2441686) "netopeer2-serve" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  7    Thread 0x7ff6687e2640 (LWP 2441687) "netopeer2-serve" 0x00007ff66c4177f8 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=0x7ff6687e18a0, rem=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
  8    Thread 0x7ff667f71640 (LWP 2467396) "netopeer2-serve" 0x00007ff66c44abcf in __GI___poll (fds=0x7ff667f708e8, nfds=1, timeout=10000) at ../sysdeps/unix/sysv/linux/poll.c:29
(gdb) bt
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007ff66c5c9046 in sys_futex_wait (uaddr=0x7ff66b0c4028, expected=203, clockid=1, timeout=0x7ffc1a0c6590) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/sysrepo/src/sr_cond/sr_cond_futex.c:72
#2  0x00007ff66c5c913c in sr_cond_wait_ (cond=0x7ff66b0c4028, mutex=0x7ff66b0c4000, clockid=1, timeout_abs=0x7ffc1a0c6590) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/sysrepo/src/sr_cond/sr_cond_futex.c:131
#3  0x00007ff66c5c91ef in sr_cond_clockwait (cond=0x7ff66b0c4028, mutex=0x7ff66b0c4000, clockid=1, timeout_abs=0x7ffc1a0c6590) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/sysrepo/src/sr_cond/sr_cond_futex.c:156
#4  0x00007ff66c5bbbcb in _sr_shmsub_notify_wait_wr (sub_shm=0x7ff66b0c4000, event=SR_SUB_EV_RPC, request_id=53, expected_ev=SR_SUB_EV_ERROR, clear_ev_on_err=0, cid=1, shm_data_sub=0x7ffc1a0c66b0, timeout_abs=0x7ffc1a0c6590, lock_lost=0x7ffc1a0c6640, cb_err_info=0x7ffc1a0c6780) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/sysrepo/src/shm_sub.c:461
#5  0x00007ff66c5bc29e in sr_shmsub_notify_wait_wr (sub_shm=0x7ff66b0c4000, expected_ev=SR_SUB_EV_ERROR, clear_ev_on_err=0, cid=1, shm_data_sub=0x7ffc1a0c66b0, timeout_ms=32000, lock_lost=0x7ffc1a0c6640, cb_err_info=0x7ffc1a0c6780) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/sysrepo/src/shm_sub.c:615
#6  0x00007ff66c5c2553 in sr_shmsub_rpc_notify (conn=0x5592c439ee00, sub_lock=0x7ff66b11cd70, subs=0x7ff66b11cdf8, sub_count=0x7ff66b11ce00, path=0x5592c43ae550 "/ietf-netconf-monitoring:get-schema", input=0x5592c47be0a0, orig_name=0x7ff658072e30 "netopeer2", orig_data=0x7ff658018800, timeout_ms=32000, request_id=0x7ffc1a0c6774, output=0x5592c43de5e8, cb_err_info=0x7ffc1a0c6780) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/sysrepo/src/shm_sub.c:2652
#7  0x00007ff66c5773b8 in _sr_rpc_send_tree (session=0x7ff658043200, mod_info=0x7ffc1a0c6850, path=0x5592c43ae550 "/ietf-netconf-monitoring:get-schema", input=0x5592c47be0a0, input_op=0x5592c47be0a0, timeout_ms=32000, output=0x7ffc1a0c68c8) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/sysrepo/src/sysrepo.c:6141
#8  0x00007ff66c578254 in sr_rpc_send_tree (session=0x7ff658043200, input=0x5592c47be0a0, timeout_ms=32000, output=0x7ffc1a0c68c8) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/sysrepo/src/sysrepo.c:6421
#9  0x00005592c3a14fa2 in np2srv_rpc_cb (rpc=0x5592c47be0a0, ncs=0x7ff658039310) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/netopeer2/src/main.c:376
#10 0x00007ff66c940790 in nc_server_send_reply_io (session=0x7ff658039310, io_timeout=10, rpc=0x5592c46c4950) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/libnetconf2/src/session_server.c:1534
#11 0x00007ff66c9414e4 in nc_ps_poll (ps=0x5592c4795b60, timeout=10, session=0x7ffc1a0c6b20) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/libnetconf2/src/session_server.c:1899
#12 0x00005592c3a17179 in worker_thread (arg=0x5592c4700c40) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/netopeer2/src/main.c:976
#13 0x00005592c3a181ee in main (argc=7, argv=0x7ffc1a0c6dd8) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/netopeer2/src/main.c:1322
(gdb) up
#1  0x00007ff66c5c9046 in sys_futex_wait (uaddr=0x7ff66c9bd028, expected=2227, clockid=1, timeout=0x7ffc1a0c6590) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/sysrepo/src/sr_cond/sr_cond_futex.c:72
(gdb) print *uaddr
$1 = 2227
(gdb) print expected
$2 = 2227
(gdb) print clockid
$3 = 1
(gdb) print *timeout
$4 = {tv_sec = 226077, tv_nsec = 7296958}
Netopeer2-cli, in the meanwhile (initially the server was not stuck yet):
rossellom@tatanka:/home/massimo/Progetti/azcom/ORAN/ReferenceRU/test$ netopeer2-cli
> connect
> get --help
get [--help] [--filter-subtree[=] | --filter-xpath ] [--defaults report-all|report-all-tagged|trim|explicit] [--out ] [--rpc-timeout ]
> get --filter-xpath /o-ran-uplane-conf:user-plane-configuration/rx-array-carriers
cli_send_recv: Timeout for receiving a reply expired.
> get --filter-xpath /o-ran-uplane-conf:user-plane-configuration/rx-array-carriers
nc ERROR: Received a with an unexpected message-id 52 (expected 53).
cli_send_recv: Unexpected reply received - ignoring and waiting for the correct reply.
cli_send_recv: Timeout for receiving a reply expired.
> get --filter-xpath /o-ran-uplane-conf:user-plane-configuration/rx-array-carriers
cli_send_recv: Timeout for receiving a reply expired.
> get --filter-xpath /o-ran-uplane-conf:user-plane-configuration/rx-array-carriers
nc ERROR: Received a with an unexpected message-id 53 (expected 55).
cli_send_recv: Unexpected reply received - ignoring and waiting for the correct reply.
cli_send_recv: Timeout for receiving a reply expired.
> disconnect
> connect
^C
rossellom@tatanka:/home/massimo/Progetti/azcom/ORAN/ReferenceRU/test$ netopeer2-cli
> connect
^C
Given the strange values in the timespec, I went up the stack a bit and may have found a problem in sr_shmsub_notify_many_wait_wr(): timeout_abs is not initialized, and then 30000 ms are added via sr_timeouttime_get:
```c
static sr_error_info_t *
sr_shmsub_notify_many_wait_wr(struct sr_shmsub_many_info_s *notify_subs, uint32_t notify_size, uint32_t notify_count,
        sr_sub_event_t expected_ev, int clear_ev_on_err, sr_cid_t cid, uint32_t timeout_ms)
{
    sr_error_info_t *err_info = NULL, *tmp_err;
    struct sr_shmsub_many_info_s *nsub;
    struct timespec timeout_abs;
    uint32_t i;
    int lock_lost;

    /* compute the timeout */
    sr_timeouttime_get(&timeout_abs, timeout_ms);
```
I'll try to recompile and retry after adding the initialization.
The timeout_abs is not initialized and then 30000 ms get added via sr_timeouttime_get
sr_timeouttime_get() includes initialization, and the "strange" values will most likely be a CLOCK_MONOTONIC timestamp. So I would say this is irrelevant. And I do not care what error occurs exactly; I just need to reproduce any faulty scenario if I am to fix this. Since it is a data race, it makes sense that something a bit different occurs on each run.
Agreed, but the problem is that whenever I attach a debugger, the assert failure doesn't show up, whereas it does if a debugger is not attached. However, clients' requests time out and new connections sometimes fail on SSH authentication.
Here's a sample log where the client finally tries to reconnect:
[INF]: NP: Session 94: thread 0 event new RPC.
[INF]: NP: Session 94: thread 0 event reply error.
[INF]: SR: EV ORIGIN: "/ietf-netconf:get" "rpc" ID 844 priority 0 for 1 subscribers published.
[ERR]: SR: EV ORIGIN: SHM event "rpc" ID 844 processing timed out.
[WRN]: SR: EV ORIGIN: "/ietf-netconf:get" "rpc" ID 844 priority 0 failed (Timeout expired).
[ERR]: SR: User callback failed.
[ERR]: NP: Failed to send an RPC (User callback failed).
[INF]: NP: Session 94: thread 0 event new RPC.
[INF]: NP: Session 94: thread 0 event reply error.
[INF]: SR: EV ORIGIN: "/ietf-netconf:edit-config" "rpc" ID 584 priority 0 for 1 subscribers published.
[ERR]: SR: EV ORIGIN: SHM event "rpc" ID 584 processing timed out.
[WRN]: SR: EV ORIGIN: "/ietf-netconf:edit-config" "rpc" ID 584 priority 0 failed (Timeout expired).
[ERR]: SR: User callback failed.
[ERR]: NP: Failed to send an RPC (User callback failed).
[INF]: NP: Session 94: thread 0 event new RPC.
[INF]: NP: Session 94: thread 0 event reply error.
[INF]: NP: Session 94: thread 0 event new RPC.
[INF]: NP: Session 94: thread 0 event session terminated.
[INF]: LN: Accepted a connection on 0.0.0.0:830 from 127.0.0.1:35306.
[INF]: LN: Received an SSH message "request-service" of subtype "ssh-userauth".
[INF]: LN: Received an SSH message "request-auth" of subtype "publickey".
[WRN]: LN: User's "rossellom" public key doesn't match, trying another.
[INF]: LN: User "rossellom" authenticated.
[INF]: LN: Received an SSH message "request-channel-open" of subtype "session".
[INF]: LN: Received an SSH message "request-channel" of subtype "subsystem".
[INF]: SR: Session 9566 (user "root", CID 1) created.
[INF]: SR: EV ORIGIN: "/ietf-netconf:get" "rpc" ID 845 priority 0 for 1 subscribers published.
[ERR]: SR: Waiting for subscription of "ietf-netconf-notifications" failed, previous event "notif" ID 457 was not processed.
[WRN]: NP: Failed to send a notification (Timeout expired).
[2024/03/12 22:40:44.235107, 1] ssh_socket_exception_callback: Socket exception callback: 1 (0)
[2024/03/12 22:40:44.235120, 1] ssh_socket_exception_callback: Socket error: disconnected
[ERR]: SR: Waiting for subscription of "ietf-netconf-notifications" failed, previous event "notif" ID 457 was not processed.
[WRN]: NP: Failed to send a notification (Timeout expired).
[ERR]: SR: EV ORIGIN: SHM event "rpc" ID 845 processing timed out.
[WRN]: SR: EV ORIGIN: "/ietf-netconf:get" "rpc" ID 845 priority 0 failed (Timeout expired).
[ERR]: SR: User callback failed.
[ERR]: NP: Failed to send an RPC (User callback failed).
[INF]: NP: Session 95: thread 2 event new RPC.
[INF]: NP: Session 95: thread 2 event reply error.
[INF]: SR: EV ORIGIN: "/ietf-netconf:edit-config" "rpc" ID 585 priority 0 for 1 subscribers published.
[ERR]: SR: EV ORIGIN: SHM event "rpc" ID 585 processing timed out.
[WRN]: SR: EV ORIGIN: "/ietf-netconf:edit-config" "rpc" ID 585 priority 0 failed (Timeout expired).
[ERR]: SR: User callback failed.
[ERR]: NP: Failed to send an RPC (User callback failed).
[INF]: NP: Session 95: thread 2 event new RPC.
[INF]: NP: Session 95: thread 2 event reply error.
[INF]: NP: Session 95: thread 2 event new RPC.
[INF]: NP: Session 95: thread 2 event session terminated.
[INF]: LN: Accepted a connection on 0.0.0.0:830 from 127.0.0.1:41386.
[INF]: LN: Received an SSH message "request-service" of subtype "ssh-userauth".
[INF]: LN: Received an SSH message "request-auth" of subtype "publickey".
[WRN]: LN: User's "rossellom" public key doesn't match, trying another.
[INF]: LN: User "rossellom" authenticated.
[INF]: LN: Received an SSH message "request-channel-open" of subtype "session".
[INF]: LN: Received an SSH message "request-channel" of subtype "subsystem".
[INF]: SR: Session 9567 (user "root", CID 1) created.
[INF]: SR: EV ORIGIN: "/ietf-netconf:edit-config" "rpc" ID 586 priority 0 for 1 subscribers published.
[ERR]: SR: Waiting for subscription of "ietf-netconf-notifications" failed, previous event "notif" ID 457 was not processed.
[ERR]: SR: Waiting for subscription of "ietf-netconf-notifications" failed, previous event "notif" ID 457 was not processed.
[WRN]: NP: Failed to send a notification (Timeout expired).
[WRN]: NP: Failed to send a notification (Timeout expired).
[2024/03/12 22:43:55.431914, 1] ssh_socket_exception_callback: Socket exception callback: 1 (0)
[2024/03/12 22:43:55.431930, 1] ssh_socket_exception_callback: Socket error: disconnected
[INF]: LN: Accepted a connection on 0.0.0.0:830 from 127.0.0.1:51930.
[2024/03/12 22:43:55.435561, 1] ssh_socket_exception_callback: Socket exception callback: 1 (0)
[2024/03/12 22:43:55.435584, 1] ssh_socket_exception_callback: Socket error: disconnected
[ERR]: LN: SSH key exchange error (Socket error: disconnected).
Regarding the futex thing, I don't know the topic that well, but I read:
This operation atomically verifies that the futex address still contains the value given, and sleeps awaiting FUTEX_WAKE on this futex address. If the timeout argument is non-NULL, its contents describe the maximum duration of the wait, which is infinite otherwise.
so shouldn't the timespec contain { 30, 0 } (for 30000 ms) rather than adding it to the clock timestamp? I do believe, anyway, that this may only have an impact once the problem has already shown up.
From the man page:
FUTEX_WAIT_BITSET (since Linux 2.6.25)
...
If timeout is not NULL, the structure it points to specifies an absolute timeout for the wait operation. If timeout is NULL, the operation can block indefinitely.
For the last time: this is a dead end, and the shortest way to fix this is for me to be able to reproduce it. Otherwise you are on your own. Also, I am not sure what versions you are using exactly; most likely the latest release of all the projects (mainly sysrepo), but you could also try to use the devel branch, there are quite a lot of changes already.
OK, understood. Since I am short of time too, I'll stick to v2.1.71 (and compatible components, of course), where everything works like a charm. Good luck.
Well, okay. Obviously, I would rather you help me fix this but that is up to you.
Versions:
netopeer2-server: netopeer2-2.1.71
sysrepo: sysrepo-2.2.105
libnetconf2: libnetconf2-2.1.37
libyang: libyang-2.1.111

Hi,
I have a sporadic issue (pretty rare) where the disconnection of a user seems to be the initial trigger for a crash of the netopeer2-server.
In the trace below, after the connection is broken/stopped non-gracefully (a scenario where the user/client crashes and the expectation is that the server is able to recover and continue without issues), this trace starts to appear:
Waiting on a conditional variable failed (sr_notif_sub_del: Connection timed out).
Despite that, netopeer2-server is able to broadcast the notification about the session being terminated, but then it crashes with a segfault. (Unfortunately I don't have the core dump; I will try to reproduce and collect the backtrace if possible.)
Following that crash, there is this log from orusw (our app with callbacks on user disconnection etc. through sysrepo):
Waiting on a conditional variable failed (sr_shmext_notif_sub_stop: Connection timed out).
followed by an assert which crashes the application as well:

orusw: /workspace/Services/3rdParty/sysrepo-2.2.105/src/common.c:2839: sr_rwrelock: Assertion `!rwlock->readers[0] && !rwlock->upgr && (rwlock->writer == cid)' failed.

I guess this is a consequence of netopeer2-server crashing and corrupting sysrepo, but I wanted to include it as well in case it helps finding the root cause of the problem.

Log: