CESNET / netopeer2

NETCONF toolset
BSD 3-Clause "New" or "Revised" License
291 stars 186 forks source link

Netopeer2-server crash after user disconnection #1481

Closed Krisscut closed 3 months ago

Krisscut commented 9 months ago

Versions: Netopeer2-server: netopeer2-2.1.71 sysrepo: sysrepo-2.2.105 libnetconf2: libnetconf2-2.1.37 libyang: libyang-2.1.111

Hi,

I have a sporadic issue (pretty rare) where the disconnection of a user seems to be the initial trigger for a crash of the netopeer2-server.

In the trace below, after the connection is broken/stopped non-gracefully (scenario where the user/client crashes and the expectation is that the server is able to recover and continue without issues), there is this trace that starts to appear: Waiting on a conditional variable failed (sr_notif_sub_del: Connection timed out).

Despite that, netopeer2-server is able to broadcast the notification about the session being terminated, but then it crashes with segfault. (unfortunately I don't have the coredump, I will try to reproduce and collect the backtrace if possible)

Following that crash, there is this log from orusw (our app with callback on user disconnection etc through sysrepo)Waiting on a conditional variable failed (sr_shmext_notif_sub_stop: Connection timed out). followed by an assert which crashes the application as well orusw: /workspace/Services/3rdParty/sysrepo-2.2.105/src/common.c:2839: sr_rwrelock: Assertion!rwlock->readers[0] && !rwlock->upgr && (rwlock->writer == cid)' failed.` I guess this is a consequence of netopeer2-server crashing and corrupting sysrepo, but I wanted to include it as well if it helps finding the root cause of the problem.

Log:

2023-10-10 06:04:21,180 DEBG 'orusw' stdout output: 2023-10-10 06:04:21/INF:UE_APP-OAM/sysrepo[RU01]: UE_APP-OAM/sysrepo, EV LISTEN: "/o-ran-supervision:supervision-watchdog-reset" "rpc" ID 151 priority 0 success (remaining 0 subscribers). 2023-10-10 06:04:21/INF:UE_APP-OAM/sysrepo[RU01]: UE_APP-OAM/sysrepo, Session 3503 (user "root", CID 28) created.

2023-10-10 06:04:21,181 DEBG 'netopeer2-server' stdout output: [INF]: SR: EV ORIGIN: "/o-ran-supervision:supervision-watchdog-reset" "rpc" ID 151 priority 0 succeeded.

2023-10-10 06:04:21,181 DEBG 'orusw' stdout output: 2023-10-10 06:04:21/INF:UE_APP-OAM/orusw[RU01]: airphone-orusim/orusw/o-ran-supervision.c:109 Number of subscribers to supervision is 1

2023-10-10 06:04:21,182 DEBG 'netopeer2-server' stdout output: [INF]: NP: Session 1: thread 2 event new RPC.

2023-10-10 06:04:47,358 DEBG 'netopeer2-server' stdout output: [2023/10/10 06:04:47.358228, 1] ssh_socket_exception_callback: Socket exception callback: 1 (0) [2023/10/10 06:04:47.358337, 1] ssh_socket_exception_callback: Socket error: disconnected

2023-10-10 06:04:47,359 DEBG 'netopeer2-server' stdout output: [ERR]: LN: Session 1: SSH channel poll error (Socket error: disconnected). [INF]: NP: Session 1: thread 1 event session terminated.

2023-10-10 06:04:47,648 DEBG 'netopeer2-server' stdout output: [INF]: LN: Call Home client "NC-Client-1" session terminated. [INF]: LN: Trying to connect via IPv4 to 192.168.216.2:4334.

2023-10-10 06:04:47,648 DEBG 'netopeer2-server' stdout output: [INF]: LN: getsockopt() error (Connection refused).

2023-10-10 06:04:49,648 DEBG 'netopeer2-server' stdout output: [INF]: LN: Trying to connect via IPv4 to 192.168.216.2:4334.

2023-10-10 06:04:49,649 DEBG 'netopeer2-server' stdout output: [INF]: LN: getsockopt() error (Connection refused).

2023-10-10 06:04:51,649 DEBG 'netopeer2-server' stdout output: [INF]: LN: Call Home client "NC-Client-1" endpoint "default-ssh" failed connection attempt limit 3 reached. [INF]: LN: Call Home client "NC-Client-1" endpoint "default-ssh" connecting... [INF]: LN: Trying to connect via IPv4 to 192.168.216.2:4334.

2023-10-10 06:04:51,649 DEBG 'netopeer2-server' stdout output: [INF]: LN: getsockopt() error (Connection refused).

2023-10-10 06:04:53,650 DEBG 'netopeer2-server' stdout output: [INF]: LN: Trying to connect via IPv4 to 192.168.216.2:4334.

2023-10-10 06:04:53,651 DEBG 'netopeer2-server' stdout output: [INF]: LN: getsockopt() error (Connection refused).

2023-10-10 06:04:55,651 DEBG 'netopeer2-server' stdout output: [INF]: LN: Trying to connect via IPv4 to 192.168.216.2:4334.

2023-10-10 06:04:56,152 DEBG 'netopeer2-server' stdout output: [INF]: LN: Timed out after 500 ms (Operation now in progress).

2023-10-10 06:04:58,152 DEBG 'netopeer2-server' stdout output: [INF]: LN: Call Home client "NC-Client-1" endpoint "default-ssh" failed connection attempt limit 3 reached. [INF]: LN: Call Home client "NC-Client-1" endpoint "default-ssh" connecting... [INF]: LN: Trying to connect via IPv4 to 192.168.216.2:4334.

2023-10-10 06:04:58,653 DEBG 'netopeer2-server' stdout output: [INF]: LN: Timed out after 500 ms (Operation now in progress).

2023-10-10 06:05:00,655 DEBG 'netopeer2-server' stdout output: [INF]: LN: Trying to connect the pending socket 49.

2023-10-10 06:05:01,156 DEBG 'netopeer2-server' stdout output: [INF]: LN: Timed out after 500 ms (Operation now in progress).

2023-10-10 06:05:02,374 DEBG 'netopeer2-server' stdout output: [ERR]: SR: Waiting on a conditional variable failed (sr_notif_sub_del: Connection timed out).

2023-10-10 06:05:03,156 DEBG 'netopeer2-server' stdout output: [INF]: LN: Trying to connect the pending socket 49.

2023-10-10 06:05:03,656 DEBG 'netopeer2-server' stdout output: [INF]: LN: Timed out after 500 ms (Operation now in progress).

2023-10-10 06:05:05,657 DEBG 'netopeer2-server' stdout output: [INF]: LN: Call Home client "NC-Client-1" endpoint "default-ssh" failed connection attempt limit 3 reached. [INF]: LN: Call Home client "NC-Client-1" endpoint "default-ssh" connecting...

2023-10-10 06:05:05,657 DEBG 'netopeer2-server' stdout output: [INF]: LN: Trying to connect via IPv4 to 192.168.216.2:4334.

2023-10-10 06:05:06,157 DEBG 'netopeer2-server' stdout output: [INF]: LN: Timed out after 500 ms (Operation now in progress).

2023-10-10 06:05:08,158 DEBG 'netopeer2-server' stdout output: [INF]: LN: Trying to connect the pending socket 49.

2023-10-10 06:05:08,658 DEBG 'netopeer2-server' stdout output: [INF]: LN: Timed out after 500 ms (Operation now in progress).

2023-10-10 06:05:10,658 DEBG 'netopeer2-server' stdout output: [INF]: LN: Trying to connect the pending socket 49.

2023-10-10 06:05:11,159 DEBG 'netopeer2-server' stdout output: [INF]: LN: Timed out after 500 ms (Operation now in progress).

2023-10-10 06:05:13,159 DEBG 'netopeer2-server' stdout output: [INF]: LN: Call Home client "NC-Client-1" endpoint "default-ssh" failed connection attempt limit 3 reached. [INF]: LN: Call Home client "NC-Client-1" endpoint "default-ssh" connecting... [INF]: LN: Trying to connect via IPv4 to 192.168.216.2:4334.

2023-10-10 06:05:13,660 DEBG 'netopeer2-server' stdout output: [INF]: LN: Timed out after 500 ms (Operation now in progress).

2023-10-10 06:05:15,660 DEBG 'netopeer2-server' stdout output: [INF]: LN: Trying to connect the pending socket 49.

2023-10-10 06:05:16,160 DEBG 'netopeer2-server' stdout output: [INF]: LN: Timed out after 500 ms (Operation now in progress).

2023-10-10 06:05:17,375 DEBG 'netopeer2-server' stdout output: [ERR]: SR: Waiting on a conditional variable failed (sr_notif_sub_del: Connection timed out). [INF]: SR: EV ORIGIN: "ietf-netconf-notifications" "notif" ID 41 priority 0 for 5 subscribers published.

2023-10-10 06:05:17,375 DEBG 'netopeer2-server' stdout output: [INF]: NP: Generated new event (netconf-session-end). [INF]: SR: EV LISTEN: "ietf-netconf-notifications" "notif" ID 41 processing.

2023-10-10 06:05:17,376 DEBG 'orusw' stdout output: 2023-10-10 06:05:17/INF:UE_APP-OAM/sysrepo[RU01]: UE_APP-OAM/sysrepo, EV LISTEN: "ietf-netconf-notifications" "notif" ID 41 processing.

2023-10-10 06:05:17,376 DEBG 'orusw' stdout output: 2023-10-10 06:05:17/INF:UE_APP-OAM/orusw[RU01]: airphone-orusim/orusw/ietf-netconf-notifications.c:35 NETCONF Client 192.168.216.2 disconnected 2023-10-10 06:05:17/INF:UE_APP-OAM/orusw[RU01]: airphone-orusim/orusw/ietf-netconf-notifications.c:39 NETCONF Session Termination Reason is other 2023-10-10 06:05:17/INF:UE_APP-OAM/sysrepo[RU01]: UE_APP-OAM/sysrepo, EV LISTEN: "ietf-netconf-notifications" "notif" ID 41 priority 0 success (remaining 2 subscribers).

2023-10-10 06:05:17,376 DEBG 'UserManagement' stderr output: 2023-10-10 06:05:17.376 INFO/UE_APP-OAM/UserManagement: UserManagement.py:208 - callbackUserDisconnection() - Received notification {'username': 'docomo999', 'session-id': 1, 'source-host': '192.168.216.2', 'termination-reason': 'other'} at 1696917917 (type realtime).

2023-10-10 06:05:17,380 DEBG 'UserManagement' stderr output: 2023-10-10 06:05:17.379 DEBUG/UE_APP-OAM/UserManagement: UserManagement.py:224 - callbackUserDisconnection() - Disconnected user is not sudo but None, nothing to do

2023-10-10 06:05:17,397 DEBG fd 90 closed, stopped monitoring <POutputDispatcher at 140189124471584 for <Subprocess at 140189126124880 with name netopeer2-server in state RUNNING> (stdout)> 2023-10-10 06:05:18,399 INFO exited: netopeer2-server (terminated by SIGSEGV (core dumped); not expected) 2023-10-10 06:05:18,399 DEBG received SIGCHLD indicating a child quit 2023-10-10 06:05:22,182 DEBG 'orusw' stdout output: 2023-10-10 06:05:22/INF:UE_APP-OAM/orusw[RU01]: airphone-orusim/orusw/o-ran-supervision.c:122 Expiration of supervision interval 60 (pid = 140704008832576) 2023-10-10 06:05:22/INF:UE_APP-OAM/sysrepo[RU01]: UE_APP-OAM/sysrepo, Session 3515 (user "root", CID 28) created. [WRN] Connection with CID 27 is dead. 2023-10-10 06:05:22/WRN:UE_APP-OAM/sysrepo[RU01]: UE_APP-OAM/sysrepo, Connection with CID 27 is dead.

2023-10-10 06:05:37,184 DEBG 'orusw' stdout output: [ERR] Waiting on a conditional variable failed (sr_shmext_notif_sub_stop: Connection timed out). 2023-10-10 06:05:37/ERR:UE_APP-OAM/sysrepo[RU01]: UE_APP-OAM/sysrepo, Waiting on a conditional variable failed (sr_shmext_notif_sub_stop: Connection timed out). [WRN] Recovering module "o-ran-supervision" notification subscription of CID 27. 2023-10-10 06:05:37/WRN:UE_APP-OAM/sysrepo[RU01]: UE_APP-OAM/sysrepo, Recovering module "o-ran-supervision" notification subscription of CID 27. orusw: /workspace/Services/3rdParty/sysrepo-2.2.105/src/common.c:2839: sr_rwrelock: Assertion `!rwlock->readers[0] && !rwlock->upgr && (rwlock->writer == cid)' failed. 2023-10-10 06:05:37/INF:UE_APP-OAM/orusw[RU01]: airphone-orusim/orusw/o-ran-main.c:232 Handler called for signal Aborted 2023-10-10 06:05:37/INF:UE_APP-OAM/orusw[RU01]: airphone-orusim/orusw/o-ran-main.c:263 Received SIGSEGV or SIGABRT /opt/UE_APP_OAM/service/oran/orusw() [0x451852] /lib64/libc.so.6(+0x54df0) [0x7ff8a2b4fdf0] /lib64/libc.so.6(+0xa154c) [0x7ff8a2b9c54c] /lib64/libc.so.6(raise+0x16) [0x7ff8a2b4fd46] /lib64/libc.so.6(abort+0xd3) [0x7ff8a2b237f3] /lib64/libc.so.6(+0x2871b) [0x7ff8a2b2371b] /lib64/libc.so.6(+0x4dce6) [0x7ff8a2b48ce6] /opt/UE_APP_OAM/lib64/libsysrepo.so.7(+0x24461) [0x7ff8a351b461] /opt/UE_APP_OAM/lib64/libsysrepo.so.7(+0x54fee) [0x7ff8a354bfee] /opt/UE_APP_OAM/lib64/libsysrepo.so.7(+0x30e07) [0x7ff8a3527e07] /opt/UE_APP_OAM/lib64/libsysrepo.so.7(+0x65ac3) [0x7ff8a355cac3] /opt/UE_APP_OAM/lib64/libsysrepo.so.7(sr_notif_send_tree+0x7a6) [0x7ff8a3512e16] /opt/UE_APP_OAM/service/oran/orusw(monitorCallback+0x113) [0x456240] /opt/UE_APP_OAM/service/oran/orusw(sampleThread+0x179) [0x45641b] /lib64/libc.so.6(+0x9f802) [0x7ff8a2b9a802] 2023-10-10 06:05:37/ERR:UE_APP-OAM/orusw[RU01]: airphone-orusim/orusw/o-ran-main.c:283 You can check for coredumps in /opt/dev 2023-10-10 06:05:37/INF:UE_APP-OAM/orusw[RU01]: airphone-orusim/orusw/o-ran-main.c:232 Handler called for signal Aborted

2023-10-10 06:05:37,186 DEBG 'orusw' stdout output:

========== DATA FOR "ietf-netconf-monitoring" "/ietf-netconf-monitoring:netconf-state/ietf-netconf-monitoring:capabilities" request_id=1 REQUESTED =======================

========== DATA FOR "ietf-netconf-monitoring" "/ietf-netconf-monitoring:netconf-state/ietf-netconf-monitoring:capabilities" request_id=2 REQUESTED =======================

2023-10-10 06:05:37,188 DEBG fd 19 closed, stopped monitoring <POutputDispatcher at 140189124470816 for <Subprocess at 140189126124640 with name orusw in state RUNNING> (stdout)> 2023-10-10 06:05:37,188 INFO exited: orusw (exit status 1; not expected) 2023-10-10 06:05:37,188 DEBG received SIGCHLD indicating a child quit

michalvasko commented 9 months ago

Seems like a really complex use-case with a difficult reproducer so I am not sure I can do anything about this, definitely not without more information. It would be great if you could get the backtrace or valgrind output for the crash

2023-10-10 06:05:18,399 INFO exited: netopeer2-server (terminated by SIGSEGV (core dumped); not expected)

but a core dump should be fine, too.

Krisscut commented 9 months ago

Yes it's quite the edge case scenario indeed, and it's not reproductible 100% of the times, but it still happen occasionally.

coredump.zip

I attached it to this message.

michalvasko commented 9 months ago

I failed to load the core dump properly for some reason so please provide the stack trace of all the threads at the moment of the crash or simply the valgrind output (if you manage to reproduce it with it).

jktjkt commented 9 months ago

Coredumps are not portable; it's just a dump of the memory image of the process. Without all the libs in exact same places (or a lot of manual work), it's not usable.

michalvasko commented 9 months ago

Naturally, but even using the binary from the archive did not help much and provided almost no debugging information.

Krisscut commented 9 months ago

It might be because it wasn't compiled in debug mode or something, I will check the build process to see if I can enable debug options & have the full backtrace etc

jktjkt commented 9 months ago

It's a dynamically linked binary and the dynamic libraries are missing, so there's going to be a ton of symbols missing. Anyway, rather than wasting the time here with sending binaries around, why don't you run the server under a debugger and print out the backtrace once it crashes?

Krisscut commented 9 months ago

It's a dynamically linked binary and the dynamic libraries are missing, so there's going to be a ton of symbols missing. Anyway, rather than wasting the time here with sending binaries around, why don't you run the server under a debugger and print out the backtrace once it crashes?

Yes that's the plan, as stated above I will check the compilation mode to ensure to have the full backtrace, then I will try to reproduce either under the debugger, or at least by checking the coredump from where it was running environment. (I didn't have access to it anymore, I only had the binary along with the coredump)

I will get back to you when I have more in depth traces, thanks for the feedback.

Krisscut commented 8 months ago

Issue is reproduced on 2 occasions, I shared both traces in zip file here. (One is the backtrace of the crash, the other the log at the moment of the crash) s183.zip s139.zip

One more information about the traces, there are 2 software that have registered callbacks on the disconnection notification: orusw using the sysrepo c api, and UserManagement which is a python based service using sysrepo-python (sysrepo-python-1.5.0)

michalvasko commented 8 months ago

Thanks for the effort, I learned why the crashes occur.

In s139, the problem is at

2023-10-16 13:10:42,343 DEBG 'netopeer2-server' stdout output:
[ERR]: SR: Waiting on a conditional variable failed (sr_notif_sub_del: Connection timed out).

which causes that a notification subscription is not removed and continues to work even after the NETCONF session is terminated and on the next notification it crashes because it tries to access the session. Quite straightforward but I have no way of knowing why the subscription could be removed, which is the actual problem that needs to be fixed.

For s183 it seems more complicated because there are a lot of strange errors saying an invalid argument is being used, presumably again a terminated NETCONF/sysrepo session. But no way of knowing how such a situation occurred. Also, there is not just a leftover notification subscription but a Call Home thread of a session as well.

Your own subscriptions to the notification should not matter but without more information about the initial problems, I am unable to fix anything. What should be fairly simple is if you could provide logs further in the past before the crashes, there may be some hints as to what is happening. Other than that I am not sure how to progress, ideally if you could write a script or some example that would reproduce the problem but it seems too specific and affected by data races for this to be possible without substantial effort.

Nevertheless, there are some notification subscription changes prepared, which are actually likely to fix this because the whole architecture will be split and hence simplified on the netopeer2 end.

Krisscut commented 8 months ago

or s183 it seems more complicated because there are a lot of strange errors saying an invalid argument is being used, presumably again a terminated NETCONF/sysrepo session

Yes I noticed those as well, but I don't think they are related to the crash. They start a little bit earlier, and I suspect they come from this code snippet where the code tries to periodically fetch the number of subscribers to supervision: image

Indeed I guess the session became disconnected / invalid in the meantime.

Scrolling up a bit I see this error that could explain why the session is invalid: [ERR] Recursive reader limit 255 reached, possibly because of missing unlocks. Regarding the sessions, do you know if the session count is only incrementing or can it decrement as well for instance when a session is closed ? Because at the moment in the code, sessions are opened a little bit everywhere, for instance each time a callback is called it opens a session (and hopefully don't forget to close it). Due to this error, we were thinking of maybe centralizing to one session that is reused instead of the open/close pattern. Would that be a better approach ?

What should be fairly simple is if you could provide logs further in the past before the crashes, there may be some hints as to what is happening.

I attached the full log file supervisor_rusim.zip

Nevertheless, there are some notification subscription changes prepared, which are actually likely to fix this because the whole architecture will be split and hence simplified on the netopeer2 end.

Great ! Is this on a specific branch ? I could try to run on top of these changes to see how it goes.

michalvasko commented 8 months ago

[ERR] Recursive reader limit 255 reached, possibly because of missing unlocks.

This is a problem you should definitely fix and it is caused by you not freeing subscribers in the snippet you posted.

Great ! Is this on a specific branch ? I could try to run on top of these changes to see how it goes.

It is not public yet and is still changing so I am actually not sure it will affect your problem in any way.

Krisscut commented 8 months ago

This is a problem you should definitely fix and it is caused by you not freeing subscribers in the snippet you posted.

Based on the error message, could it also be linked to the session being started but not stopped ? I have setup valgrind and will try to fix the issues reported by it.

Also, what do you think of this approach I mentioned previously ?

Regarding the sessions, do you know if the session count is only incrementing or can it decrement as well for instance when a session is closed ? Because at the moment in the code, sessions are opened a little bit everywhere, for instance each time a callback is called it opens a session (and hopefully don't forget to close it). Due to this error, we were thinking of maybe centralizing to one session that is reused instead of the open/close pattern. Would that be a better approach ?

is it better to have centralized sessions that are reused in the callbacks or we should open/close new ones in the callbacks ?

It is not public yet and is still changing so I am actually not sure it will affect your problem in any way.

Let me know when it is public, I can rerun and see how it goes with the changes. Meanwhile I will also work on my side to try to cleanup some errors.

michalvasko commented 8 months ago

Creating short-lived sessions repeatedly is not a problem at all, they are cheap to make, unlike connections.

Let me know when it is public, I can rerun and see how it goes with the changes.

I am making progress but at least a week or 2 are still needed.

maxrossello commented 4 months ago

Not sure if it's the same issue, but I encounter the same failed assertion after long runs of regression tests, albeit with a different backtrace, and maybe this could be useful for the analysis. It does not happen always on the same test, so this should be a race condition. The failures happen randomly on a relatively reduced set of tests, but only if I run a whole set of (successful) tests before.

I was able to run valgrind, and this is the output:

netopeer2-server: /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/sysrepo/src/common.c:2875: sr_rwrelock: Assertion!rwlock->readers[0] && !rwlock->upgr && (rwlock->writer == cid)' failed. ==1777512== ==1777512== Process terminating with default action of signal 6 (SIGABRT) ==1777512== at 0x4D569FC: __pthread_kill_implementation (pthread_kill.c:44) ==1777512== by 0x4D569FC: pthread_kill_internal (pthread_kill.c:78) ==1777512== by 0x4D569FC: pthread_kill@@GLIBC_2.34 (pthread_kill.c:89) ==1777512== by 0x4D02475: raise (raise.c:26) ==1777512== by 0x4CE8884: abort (abort.c:100) ==1777512== by 0x4CE871A: assert_fail_base.cold (assert.c:92) ==1777512== by 0x4CF9E95: __assert_fail (assert.c:101) ==1777512== by 0x4C2B5ED: sr_rwrelock (common.c:2875) ==1777512== by 0x4C6382C: sr_shmmod_lock (shm_mod.c:1180) ==1777512== by 0x4C63FC5: sr_shmmod_modinfo_wrlock_downgrade (shm_mod.c:1399) ==1777512== by 0x4C1A80A: sr_changes_notify_store (sysrepo.c:3815) ==1777512== by 0x4C1AB5C: sr_apply_changes (sysrepo.c:3887) ==1777512== by 0x115FB6: np2srv_rpc_editconfig_cb (netconf.c:305) ==1777512== by 0x4C70AA4: sr_shmsub_rpc_listen_call_callback (shm_sub.c:4103)`

I am available to test any hypothesis for a fix

michalvasko commented 4 months ago

The main question is what sysrepo version are you using?

maxrossello commented 4 months ago

latest sysrepo 2.2.150 (netopeer2 2.2.13)

I am gathering some differences in behavior between this and netopeer 2.1.71, I'll let you know

michalvasko commented 4 months ago

I should be fine with a way to reproduce it, even if it is a script running lots of tests. But it may be enough if you use gdb to print the cause of the assert and the backtrace of all the threads at that moment.

maxrossello commented 4 months ago

A strightforward way to reproduce it is rather difficult, because when re-running the only tests that had failed, they then usually pass. Furthermore, the assertion mentioned is only one of the possible outcomes. More often, Netopeer2 seems to shut down or get unresponsive on some netconf sessions, and for a while a netopeer2-cli started aside also remains blocked on connect. I will try to attach a debugger until I am able to get something.

So far I would say 2.1.71 runs properly unless I find more, yet I only tried once.

Netopeer2 2.2.13 fails on a set of tests that do interact with RPCs (e.g. sending from code or from a client a o-ran-operations/reset) or deal internally with the running datastore (e.g. energy saving on O-RU object has to set array carriers' active flag, which is config true, to SLEEP or back to ACTIVE).

In this very moment I am doing a run with debugger attached, the outcomes of errors as seen by the test client are a bit different but always on the same tests. Now it seems stuck on a mutex and a netopeer2-cli run aside, after a few weird errors, can't connect anymore.

The debugger is stuck in sys_futex_wait:

(gdb) info threads
  Id   Target Id                                             Frame 
* 1    Thread 0x7ff66b128240 (LWP 2441681) "netopeer2-serve" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  2    Thread 0x7ff66afe7640 (LWP 2441682) "netopeer2-serve" futex_wait (private=128, expected=, futex_word=0x7ff66b0b5000) at ../sysdeps/nptl/futex-internal.h:146
  3    Thread 0x7ff66a7e6640 (LWP 2441683) "netopeer2-serve" 0x00007ff66c44abcf in __GI___poll (fds=0x7ff66a7e58e8, nfds=1, timeout=10000) at ../sysdeps/unix/sysv/linux/poll.c:29
  4    Thread 0x7ff669fe5640 (LWP 2441684) "netopeer2-serve" 0x00007ff66c44abcf in __GI___poll (fds=0x7ff669fe48e8, nfds=1, timeout=10000) at ../sysdeps/unix/sysv/linux/poll.c:29
  5    Thread 0x7ff6697e4640 (LWP 2441685) "netopeer2-serve" 0x00007ff66c44abcf in __GI___poll (fds=0x7ff6697e38e8, nfds=1, timeout=10000) at ../sysdeps/unix/sysv/linux/poll.c:29
  6    Thread 0x7ff668fe3640 (LWP 2441686) "netopeer2-serve" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  7    Thread 0x7ff6687e2640 (LWP 2441687) "netopeer2-serve" 0x00007ff66c4177f8 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=0x7ff6687e18a0, rem=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
  8    Thread 0x7ff667f71640 (LWP 2467396) "netopeer2-serve" 0x00007ff66c44abcf in __GI___poll (fds=0x7ff667f708e8, nfds=1, timeout=10000) at ../sysdeps/unix/sysv/linux/poll.c:29
(gdb) bt
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007ff66c5c9046 in sys_futex_wait (uaddr=0x7ff66b0c4028, expected=203, clockid=1, timeout=0x7ffc1a0c6590) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/sysrepo/src/sr_cond/sr_cond_futex.c:72
#2  0x00007ff66c5c913c in sr_cond_wait_ (cond=0x7ff66b0c4028, mutex=0x7ff66b0c4000, clockid=1, timeout_abs=0x7ffc1a0c6590) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/sysrepo/src/sr_cond/sr_cond_futex.c:131
#3  0x00007ff66c5c91ef in sr_cond_clockwait (cond=0x7ff66b0c4028, mutex=0x7ff66b0c4000, clockid=1, timeout_abs=0x7ffc1a0c6590) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/sysrepo/src/sr_cond/sr_cond_futex.c:156
#4  0x00007ff66c5bbbcb in _sr_shmsub_notify_wait_wr (sub_shm=0x7ff66b0c4000, event=SR_SUB_EV_RPC, request_id=53, expected_ev=SR_SUB_EV_ERROR, clear_ev_on_err=0, cid=1, shm_data_sub=0x7ffc1a0c66b0, timeout_abs=0x7ffc1a0c6590, lock_lost=0x7ffc1a0c6640, cb_err_info=0x7ffc1a0c6780) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/sysrepo/src/shm_sub.c:461
#5  0x00007ff66c5bc29e in sr_shmsub_notify_wait_wr (sub_shm=0x7ff66b0c4000, expected_ev=SR_SUB_EV_ERROR, clear_ev_on_err=0, cid=1, shm_data_sub=0x7ffc1a0c66b0, timeout_ms=32000, lock_lost=0x7ffc1a0c6640, cb_err_info=0x7ffc1a0c6780) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/sysrepo/src/shm_sub.c:615
#6  0x00007ff66c5c2553 in sr_shmsub_rpc_notify (conn=0x5592c439ee00, sub_lock=0x7ff66b11cd70, subs=0x7ff66b11cdf8, sub_count=0x7ff66b11ce00, path=0x5592c43ae550 "/ietf-netconf-monitoring:get-schema", input=0x5592c47be0a0, orig_name=0x7ff658072e30 "netopeer2", orig_data=0x7ff658018800, timeout_ms=32000, request_id=0x7ffc1a0c6774, output=0x5592c43de5e8, cb_err_info=0x7ffc1a0c6780) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/sysrepo/src/shm_sub.c:2652
#7  0x00007ff66c5773b8 in _sr_rpc_send_tree (session=0x7ff658043200, mod_info=0x7ffc1a0c6850, path=0x5592c43ae550 "/ietf-netconf-monitoring:get-schema", input=0x5592c47be0a0, input_op=0x5592c47be0a0, timeout_ms=32000, output=0x7ffc1a0c68c8) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/sysrepo/src/sysrepo.c:6141
#8  0x00007ff66c578254 in sr_rpc_send_tree (session=0x7ff658043200, input=0x5592c47be0a0, timeout_ms=32000, output=0x7ffc1a0c68c8) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/sysrepo/src/sysrepo.c:6421
#9  0x00005592c3a14fa2 in np2srv_rpc_cb (rpc=0x5592c47be0a0, ncs=0x7ff658039310) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/netopeer2/src/main.c:376
#10 0x00007ff66c940790 in nc_server_send_reply_io (session=0x7ff658039310, io_timeout=10, rpc=0x5592c46c4950) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/libnetconf2/src/session_server.c:1534
#11 0x00007ff66c9414e4 in nc_ps_poll (ps=0x5592c4795b60, timeout=10, session=0x7ffc1a0c6b20) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/libnetconf2/src/session_server.c:1899
#12 0x00005592c3a17179 in worker_thread (arg=0x5592c4700c40) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/netopeer2/src/main.c:976
#13 0x00005592c3a181ee in main (argc=7, argv=0x7ffc1a0c6dd8) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/netopeer2/src/main.c:1322

(gdb) up
#1  0x00007ff66c5c9046 in sys_futex_wait (uaddr=0x7ff66c9bd028, expected=2227, clockid=1, timeout=0x7ffc1a0c6590) at /home/massimo/Progetti/azcom/ORAN/ReferenceRU/az-m-plane-5g/sysrepo/src/sr_cond/sr_cond_futex.c:72
(gdb) print *uaddr
$1 = 2227
(gdb) print expected
$2 = 2227
(gdb) print clockid
$3 = 1
(gdb) print *timeout
$4 = {tv_sec = 226077, tv_nsec = 7296958}

Netopeer2-cli, in the meanwhile (initially the server was not stuck yet):

rossellom@tatanka:/home/massimo/Progetti/azcom/ORAN/ReferenceRU/test$ netopeer2-cli
> connect
> get --help
get [--help] [--filter-subtree[=] | --filter-xpath ] [--defaults report-all|report-all-tagged|trim|explicit] [--out ] [--rpc-timeout ]
> get --filter-xpath /o-ran-uplane-conf:user-plane-configuration/rx-array-carriers
cli_send_recv: Timeout for receiving a reply expired.
> get --filter-xpath /o-ran-uplane-conf:user-plane-configuration/rx-array-carriers
nc ERROR: Received a  with an unexpected message-id 52 (expected 53).
cli_send_recv: Unexpected reply received - ignoring and waiting for the correct reply.
cli_send_recv: Timeout for receiving a reply expired.
> get --filter-xpath /o-ran-uplane-conf:user-plane-configuration/rx-array-carriers
cli_send_recv: Timeout for receiving a reply expired.
> get --filter-xpath /o-ran-uplane-conf:user-plane-configuration/rx-array-carriers
nc ERROR: Received a  with an unexpected message-id 53 (expected 55).
cli_send_recv: Unexpected reply received - ignoring and waiting for the correct reply.
cli_send_recv: Timeout for receiving a reply expired.
> disconnect
> connect

^C
rossellom@tatanka:/home/massimo/Progetti/azcom/ORAN/ReferenceRU/test$ netopeer2-cli
> connect
^C

Given the strange values in the timespec I went a bit upwards, and I found maybe a problem in sr_shmsub_notify_many_wait_wr(). The timeout_abs is not initialized and then 30000 ms get added via sr_timeouttime_get

static sr_error_info_t *
sr_shmsub_notify_many_wait_wr(struct sr_shmsub_many_info_s *notify_subs, uint32_t notify_size, uint32_t notify_count,
        sr_sub_event_t expected_ev, int clear_ev_on_err, sr_cid_t cid, uint32_t timeout_ms)
{
    sr_error_info_t *err_info = NULL, *tmp_err;
    struct sr_shmsub_many_info_s *nsub;
    struct timespec timeout_abs;
    uint32_t i;
    int lock_lost;

    /* compute the timeout */
    sr_timeouttime_get(&timeout_abs, timeout_ms);

I'll try to recompile and retry after having added initialization

michalvasko commented 3 months ago

The timeout_abs is not initialized and then 30000 ms get added via sr_timeouttime_get

sr_timeouttime_get() includes initialization and the "strange" values will most likely be a CLOCK_MONOTONIC timestamp. So, I would say this is irrelevant. And I do not care what error occurs exactly, I just need to reproduce any faulty scenario if I am to fix this. Since it is a data race, it makes sense that something a bit different occurs on each run.

maxrossello commented 3 months ago

Agreed, but the problem is that whenever I attach a debugger, it the assert failure doesn't show up, whereas it does if a debugger is not attached. However, clients' requests go timeout and new connections sometimes fail on ssh authentication.

Here's a sample log where the client finally tries to reconnect

[INF]: NP: Session 94: thread 0 event new RPC.
[INF]: NP: Session 94: thread 0 event reply error.
[INF]: SR: EV ORIGIN: "/ietf-netconf:get" "rpc" ID 844 priority 0 for 1 subscribers published.
[ERR]: SR: EV ORIGIN: SHM event "rpc" ID 844 processing timed out.
[WRN]: SR: EV ORIGIN: "/ietf-netconf:get" "rpc" ID 844 priority 0 failed (Timeout expired).
[ERR]: SR: User callback failed.
[ERR]: NP: Failed to send an RPC (User callback failed).
[INF]: NP: Session 94: thread 0 event new RPC.
[INF]: NP: Session 94: thread 0 event reply error.
[INF]: SR: EV ORIGIN: "/ietf-netconf:edit-config" "rpc" ID 584 priority 0 for 1 subscribers published.
[ERR]: SR: EV ORIGIN: SHM event "rpc" ID 584 processing timed out.
[WRN]: SR: EV ORIGIN: "/ietf-netconf:edit-config" "rpc" ID 584 priority 0 failed (Timeout expired).
[ERR]: SR: User callback failed.
[ERR]: NP: Failed to send an RPC (User callback failed).
[INF]: NP: Session 94: thread 0 event new RPC.
[INF]: NP: Session 94: thread 0 event reply error.
[INF]: NP: Session 94: thread 0 event new RPC.
[INF]: NP: Session 94: thread 0 event session terminated.
[INF]: LN: Accepted a connection on 0.0.0.0:830 from 127.0.0.1:35306.
[INF]: LN: Received an SSH message "request-service" of subtype "ssh-userauth".
[INF]: LN: Received an SSH message "request-auth" of subtype "publickey".
[WRN]: LN: User's "rossellom" public key doesn't match, trying another.
[INF]: LN: User "rossellom" authenticated.
[INF]: LN: Received an SSH message "request-channel-open" of subtype "session".
[INF]: LN: Received an SSH message "request-channel" of subtype "subsystem".
[INF]: SR: Session 9566 (user "root", CID 1) created.
[INF]: SR: EV ORIGIN: "/ietf-netconf:get" "rpc" ID 845 priority 0 for 1 subscribers published.
[ERR]: SR: Waiting for subscription of "ietf-netconf-notifications" failed, previous event "notif" ID 457 was not processed.
[WRN]: NP: Failed to send a notification (Timeout expired).
[2024/03/12 22:40:44.235107, 1] ssh_socket_exception_callback:  Socket exception callback: 1 (0)
[2024/03/12 22:40:44.235120, 1] ssh_socket_exception_callback:  Socket error: disconnected
[ERR]: SR: Waiting for subscription of "ietf-netconf-notifications" failed, previous event "notif" ID 457 was not processed.
[WRN]: NP: Failed to send a notification (Timeout expired).
[ERR]: SR: EV ORIGIN: SHM event "rpc" ID 845 processing timed out.
[WRN]: SR: EV ORIGIN: "/ietf-netconf:get" "rpc" ID 845 priority 0 failed (Timeout expired).
[ERR]: SR: User callback failed.
[ERR]: NP: Failed to send an RPC (User callback failed).
[INF]: NP: Session 95: thread 2 event new RPC.
[INF]: NP: Session 95: thread 2 event reply error.
[INF]: SR: EV ORIGIN: "/ietf-netconf:edit-config" "rpc" ID 585 priority 0 for 1 subscribers published.
[ERR]: SR: EV ORIGIN: SHM event "rpc" ID 585 processing timed out.
[WRN]: SR: EV ORIGIN: "/ietf-netconf:edit-config" "rpc" ID 585 priority 0 failed (Timeout expired).
[ERR]: SR: User callback failed.
[ERR]: NP: Failed to send an RPC (User callback failed).
[INF]: NP: Session 95: thread 2 event new RPC.
[INF]: NP: Session 95: thread 2 event reply error.
[INF]: NP: Session 95: thread 2 event new RPC.
[INF]: NP: Session 95: thread 2 event session terminated.
[INF]: LN: Accepted a connection on 0.0.0.0:830 from 127.0.0.1:41386.
[INF]: LN: Received an SSH message "request-service" of subtype "ssh-userauth".
[INF]: LN: Received an SSH message "request-auth" of subtype "publickey".
[WRN]: LN: User's "rossellom" public key doesn't match, trying another.
[INF]: LN: User "rossellom" authenticated.
[INF]: LN: Received an SSH message "request-channel-open" of subtype "session".
[INF]: LN: Received an SSH message "request-channel" of subtype "subsystem".
[INF]: SR: Session 9567 (user "root", CID 1) created.
[INF]: SR: EV ORIGIN: "/ietf-netconf:edit-config" "rpc" ID 586 priority 0 for 1 subscribers published.
[ERR]: SR: Waiting for subscription of "ietf-netconf-notifications" failed, previous event "notif" ID 457 was not processed.
[ERR]: SR: Waiting for subscription of "ietf-netconf-notifications" failed, previous event "notif" ID 457 was not processed.
[WRN]: NP: Failed to send a notification (Timeout expired).
[WRN]: NP: Failed to send a notification (Timeout expired).
[2024/03/12 22:43:55.431914, 1] ssh_socket_exception_callback:  Socket exception callback: 1 (0)
[2024/03/12 22:43:55.431930, 1] ssh_socket_exception_callback:  Socket error: disconnected
[INF]: LN: Accepted a connection on 0.0.0.0:830 from 127.0.0.1:51930.
[2024/03/12 22:43:55.435561, 1] ssh_socket_exception_callback:  Socket exception callback: 1 (0)
[2024/03/12 22:43:55.435584, 1] ssh_socket_exception_callback:  Socket error: disconnected
[ERR]: LN: SSH key exchange error (Socket error: disconnected).

Regarding the futex thing, I don't know the topic that well, but I read:

This operation atomically verifies that the futex address still contains the value given, and sleeps awaiting FUTEX_WAKE on this futex address. If the timeout argument is non-NULL, its contents describe the maximum duration of the wait, which is infinite otherwise.

so shouldn't the timespec contain { 30, 0 } (for 30000 msec) rather than add them up to the clock timestamp? I do believe, anyways, that this may impact only once the problem already showed up.

michalvasko commented 3 months ago

From the man page:

FUTEX_WAIT_BITSET (since Linux 2.6.25)
              ...

              If timeout is not NULL, the structure it points to specifies an absolute timeout for the wait operation.  If timeout is NULL,  the  opera‐
              tion can block indefinitely.

For the last time, this is a dead end and the shortest way to fix this is if I could reproduce it. Otherwise you are on your own. Also, not sure what versions are you using exactly, most likely the latest release of all the projects (mainly sysrepo) but you could also try to use the devel branch, there are quite a lot of changes already.

maxrossello commented 3 months ago

ok understood. Since I am short of time too, I'll stick to v2.1.71 (and compatible components of course) where everything works like a charm. Good luck

michalvasko commented 3 months ago

Well, okay. Obviously, I would rather you help me fix this but that is up to you.