CESNET / netopeer2

NETCONF toolset
BSD 3-Clause "New" or "Revised" License
301 stars 189 forks source link

Edit-config fails due to previous notification processing error #1402

Closed ankit7gup closed 1 year ago

ankit7gup commented 1 year ago

Hi @michalvasko,

We get below error while edit-config operation and pending notification processing error is there, while going back, I saw a very random pending subscriber no. Please suggest.

2023-05-23 04:36:35.325712 info netopeer2-server[2048]: EV LISTEN: "ietf-netconf-notifications" "notif" ID 4 processing. 2023-05-23 04:36:35.326016 info netopeer2-server[2048]: EV LISTEN: "ietf-netconf-notifications" "notif" ID 4 priority 0 success (remaining 4294967295 subscribers).

2023-05-23 04:36:48.612180 info netopeer2-server[2048]: EV ORIGIN: "o-ran-processing-element" "done" ID 1 priority 0 for 1 subscribers published. 2023-05-23 04:36:48.629946 info netopeer2-ser 2023-05-23 04:48:33.483751 err netopeer2-server[2048]: Waiting for subscription of "ietf-netconf-notifications" failed, previous event "notif" ID 4 was not processed. 2023-05-23 04:48:33.483940 err netopeer2-server[2048]: Failed to generate netconf-config-change notification, but changes were applied. 2023-05-23 04:48:33.484394 info netopeer2-server[2048]: EV LISTEN: "/ietf-netconf:edit-config" "rpc" ID 2 priority 0 fail (remaining 1 subscribers). 2023-05-23 04:48:33.484501 err netopeer2-server[2048]: Waiting for subscription of "ietf-netconf-notifications" failed, previous event "notif" ID 4 was not processed. 2023-05-23 04:48:33.484532 warning netopeer2-server[2048]: EV ORIGIN: "/ietf-netconf:edit-config" "rpc" ID 2 priority 0 failed (Timeout expired). 2023-05-23 04:48:33.485004 err netopeer2-server[2048]: User callback failed. 2023-05-23 04:48:33.485038 err netopeer2-server[2048]: Failed to send an RPC (User callback failed).

Thanks!

michalvasko commented 1 year ago

What sysrepo version are you using? Is it a debug build or release? In any case, this should not happen and is a bug. Can you post preceding output as well? It would be ideal if you know how to reproduce it and provide the steps.

ankit7gup commented 1 year ago

Hi Michal, We are using the master latest release with patch for Libyan https://github.com/CESNET/libyang/commit/3ff82dd469432b0d55c348b31ca8575e59b4d5a2.

Please find attached the complete log file for server. syslog-netopeer_logs.txt

This behavior is random. Not observed everytime.

michalvasko commented 1 year ago

The log has shown the problem quite clearly, try the latest sysrepo devel, it should be fixed.

ankit7gup commented 1 year ago

@michalvasko Can you please share the commit if we can apply that as a patch if possible. Thanks!

michalvasko commented 1 year ago

It is seen right after my previous post, I have referenced this issue directly.

ankit7gup commented 1 year ago

It is seen right after my previous post; I have referenced this issue directly. I hope this is the one. https://github.com/sysrepo/sysrepo/commit/664cdb412841b55517d8039aae0dcc78693b4de8

michalvasko commented 1 year ago

Yes, the same link is literally a few posts above.

ankit7gup commented 1 year ago

Hi Michal,

Now the previous invalid value is not coming, but I am getting below errors.

2023-05-30 06:55:12.004581 PUMA info netopeer2-server[2310]: EV LISTEN: "/ietf-netconf:get" "rpc" ID 10 priority 0 success (remaining 0 subscribers). 2023-05-30 06:55:12.004678 PUMA info netopeer2-server[2310]: EV ORIGIN: "/ietf-netconf:get" "rpc" ID 10 priority 0 succeeded. 2023-05-30 06:55:12.006935 PUMA info netopeer2-server[2310]: Session 1: thread 1 event new RPC. 2023-05-30 06:55:12.519897 PUMA info netopeer2-server[2310]: EV LISTEN: "o-ran-sync" "notif" ID 1 processing. 2023-05-30 06:55:12.520903 PUMA info netopeer2-server[2310]: EV LISTEN: "o-ran-sync" "notif" ID 1 priority 0 success (remaining 0 subscribers). 2023-05-30 06:55:12.533122 PUMA info netopeer2-server[2310]: EV LISTEN: "o-ran-sync" "notif" ID 2 processing. 2023-05-30 06:55:12.534368 PUMA info netopeer2-server[2310]: EV LISTEN: "o-ran-sync" "notif" ID 2 priority 0 success (remaining 0 subscribers). 2023-05-30 06:55:12.535335 PUMA info netopeer2-server[2310]: EV LISTEN: "o-ran-fm" "notif" ID 1 processing. 2023-05-30 06:55:12.536381 PUMA info netopeer2-server[2310]: EV LISTEN: "o-ran-fm" "notif" ID 1 priority 0 success (remaining 0 subscribers). 2023-05-30 06:55:13.050644 PUMA info netopeer2-server[2310]: EV LISTEN: "ietf-hardware" "notif" ID 1 processing. 2023-05-30 06:55:13.051967 PUMA info netopeer2-server[2310]: EV LISTEN: "ietf-hardware" "notif" ID 1 priority 0 success (remaining 0 subscribers). 2023-05-30 06:55:15.750991 PUMA info netopeer2-server[2310]: EV ORIGIN: "/ietf-netconf:get" "rpc" ID 11 priority 0 for 1 subscribers published. 2023-05-30 06:55:15.751408 PUMA info netopeer2-server[2310]: EV LISTEN: "/ietf-netconf:get" "rpc" ID 11 priority 0 processing (remaining 1 subscribers). . . . 2023-05-30 06:55:15.958661 PUMA info netopeer2-server[2310]: EV ORIGIN: "o-ran-processing-element" "done" ID 1 priority 0 succeeded. 2023-05-30 06:55:25.959823 PUMA err netopeer2-server[2310]: Waiting for subscription of "ietf-netconf-notifications" failed, previous event "notif" ID 2 was not processed. 2023-05-30 06:55:25.960015 PUMA err netopeer2-server[2310]: Failed to generate netconf-config-change notification, but changes were applied. 2023-05-30 06:55:25.960415 PUMA info netopeer2-server[2310]: EV LISTEN: "/ietf-netconf:edit-config" "rpc" ID 2 priority 0 fail (remaining 1 subscribers). 2023-05-30 06:55:25.960834 PUMA err netopeer2-server[2310]: Waiting for subscription of "ietf-netconf-notifications" failed, previous event "notif" ID 2 was not processed.

Please suggest.

Thanks!

michalvasko commented 1 year ago

You can again post the whole output but it may simply be that your callback handling the notification is stuck, it then breaks the notifications for the particular module.

ankit7gup commented 1 year ago

You can again post the whole output but it may simply be that your callback handling the notification is stuck, it then breaks the notifications for the particular module.

Please find the complete o/p file attached. netopeer`.txt

michalvasko commented 1 year ago

It is not the whole output, I can see the message

2023-05-29 21:43:18.800970  err netopeer2-server[2312]: Waiting for subscription of "ietf-netconf-notifications" failed, previous event "notif" ID 5 was not processed.

on line 249 but there is no message informing about this event being generated.

EDIT: Okay, with the other "notif" event ID 2 the messages are there.

1970-01-01 00:18:38.648131  info netopeer2-server[2310]: EV LISTEN: "/notifications:create-subscription" "rpc" ID 4 priority 0 processing (remaining 1 subscribers).
1970-01-01 00:18:38.649335  info netopeer2-server[2310]: EV ORIGIN: "/ietf-netconf:edit-config" "rpc" ID 1 priority 0 succeeded.
1970-01-01 00:18:38.649996  info netopeer2-server[2310]: Session 4: thread 0 event new RPC.
1970-01-01 00:18:38.704075  info netopeer2-server[2310]: EV LISTEN: "ietf-netconf-notifications" "notif" ID 2 processing.
1970-01-01 00:18:38.704393  info netopeer2-server[2310]: EV LISTEN: "ietf-netconf-notifications" "notif" ID 2 priority 0 success (remaining 4294967295 subscribers).
1970-01-01 00:18:38.869142  info netopeer2-server[2310]: EV ORIGIN: "/ietf-netconf:get" "rpc" ID 7 priority 0 for 1 subscribers published.
1970-01-01 00:18:39.060677  info netopeer2-server[2310]: EV LISTEN: "/notifications:create-subscription" "rpc" ID 4 priority 0 success (remaining 0 subscribers).
1970-01-01 00:18:39.060789  info netopeer2-server[2310]: EV ORIGIN: "/notifications:create-subscription" "rpc" ID 4 priority 0 succeeded.

and it seems exactly the same as before, are you certain you have the patch applied? Actually, the whole output is weird, there is a message printed when event ID 1 is published

1970-01-01 00:18:38.643754  info netopeer2-server[2310]: EV ORIGIN: "ietf-netconf-notifications" "notif" ID 1 priority 0 for 2 subscribers published.

but no message for ID 2? Makes no sense.

ankit7gup commented 1 year ago

Here, Also, I see same error as previosuly even though we have the patch applied.

1970-01-01 00:18:36.781313 PUMA info netopeer2-server[2312]: EV LISTEN: "ietf-netconf-notifications" "notif" ID 5 processing. 1970-01-01 00:18:36.781651 PUMA info netopeer2-server[2312]: EV LISTEN: "ietf-netconf-notifications" "notif" ID 5 priority 0 success (remaining 4294967295 subscribers).

logs.txt

michalvasko commented 1 year ago

I still do not understand how you could get a log missing some messages but I think there were some problems that could have caused the same problem as before. So try to use commits https://github.com/sysrepo/sysrepo/commit/5e08c3511d1d2c704a7bde8cf58293f00e770426 and https://github.com/sysrepo/sysrepo/commit/d784aa94848af0fa1e0ef80df9488062fe9114d6 as well.

ankit7gup commented 1 year ago

Hi @michalvasko, Now invalid values are not seen, but still RPC fails due to similar reason. Sharing complete log file.

Please check. Thanks! netopeer.txt

ankit7gup commented 1 year ago

Hi Michal, Now invalid values are not seen, but still RPC fails due to similar reason. Sharing complete log file.

Please check. Thanks! netopeer.txt

One more thing I observed from the logs is that before the error starts clock time in logs is mixed up.

2023-05-31 13:16:26.134391 info netopeer2-server[2319]: EV ORIGIN: "ietf-hardware" "change" ID 2 priority 0 succeeded. 2023-05-31 13:16:26.134440 info netopeer2-server[2319]: EV ORIGIN: "ietf-interfaces" "change" ID 1 priority 0 succeeded. 2023-05-31 13:16:26.134467 info netopeer2-server[2319]: EV ORIGIN: "o-ran-performance-management" "change" ID 1 priority 0 succeeded. 2023-05-31 13:16:26.157657 info netopeer2-server[2319]: EV ORIGIN: "ietf-hardware" "done" ID 2 priority 0 for 1 subscribers published. 2023-05-31 13:16:26.158310 info netopeer2-server[2319]: EV ORIGIN: "ietf-interfaces" "done" ID 1 priority 0 for 1 subscribers published. 2023-05-31 13:16:26.158741 info netopeer2-server[2319]: EV ORIGIN: "o-ran-performance-management" "done" ID 1 priority 0 for 1 subscribers published. 2023-05-31 13:15:49.294948 info netopeer2-server[2319]: EV ORIGIN: "ietf-hardware" "done" ID 2 priority 0 succeeded. 2023-05-31 13:15:49.294998 info netopeer2-server[2319]: EV ORIGIN: "ietf-interfaces" "done" ID 1 priority 0 succeeded. 2023-05-31 13:15:49.295025 info netopeer2-server[2319]: EV ORIGIN: "o-ran-performance-management" "done" ID 1 priority 0 succeeded. 2023-05-31 13:15:49.309744 info netopeer2-server[2319]: EV ORIGIN: "ietf-netconf-notifications" "notif" ID 1 priority 0 for 1 subscribers published. 2023-05-31 13:15:49.335614 info netopeer2-server[2319]: EV LISTEN: "/ietf-netconf:edit-config" "rpc" ID 1 priority 0 success (remaining 0 subscribers). 2023-05-31 13:15:49.335989 info netopeer2-server[2319]: EV ORIGIN: "/ietf-netconf:edit-config" "rpc" ID 1 priority 0 succeeded. 2023-05-31 13:15:49.336842 info netopeer2-server[2319]: Session 1: thread 2 event new RPC. 2023-05-31 13:15:49.337792 info netopeer2-server[2319]: EV LISTEN: "ietf-netconf-notifications" "notif" ID 1 processing. 2023-05-31 13:15:49.338187 info netopeer2-server[2319]: EV LISTEN: "ietf-netconf-notifications" "notif" ID 1 ignored, created after the notification. 2023-05-31 13:15:49.338311 info netopeer2-server[2319]: EV LISTEN: "ietf-netconf-notifications" "notif" ID 1 priority 0 success (remaining 1 subscribers).

ankit7gup commented 1 year ago

Hi @michalvasko,

Any update on this issue. Thanks!

Regards, Ankit

michalvasko commented 1 year ago

Please be patient, I cannot always immediately react to every issue, I have been fixing another one. But you seem to be right about the clocks

2023-05-31 13:16:26.158741  info netopeer2-server[2319]: EV ORIGIN: "o-ran-performance-management" "done" ID 1 priority 0 for 1 subscribers published.
2023-05-31 13:15:49.294948  info netopeer2-server[2319]: EV ORIGIN: "ietf-hardware" "done" ID 2 priority 0 succeeded.
...
2023-05-31 13:15:49.978573  info netopeer2-server[2319]: Session 3: thread 1 event new RPC.
2023-05-31 13:16:27.063334  info netopeer2-server[2319]: Session 2: Received an SSH message "request-channel-open" of subtype "session".

This is most weird and seems broken, for almost a second your system time has been moved back 37s. I have used real time timestamps for avoiding a data race before that resulted in a previous error and this behavior breaks it. I will try to somehow avoid completely breaking sysrepo subscriptions like this but notifications are based on real-time so if you do some wild jumps like this, they are not going to work correctly no matter what.

ankit7gup commented 1 year ago

Please be patient, I cannot always immediately react to every issue, I have been fixing another one. But you seem to be right about the clocks

2023-05-31 13:16:26.158741  info netopeer2-server[2319]: EV ORIGIN: "o-ran-performance-management" "done" ID 1 priority 0 for 1 subscribers published.
2023-05-31 13:15:49.294948  info netopeer2-server[2319]: EV ORIGIN: "ietf-hardware" "done" ID 2 priority 0 succeeded.
...
2023-05-31 13:15:49.978573  info netopeer2-server[2319]: Session 3: thread 1 event new RPC.
2023-05-31 13:16:27.063334  info netopeer2-server[2319]: Session 2: Received an SSH message "request-channel-open" of subtype "session".

This is most weird and seems broken, for almost a second your system time has been moved back 37s. I have used real time timestamps for avoiding a data race before that resulted in a previous error and this behavior breaks it. I will try to somehow avoid completely breaking sysrepo subscriptions like this but notifications are based on real-time so if you do some wild jumps like this, they are not going to work correctly no matter what.

Thanks and sorry for rushing. It is quite critical for us, thus asked.

michalvasko commented 1 year ago

I have not really come up with any simple solution so some non-trivial mechanism would have to be added which I am not too keen on implementing. So are these time jumps something common and intentional in your setup, is there some valid reason for them? As long as they do not occur, it should work fine.

jktjkt commented 1 year ago

This is CLOCK_REALTIME, right? One use case which might still break it is NTP synchronization and a host which does not have RTC clock hardware (think embedded with no RTC battery). On such HW, it's common to see a big jump in time once the NTP client has synced the time, and that can only happen after the network has been configured, and that might need an app which uses sysrepo to look up the network settings -- hence the problem.

(Fortunately, it doesn't affect our boxes because they have the RTC clock and battery, except on a first boot, and because we pregenerate network config for systemd-networkd.)

michalvasko commented 1 year ago

There are lots of preconditions for this problem to occur so it is much less common than it may seem. I am aware of NTP time jumps but in this particular case the time has soon jumped back so it really seems like an error. But the core of the matter remains the same, notifications are designed to use real time timestamps as a critical piece of information with some natural assumptions as to their behavior. If your device is liable to break them you should probably not be using them.

@jktjkt In the use-case you mentioned, it should not be too difficult to set it all up so that only after NTP is running notification subscriptions are made, for example. Because they do, in fact, have a strict dependency on correct real time on your system.

ankit7gup commented 1 year ago

I have not really come up with any simple solution so some non-trivial mechanism would have to be added which I am not too keen on implementing. So are these time jumps something common and intentional in your setup, is there some valid reason for them? As long as they do not occur, it should work fine.

These jumps were due to some error/bug with PTP clock. But One time jump still would be a common scenario for us i.e. from the system time changes once while netopeer2-server is still running to sync with master clock. But not again.

michalvasko commented 1 year ago

So I have implemented a fairly simple but messy solution to this, the check is now based on monotonic clock so it should not be affected by these problems. But what I said earlier is definitely true at least when using notification replay, unlucky realtime jumps would cause chaos.

ankit7gup commented 1 year ago

Thanks! Will try this solution.