CESNET / netopeer2

NETCONF toolset
BSD 3-Clause "New" or "Revised" License
296 stars 187 forks source link

Configured subscription integration #1440

Open jeremie6wind opened 1 year ago

jeremie6wind commented 1 year ago

Hello, Several commits to handle configured subscriptions. To test this, it was necessary to make manual tests:

jeremie6wind commented 1 year ago

Hello, Thank you for for your feedback. These commits fix all the points you mentioned:

jeremie6wind commented 11 months ago

Hi, I added a test in test_sub_ntf_advanced.c It helped me fix some issues. However, I had 2 things I was not sure: 1) In yang-push.c sr_get_data in yang_push_notif_update_send failed for configured yang push. 2) When doing a reset of a receiver, I was not sure how to do it properly. Especially when Netopeer2 checks the operational data. so i added a test in np2srv_oper_sub_ntf_subscriptions_cb. Thank you. Best regards. Jeremie

michalvasko commented 11 months ago

Firstly, I have rebased this branch to the current devel, you will need the same branch of sysrepo for it to compile. There could have been some useful fixes, more below. However, I have accidentally also merged the commits from this PR when resolving the conflicts so you may want to reverse that, sorry about that. Regarding your questions:

  1. Any more details? Perhaps there is a lock problem, similar issue has been recently fixed so you just try again now.
  2. What you are doing is fine except the notification subscription-started should be sent when moving the subscription from connecting to active state, I do not see it in csn_receiver_reset() unless I missed something. Other than that, operational data will not be readable because you are holding INFO_WLOCK for the whole duration of the action callback. Ideally, you should release it after setting the state to connecting and then reacquire it (safely, being aware that the info structure could have changed in the meantime) to set it to active.

So, thanks for the tests and everything, I have made some minor changes to be able to compile it and the tests produce some meaningful output so please commit it with some of your commits (the author is not important).

I think you can create a separate test file for these tests because there should be several of them. I thought you are actually going to test sending and receiving the notifications but I assume you did that only with your proprietary notification receiver because there is none in netopeer2. Which brings me to my (hopefully) last larger request, add support for notification receiver somewhere to our projects so that it can be properly tested, both manually and in the test suite (so that there is not just setting and reading the configuration). Ideally into libnetconf2 and then it can be used with ease in both cases (netopeer2 test suite and netopeer2-cli).

Other than that I have briefly went over the code and it seemed fine, no major changes will be required, I think. But I will not know unless I can actually see the notification being send and received. Thanks again for all your effort.

jeremie6wind commented 11 months ago

Hi, About my issues of last time:

  1. I would say something is missing in the initialization of the sysrepo session used for configured yang push.
  2. I added some subscription-started in the code.

As you suggested, I reverted the commit with the test. I added your improvements. Then I added a new test test_configured_subscriptions. The test itself starts unyte udp collectors, that receive notifications. So netopeer2-server sends notifications and the test parses the notifications. However, there are still some memory leaks in the test due to unyte_udp_collector.

One implementation of the notification receiver is https://github.com/pmacct/pmacct/tree/master. This implementation uses https://github.com/network-analytics/udp-notif-c-collector.

michalvasko commented 11 months ago

Sorry, but the test is failing for me

[==========] tests: Running 12 test(s).
Address type: IPv6 | 12345
Address type: IPv6 | 12346
Address type: IPv6 | 12347
[ RUN      ] test_configured_subscriptions_receivers
[       OK ] test_configured_subscriptions_receivers
[ RUN      ] test_configured_subscriptions_receivers_modif
[       OK ] test_configured_subscriptions_receivers_modif
[ RUN      ] test_configured_subscriptions_add
tc_12345.xml_data[0][0] != '\0'
[   LINE   ] --- /home/vasko/Documents/netopeer2/tests/test_configured_subscriptions.c:768: error: Failure!Aborted (core dumped)

and I found no simple way of fixing it, seems no notifications are being received for some reason. Other than that, they could use some improvements. Mainly using pthread barriers, for example, for exact synchronization instead of fixed usleep(). Other than that, I think there are way too many few-line functions but that does not matter much, I can fix that afterwards if I really think it necessary.

So, please make sure the tests pass. If they do for you, I should be able to find out what is wrong with some assistance.

jeremie6wind commented 11 months ago

Hi Mishal, Thank you for your reply. I put the csn_send_notif_one inside the 'if' as you suggested. I changed the logs to "Could not send notification ." Also I added a commit that makes the test configured yang push fail.

However you may see it once the other tests pass.

The tests pass on my machine except with valgrind, as I explained earlier. I have no idea why the tests fail on your side. I can help. I have to let you know that I won't be available next week.

jeremie6wind commented 11 months ago

FYI On my side I get: Address type: IPv4 | 12345 Address type: IPv4 | 12346 Address type: IPv4 | 12347

michalvasko commented 11 months ago

Thanks for the changes.

Address type: IPv4 | 12345

That is interesting, any reason why it would not work on IPv6? Because it seems that no notifications are being received.

I have to let you know that I won't be available next week.

No problem, me neither, actually.

jeremie6wind commented 10 months ago

Ok so the udp collector has also 127.0.0.1 local address in the test. The test is fully ipv4 so far. usleep have been replaced by pthread_barrier.

michalvasko commented 10 months ago

I have tested the latest changes and this is the test output

[==========] tests: Running 12 test(s).
Address type: IPv4 | 12345
Address type: IPv4 | 12346
Address type: IPv4 | 12347
[ RUN      ] test_configured_subscriptions_receivers
[       OK ] test_configured_subscriptions_receivers
[ RUN      ] test_configured_subscriptions_receivers_modif
[       OK ] test_configured_subscriptions_receivers_modif
[ RUN      ] test_configured_subscriptions_add
[       OK ] test_configured_subscriptions_add
[ RUN      ] test_configured_subscriptions_reset
[       OK ] test_configured_subscriptions_reset
[ RUN      ] test_configured_subscriptions_modif
[       OK ] test_configured_subscriptions_modif
[ RUN      ] test_configured_subscriptions_modif2
[       OK ] test_configured_subscriptions_modif2
[ RUN      ] test_configured_subscriptions_modif3
[       OK ] test_configured_subscriptions_modif3
[ RUN      ] test_configured_subscriptions_yang_push
[       OK ] test_configured_subscriptions_yang_push
[ RUN      ] test_configured_subscriptions_yang_push_modif
[       OK ] test_configured_subscriptions_yang_push_modif
[ RUN      ] test_configured_subscriptions_back
[       OK ] test_configured_subscriptions_back
[ RUN      ] test_configured_subscriptions_remove
[       OK ] test_configured_subscriptions_remove
[ RUN      ] test_configured_subscriptions_remove_receivers
[       OK ] test_configured_subscriptions_remove_receivers
[ERR] Waiting on a conditional variable failed (sr_remove_modules: Connection timed out).
sr_remove_module() failed (Timeout expired)
[  FAILED  ] GROUP TEARDOWN
[  ERROR   ] tests
[==========] tests: 12 test(s) run.
[  PASSED  ] 12 test(s).

but to get there several fixes were required. I have changed the subscription to configuration changes to support failing with an error because that is what had been happening to me but the error was ignored. I have fixed the error, though, but I think it will be better this way. Question is whether not to remove SR_SUBSCR_DONE_ONLY from all the subscriptions so they can properly fail on any error (I have used it before just to simplify things but I probably shouldn't).

Next, I did some refactoring and additional error checking. The latter could definitely use some improvements so that all the callbacks return proper error to the client and ideally print it as well. Naturally, no errors should be ignored either, so please fix all that.

I tried to look at the current problem but the tests need some work before they can be nicely debugged. But I did notice that there was still a timer thread running in the server so there is likely a problem with stopping the subscriptions. Also, the tests must be independent of each other, each must pass separately, and not rely on any particular setup (the error I had before in the 8th test was caused by some missing data in sysrepo).

Finally, some more refactoring would be nice, to use static functions where applicable and document them all. The files also got quite big so it may make sense to separate some code to another file or simply refactor it so that one can still make sense of it. But if you are fine with it, I can finish this myself after everything works as it should. Again, thanks for the effort.

jeremie6wind commented 10 months ago

Thank you for the feedback. I am currently improving the code accordingly.

However, regarding the timer_helperthread, Isn't it normal to have this timer thread running. Maybe it is due to the implementation of timer* functions? According to my tests, as soon as it starts using timer_create, this thread appears. The thread doesn't disappear even after timer_delete.

michalvasko commented 10 months ago

According to my tests, as soon as it starts using timer_create, this thread appears. The thread doesn't disappear even after timer_delete.

Okay, you have looked at it more than I have, seems fine then.

jeremie6wind commented 10 months ago

Hi, In these new commits: I add the fixes that you provided. However one thing was not working on my side with: if (event != SR_EV_CHANGE) { the callback np2srv_config_subscriptions_cb was never called. I had to replace by: if (event != SR_EV_DONE) {

I added the missing return codes and logs, so no error is ignored anymore. I made the tests independent from each others. I reviewed the udp notifications and now I check the content of most of them. Some of the csn_* functions are now static with their documentation header.

CMakeLists.txt is also updated. Thus we can use a custom udp-notif-c-collector library.

michalvasko commented 10 months ago

However one thing was not working on my side with: if (event != SR_EV_CHANGE) { the callback np2srv_config_subscriptions_cb was never called. I had to replace by: if (event != SR_EV_DONE) {

What exactly means "not working"? The only difference it could make is that when the callback is called, some module locks are still held and can potentially block API calls you make from the callback. The other being the ability to return a proper error.

So, I wanted to test this but it is not even possible to compile, you left out some final changes of a test

[ 68%] Building C object tests/CMakeFiles/test_configured_subscriptions.dir/test_configured_subscriptions.c.o
/home/vasko/Documents/netopeer2/tests/test_configured_subscriptions.c: In function ‘delete_subscription_config’:
/home/vasko/Documents/netopeer2/tests/test_configured_subscriptions.c:274:5: error: ‘result’ undeclared (first use in this function)
  274 |     result = strcmp(st->str, EMPTY_GETCONFIG);
      |     ^~~~~~
/home/vasko/Documents/netopeer2/tests/test_configured_subscriptions.c:274:5: note: each undeclared identifier is reported only once for each function it appears in
/home/vasko/Documents/netopeer2/tests/test_configured_subscriptions.c:278:5: error: ‘result2’ undeclared (first use in this function)
  278 |     result2 = strcmp(st->str, EMPTY_GET);

I have tried to fix it the way I thought was intended but it did not work, so please finish it.

jeremie6wind commented 10 months ago

Sorry, it seems I made some mistakes. I could test again with if (event != SR_EV_CHANGE) { and it worked fine. Also I fixed the build issue.

michalvasko commented 10 months ago

When I try to run all the tests, they get stuck at the global_teardown() module removal and time out. Do they work for you? It seems context read locks are not being released by the server, there are 5 of them at this point. Also, the test itself seems to have up to 6 leftover threads of the UDP notif library, performing some cleanup periodically and stuff like that, would expect them to be terminated by this time.

jeremie6wind commented 10 months ago

When running all the tests, this is the output:

15/37 Testing: test_configured_subscriptions 15/37 Test: test_configured_subscriptions Command: "/root/Netopeer2/build/tests/test_configured_subscriptions" Directory: /root/Netopeer2/build "test_configured_subscriptions" start time: Sep 12 10:49 CEST Output:

[==========] Running 12 test(s). Address type: IPv4 | 12345 Address type: IPv4 | 12346 Address type: IPv4 | 12347 [ RUN ] test_configured_subscriptions_receivers [ OK ] test_configured_subscriptions_receivers [ RUN ] test_configured_subscriptions_receivers_modif [ OK ] test_configured_subscriptions_receivers_modif [ RUN ] test_configured_subscriptions_add [ OK ] test_configured_subscriptions_add [ RUN ] test_configured_subscriptions_reset [ OK ] test_configured_subscriptions_reset [ RUN ] test_configured_subscriptions_modif [ OK ] test_configured_subscriptions_modif [ RUN ] test_configured_subscriptions_modif2 [ OK ] test_configured_subscriptions_modif2 [ RUN ] test_configured_subscriptions_modif3 [ OK ] test_configured_subscriptions_modif3 [ RUN ] test_configured_subscriptions_yang_push [ OK ] test_configured_subscriptions_yang_push [ RUN ] test_configured_subscriptions_yang_push_modif [ OK ] test_configured_subscriptions_yang_push_modif [ RUN ] test_configured_subscriptions_back [ OK ] test_configured_subscriptions_back [ RUN ] test_configured_subscriptions_remove [ OK ] test_configured_subscriptions_remove [ RUN ] test_configured_subscriptions_remove_receivers [ OK ] test_configured_subscriptions_remove_receivers [ERR] Waiting on a conditional variable failed (sr_remove_modules: Connection timed out). [ FAILED ] GROUP TEARDOWN [ ERROR ] tests sr_remove_module() failed (Timeout expired) [==========] 12 test(s) run. [ PASSED ] 12 test(s).

Test time = 17.61 sec ---------------------------------------------------------- Test Passed. "test_configured_subscriptions" end time: Sep 12 10:49 CEST "test_configured_subscriptions" time elapsed: 00:00:17 ---------------------------------------------------------- [...] 32/37 Testing: test_configured_subscriptions_valgrind 32/37 Test: test_configured_subscriptions_valgrind Command: "/usr/bin/valgrind" "--leak-check=full" "--show-leak-kinds=all" "--error-exitcode=1" "/root/Netopeer2/build/tests/test_configured_subscriptions" Directory: /root/Netopeer2/build "test_configured_subscriptions_valgrind" start time: Sep 12 10:52 CEST Output: ---------------------------------------------------------- ==14685== Memcheck, a memory error detector ==14685== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al. ==14685== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info ==14685== Command: /root/Netopeer2/build/tests/test_configured_subscriptions ==14685== [==========] Running 12 test(s). Address type: IPv4 | 12345 Address type: IPv4 | 12346 Address type: IPv4 | 12347 [ RUN ] test_configured_subscriptions_receivers [ OK ] test_configured_subscriptions_receivers [ RUN ] test_configured_subscriptions_receivers_modif [ OK ] test_configured_subscriptions_receivers_modif [ RUN ] test_configured_subscriptions_add [ OK ] test_configured_subscriptions_add [ RUN ] test_configured_subscriptions_reset [ OK ] test_configured_subscriptions_reset [ RUN ] test_configured_subscriptions_modif [ OK ] test_configured_subscriptions_modif [ RUN ] test_configured_subscriptions_modif2 [ OK ] test_configured_subscriptions_modif2 [ RUN ] test_configured_subscriptions_modif3 [ OK ] test_configured_subscriptions_modif3 [ RUN ] test_configured_subscriptions_yang_push [ OK ] test_configured_subscriptions_yang_push [ RUN ] test_configured_subscriptions_yang_push_modif [ OK ] test_configured_subscriptions_yang_push_modif [ RUN ] test_configured_subscriptions_back [ OK ] test_configured_subscriptions_back [ RUN ] test_configured_subscriptions_remove [ OK ] test_configured_subscriptions_remove [ RUN ] test_configured_subscriptions_remove_receivers [ OK ] test_configured_subscriptions_remove_receivers [ERR] Waiting on a conditional variable failed (sr_remove_modules: Connection timed out). [ FAILED ] GROUP TEARDOWN [ ERROR ] tests sr_remove_module() failed (Timeout expired) [==========] 12 test(s) run. [ PASSED ] 12 test(s). When executing the test, I sometimes get this output: [WRN] Recovered a read-lock of CID 12 (sr_remove_modules). There are some memory leaks in the test itself due to the udp collector set to receive the udp notifications. When removing the listener in the test, there are no leaks anymore.
michalvasko commented 10 months ago

Well, yes, my output is the same and like I said, this is not okay despite the print that all the tests passed. The following module removal must not time out and instead succeed. So there are still some lock problems, please fix them, with some basic information I provided in my previous reply.

As for the listener leaks, please remove it then, in the teardown or wherever, the tests need to pass with valgrind and there is generally no good reason for not cleaning up after yourself.

jeremie6wind commented 10 months ago

Ok, I replaced the collector code by simpler udp socket code. This removes the valgrind issues. Also I fixed the issue with the timeout. It was a missing sr_release_data. Finally, the last patch fixes a wrong index:

-            sub_id = sub->sub_ids[0];
+            sub_id = sub->sub_ids[idx];
jeremie6wind commented 10 months ago

Thanks for your last answer. I fixed the code according to the previous comments. I will work on the last suggestions and get back to you if it is necessary to schedule a call.

michalvasko commented 10 months ago

An important update. After some discussion with @jktjkt it has been decided that no notification subscriptions will be supported directly by netopeer2 and instead implemented by a new daemon sysrepo-notifd in sysrepo that will pass any relevant notifications to the respective server such as netopeer2. This solution has many advantages:

The relevant consequence of this for you is that once the daemon is finished, this is my idea of how UDP notifications will work. There will be a separate project implementing a "UDP server" (can even be hosted by 6wind directly and netopeer2 will mention a link) that will install all its YANG modules for configured subscriptions/UDP transport on its own into sysrepo. Then, it will handle any configuration changes in the modules and send a specific RPC that will create a new subscription to sysrepo-notifd. It will be very similar to establish-subscription except also include a path to a file-descriptor that the UDP server will listen on and read the actual notifications from. These will then be sent to the configured clients. Note that the subscriptions could still be created by NETCONF (or any other source of YANG data) but they will not be handled specially in netopeer2-server at all, the configuration data will simply be written into sysrepo and then picked up by the UDP server.

So, it is up to you to decide what you want to do in the meantime. The last changes we talked about are no longer relevant since it will all be in a separate project but most of the functionality you implemented will still be used. I should start the implementation of sysrepo-notifd today but it is hard to guess how long it will take, several weeks at least, maybe even a few months. Also, during that time the design may change so I cannot guarantee it will all work the way I have just described.

jeremie6wind commented 10 months ago

Ok, I used SR_SUBSCR_OPER_MERGE and I kept only one oper callback for the moment. I could remove np2srv_oper_sub_ntf_receivers_cb. I read your last comment. We ll discuss internally to see how we proceed with sysrepo-notifd and "UDP server".

jlesk commented 9 months ago

Hi, I am very interested in the changes that you described last time regarding sysrepo-notifd and 'UDP server'. I wanted to know if you had any updates on this subject? Also do you intend to push the first commits of this pull request to the devel branch? Finally, tell me if I can be of any help in this. Thanks

michalvasko commented 9 months ago

I have made a fair progress with sysrepo-notifd (some basic tests are already working) but had to postpone it because of some other work for maybe a week now. Unfortunately, it is really difficult to estimate when it may be finished and merged into sysrepo, into the devel branch so you will have to wait. Hopefully, the next week I will resume working on it. Appreciate the offer to help but there is no way you can help me, not sure about the UDP notification server.

But I can mention some relevant information that is unlikely to change now. To use sysrepo-notifd, one needs to create a named pipe and then send an RPC /sysrepo-notif:notif-subscribe with its path and optional NACM username (via sysrepo). After that the pipe can be read for the actual notifications but there will be utility functions available for either polling and reading the notifications or for creating a new thread with a callback that will automatically read new notifications. The main work left for this UDP server will be to properly implement all the special behavior of configured notifications, sysrepo-notifd will support only the base dynamic subscriptions (there are some additional special notifications and other things with configured feature in ietf-subscribed-notifications).

jktjkt commented 9 months ago

Thanks for the update.

one needs to create a named pipe and then send an RPC /sysrepo-notif:notif-subscribe with its path and optional NACM username (via sysrepo)

Any particular reason for a YANG-level RPC here, and for a pipe that's visible in the filesystem? I would actually prefer a C library call that would just return a FD for reading. That way, there's:

michalvasko commented 9 months ago

The reason being that that is how we designed it before. Also, it would allow for more code being moved to the new daemon (such as operational data reporting) but your worries seem valid and the proposed change should work so I will try to implement it.