clicon / clixon

YANG-based toolchain including NETCONF and RESTCONF interfaces and an interactive CLI
http://www.clicon.org/
Other
212 stars 71 forks source link

Calling clixon_backend -z does not (always) stop clixon_restconf #420

Open troglobit opened 1 year ago

troglobit commented 1 year ago

While debugging issues with running the regression tests in PR #418, we noticed that test_restconf_internal.sh often failed, and always at the same step:

Test 41(41) [11. Get restconf status rpc]
Test 42(42) [send rpc status]
grep: warning: stray \ before -
grep: warning: stray \ before -
Test 43(43) [Kill backend]
Feb 10 12:27:07: Killing old daemon with pid: 60371
Test 44(44) [Restart backend -s none]
Test 45(45) [kill old backend]
Test 46(46) [start backend -s none -f /var/tmp/./test_restconf_internal.sh/conf.xml]
Test 47(47) [wait backend]
Test 48(48) [wait restconf]
speed 38400 baud; line = 0;
intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>;
eol2 = <undef>; swtch = <undef>; start = ^Q; stop = ^S; susp = ^Z; rprnt = ^R;
werase = ^W; lnext = ^V; flush = ^O; min = 1; time = 0;
-brkint -imaxbel
Test 49(49) [12. Get restconf (running) after restart]
Test 50(50) [send rpc status]
grep: warning: stray \ before -
grep: warning: stray \ before -

Error in Test50 [send rpc status]:
Expected
^<rpc-reply xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="42"><active xmlns="http://clicon.org/lib">true</active><description xmlns="http://clicon.org/lib">Clixon RESTCONF process</description><command xmlns="http://clicon.org/lib">/.*/clixon_restconf -f /var/tmp/./test_restconf_internal.sh/conf.xml -D [0-9] .*</command><status xmlns="http://clicon.org/lib">running</status><starttime xmlns="http://clicon.org/lib">20[0-9][0-9]\-[0-9][0-9]\-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\.[0-9]*Z</starttime><pid xmlns="http://clicon.org/lib">0</pid></rpc-reply>$

Received: 
#889
<rpc-reply xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="42"><active xmlns="http://clicon.org/lib">false</active><description xmlns="http://clicon.org/lib">Clixon RESTCONF process</description><command xmlns="http://clicon.org/lib">/usr/local/sbin/clixon_restconf -f /var/tmp/./test_restconf_internal.sh/conf.xml -D 0 -l s -R &lt;restconf xmlns="http://clicon.org/restconf"&gt;&lt;enable&gt;true&lt;/enable&gt;&lt;auth-type&gt;none&lt;/auth-type&gt;&lt;debug&gt;0&lt;/debug&gt;&lt;log-destination&gt;syslog&lt;/log-destination&gt;&lt;enable-core-dump&gt;false&lt;/enable-core-dump&gt;&lt;pretty&gt;false&lt;/pretty&gt;&lt;socket&gt;&lt;namespace&gt;default&lt;/namespace&gt;&lt;address&gt;0.0.0.0&lt;/address&gt;&lt;port&gt;80&lt;/port&gt;&lt;ssl&gt;false&lt;/ssl&gt;&lt;/socket&gt;&lt;/restconf&gt;</command><status xmlns="http://clicon.org/lib">stopped</status></rpc-reply>
##

We've reproduced this on three different machines, all running a Ubuntu 18.04 or 22.04 derivative, despite the tests running in an Alpine Linux docker container. (On an old Fedora 33 server, however, all tests pass ... unclear why, but worth mentioning for reference.)

Turns out that at the end of step "11. Get restconf status rpc", when stop_backend -f $cfg is called, the clixon_backend -z -f $cfg does not stop the clixon_restconf child. So in step "12. Get restconf (running) after restart" the rpcstatus call fails.

Adding a killall "fixes" the problem:

@@ -320,6 +320,7 @@ if [ $BE -ne 0 ]; then
     fi
     # kill backend
     stop_backend -f $cfg
+    killall clixon_restconf
 fi

 # Restconf is enabled and restconf was running but was killed by stop ^.

Obviously this is not the proper fix. Question is why clixon_backend doesn't kill the child it started, or why a new clixon_backend cannot reconnect with an already running clixon_restconf process? I'm not up to speed how that dynamic works, but maybe someone else in the wider Clixon community do.

There were a few other warnings as well, for instance the "stray \ before -" grep warning. This seems to just be a minor regexp mistake (- not needed outside of regexp ranges):

@@ -115,7 +115,7 @@ EOF
         err "No pid return value" "$retx"
     fi
     if $active; then
-        expect="^<rpc-reply $DEFAULTNS><active $LIBNS>$active</active><description $LIBNS>Clixon RESTCONF process</description><command $LIBNS>/.*/clixon_restconf -f $cfg -D [0-9] .*</command><status $LIBNS>$status</status><starttime $LIBNS>20[0-9][0-9]\-[0-9][0-9]\-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\.[0-9]*Z</starttime><pid $LIBNS>$pid</pid></rpc-reply>$"
+        expect="^<rpc-reply $DEFAULTNS><active $LIBNS>$active</active><description $LIBNS>Clixon RESTCONF process</description><command $LIBNS>/.*/clixon_restconf -f $cfg -D [0-9] .*</command><status $LIBNS>$status</status><starttime $LIBNS>20[0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\.[0-9]*Z</starttime><pid $LIBNS>$pid</pid></rpc-reply>$"
     else
         # inactive, no startime or pid
         expect="^<rpc-reply $DEFAULTNS><active $LIBNS>$active</active><description $LIBNS>Clixon RESTCONF process</description><command $LIBNS>/.*/clixon_restconf -f $cfg -D [0-9] .*</command><status $LIBNS>$status</status></rpc-reply>$"

Compared to the Fedora machine we also noticed that restoring the tty with stty also gives some noise. Here's a lib.sh patch:

@@ -638,7 +638,7 @@ function wait_restconf(){
         sleep 2 # some problems with valgrind
     fi

-  stty $STTYSETTINGS
+  stty $STTYSETTINGS >/dev/null

Finally, in lib.sh, there's an additional pkill that seems a bit malplaced:

@@ -554,7 +554,7 @@ function stop_backend(){
         sleep 1
         checkvalgrind
     fi
-    sudo pkill -f clixon_backend # extra ($BUSER?)
+#    sudo pkill -f clixon_backend # extra ($BUSER?)
 }

 # Wait for restconf to stop sending  502 Bad Gateway
olofhagsand commented 1 year ago

I think that test_restconf_internal.sh suffers from some quality problems, ie the test itself, it was somewhat problematic writing it because of timing and multile processes. Thank you for your effort.

troglobit commented 1 year ago

No worries. Do you want us to look into the shutting down of clixon_restconf when the backend receives SIGTERM?

olofhagsand commented 1 year ago

The "internal" restconf is one of three ways to start the restconf daemon, as described here: https://clixon-docs.readthedocs.io/en/latest/restconf.html#datastore ie by forking from the backend. it is probably not the most common scenario (such as starting the restconf daemon separately) but there are products using this method. test_restconf_internal.sh tests this. But as you point out there are some issues. If you want you are welcome to debug this, at least the symtom: what more specifically is wrong, and how to repeat it deterministically. There is some docs here on starting sub-processes.

troglobit commented 1 year ago

Alright, we'll look into it. I'll keep you posted, cheers!