Open ToshikiRen opened 2 months ago
I have been using the following script to reproduce the issue. I modified the original service file to permit faster restarts so that the issue shows up sooner, but it appears with the default configuration as well. To reproduce it with the default configuration, change the sleep to match the service configuration.
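#!/bin/bash
# Restart FRR in a loop until zebra comes up without its zebra_apic thread.
while true; do
    # Count the zebra threads named zebra_apic.
    zebra_count=$(ps -L -p $(pgrep zebra) | grep -c zebra_apic)
    # Find the background journalctl follower started by a previous iteration.
    pid=$(ps aux | grep "[j]ournalctl -fu frr" | awk '{print $2}')
    if [ "$zebra_count" -lt 1 ]; then
        echo "Number of zebra apic thread is less than 1. Exiting..."
        date
        # Stop the log follower (if any) so frr_log.issue keeps the failing run.
        [ -n "$pid" ] && kill -9 $pid
        break
    else
        ps -L -p $(pgrep zebra)
        echo "Restarting FRR..."
        date
        # Replace the log follower so the log only covers the next restart.
        [ -n "$pid" ] && kill -9 $pid
        journalctl -fu frr > frr_log.issue &
        sudo systemctl restart frr
        sleep 11
    fi
done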
As I understand it, we should see `failed connecting synchronous zclient` in `frr_log.issue`, right? But I can't see it when running your script.
If the threads are missing and the owner of the socket is root, then the issue has occurred. The missing `failed connecting synchronous zclient` message might be related to the FRR debug settings you have set up. If that message is missing, you can also look for the following:
bgpd[9945]: [TBNSW-XXXBM] sendmsg_zebra_rnh: We have not connected yet, cannot send nexthops
bgpd[9945]: [HXW3G-K1M2A] sendmsg_zebra_rnh: sending cmd ZEBRA_NEXTHOP_REGISTER for 20.20.0.1/32 (vrf VRF default)
bgpd[9945]: [YTHK0-FSPPJ][EC 33554500] sendmsg_nexthop: zclient_send_message() failed
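The log check described above could be scripted roughly as follows. This is only a sketch: the function name `has_zclient_failure` is hypothetical, and it assumes the log was captured to a file such as `frr_log.issue`, as the reproduction script does. The patterns are copied from the messages quoted in this thread.

```shell
#!/bin/bash
# Hypothetical helper (not part of FRR): returns 0 if the given log file
# contains any of the messages quoted in this thread that indicate a daemon
# could not talk to zebra over the zserv socket.
has_zclient_failure() {
    local log_file="$1"
    grep -qE 'failed connecting synchronous zclient|We have not connected yet, cannot send nexthops|zclient_send_message\(\) failed' "$log_file"
}
```

For example, `has_zclient_failure frr_log.issue && echo "zclient failure logged"` would report a hit without depending on which debug settings are enabled, as long as at least one of the three messages is present.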
Could you test this PR https://github.com/FRRouting/frr/pull/16749?
I still see the issue where `zserv.api` is owned by `root` instead of the `frr` user on PR #16749.
I looked at the code with a friend and we think that the issue might be in `zserv_start`. If something goes wrong there, the socket could be bound to user `root`, but that should not be the case since the socket is a unix socket, right?
~# netstat -an | grep zserv
unix 2 [ ACC ] STREAM LISTENING 230024 /var/run/frr/zserv.api
The above `netstat -an` output is from a device on which the issue occurred. When it is working, the output is the following:
~# netstat -an | grep zserv
unix 2 [ ACC ] STREAM LISTENING 337256 /var/run/frr/zserv.api
unix 3 [ ] STREAM CONNECTED 337320 /var/run/frr/zserv.api
Could you show the whole `/var/run/frr` directory?
This is the entire content of `/run/frr` from when the issue occurs:
total 16K
drwxr-xr-x 2 frr frr 280 Sep 5 14:06 .
drwxr-xr-x 24 root root 920 Sep 5 08:38 ..
-rw-r--r-- 1 frr frr 7 Sep 5 14:06 mgmtd.pid
srwxrwx--- 1 frr frrvty 0 Sep 5 14:06 mgmtd.vty
srwx------ 1 frr frr 0 Sep 5 14:06 mgmtd_be.sock
srwx------ 1 frr frr 0 Sep 5 14:06 mgmtd_fe.sock
-rw-r--r-- 1 frr frr 7 Sep 5 14:06 staticd.pid
srwxrwx--- 1 frr frrvty 0 Sep 5 14:06 staticd.vty
-rw-r--r-- 1 root root 7 Sep 5 14:06 watchfrr.pid
-rw-r----- 1 root root 0 Sep 5 14:06 watchfrr.started
srwxrwx--- 1 root frrvty 0 Sep 5 14:06 watchfrr.vty
-rw-r--r-- 1 frr frr 7 Sep 5 14:06 zebra.pid
srwxrwx--- 1 frr frrvty 0 Sep 5 14:06 zebra.vty
srwx------ 1 root frr 0 Sep 5 14:06 zserv.api
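The ownership difference visible in the listing (`zserv.api` owned by `root` while the other zebra files are owned by `frr`) can be checked directly rather than read off `ls` output. The following is a minimal sketch: the helper name `check_owner` is hypothetical, and it assumes a `stat` that supports `-c '%U'` (GNU coreutils or BusyBox).

```shell
#!/bin/bash
# Hypothetical helper (not part of FRR): prints the owning user of a path
# and returns 0 only if it matches the expected user.
# Returns 2 if the path cannot be inspected at all.
check_owner() {
    local path="$1" expected="$2" owner
    owner=$(stat -c '%U' "$path") || return 2
    echo "$path is owned by $owner (expected $expected)"
    [ "$owner" = "$expected" ]
}
```

In the reproduction loop this could be used as `check_owner /var/run/frr/zserv.api frr || echo "socket has the wrong owner"`, in addition to the thread-count check.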
Can you reopen it, since the issue persists? P.S.: Please let me know if there is anything else I can provide to help.
I'm also seeing the wrong permissions issue for `zserv.api`. This is currently theory-crafted, because I haven't reviewed the code yet to prove this is the case:
`--enable-capabilities`). Currently testing that theory by running the restart script with FRR built with capabilities enabled.
@ToshikiRen do you have capabilities enabled in your package? If not, this might be the issue.
Actually I just saw the thread, and you do have them disabled.
The more I look at the code, the more I am convinced that for `privs_per_process = true` it does not ensure that the code run with `privs = NULL` is actually run without any privs.
AFAICT there is nothing preventing another process from raising the privs, or it being run while privs are raised in these cases, just that raising and lowering privs isn't done in parallel.
And it also fits the observed state: the socket was created as `root` instead of `frr`.
Script still running without failing in a loop.
Looking at https://github.com/FRRouting/frr/commit/7bfe765ae06fcc0a5570fdd793237e5fa828f7e7, building without libcap is already highly discouraged, so maybe libcap should just be mandatory, since building without it isn't just a performance penalty but is actually broken (at least in 9.1+).
Might be easier than trying to fix this.
@eqvinox since you authored that commit, what's your preference here? Trying to fix the per-process privileges code to ensure that zserv creates the unix socket in an unprivileged context (and making it even slower), or making libcap mandatory (and probably ripping out the per-process code)?
I checked the code, and AFAICT zebra's zserv is the only user with NULL privileges here. The code seems to have been that way for quite a while (I only checked 8.2). I did not check whether there are differences that prevent the issue on older versions (e.g. startup order or non-existent threading).
I'm trying an approach that may help with this, in the PR mentioned just above.
See also #17420, where we can track the more general problem in the per-process privs case.
Discussed in https://github.com/FRRouting/frr/discussions/16638