FRRouting / frr

The FRRouting Protocol Suite
https://frrouting.org/
Other
3.27k stars 1.24k forks source link

zebra_apic threads not started after FRR service restart #16747

Open ToshikiRen opened 1 week ago

ToshikiRen commented 1 week ago

Discussed in https://github.com/FRRouting/frr/discussions/16638

Originally posted by **ToshikiRen** August 23, 2024 zebra_apic threads not started after FRR service restart (happens after multiple restarts, not all the time, the issue occurrence is mostly random) The issue is that no routes are sent from routing daemons (e.g., BGP) to the kernel. Questions: 1. From my understanding the zebra_apic are responsible for communication between the FRR daemons and the linux kernel. Is my assumption correct? 2. What could be the cause for the zebra_apic to not be started? I managed to reproduce the issue on a box with the following configuration, without configuring BGP peers: ``` # default to using syslog. /etc/rsyslog.d/45-frr.conf places the log in # /var/log/frr/frr.log # # Note: # FRR's configuration shell, vtysh, dynamically edits the live, in-memory # configuration while FRR is running. When instructed, vtysh will persist the # live configuration to this file, overwriting its contents. If you want to # avoid this, you can edit this file manually before starting FRR, or instruct # vtysh to write configuration to a different file. log syslog informational ! debug zebra events debug zebra packet recv debug zebra kernel debug zebra rib debug zebra nht debug zebra dplane debug zebra nexthop debug zebra neigh debug bgp nht debug bgp zebra ``` The error from the logs when the issue occurs during restart: ``` bgpd[335825]: [VMFZK-56S5Y] bgp_zebra_label_manager_connect: failed connecting synchronous zclient! ``` FRR version: 9.1 Show version output: ```show version FRRouting 9.1 (come-as4581) on Linux(6.6.32-dent). Copyright 1996-2005 Kunihiro Ishiguro, et al. configured with: '--build=x86_64-linux' '--host=aarch64-dent-linux' '--target=aarch64-dent-linux' '--prefix=/usr' '--exec_prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--libexecdir=/usr/libexec' '--datadir=/usr/share' '--sysconfdir=/etc' '--sharedstatedir=/com' '--localstatedir=/var' '--libdir=/usr/lib' '--includedir=/usr/include' '--oldincludedir=/usr/include' '--infodir=/usr/share/info' '--mandir=/usr/share/man' '--disable-silent-rules' '--disable-dependency-tracking' '--with-libtool-sysroot=' '--sbindir=/usr/libexec/frr' '--sysconfdir=/etc/frr' '--localstatedir=/var/run/frr' '--enable-vtysh' '--enable-multipath=64' '--enable-user=frr' '--enable-group=frr' '--enable-vty-group=frrvty' '--enable-configfile-mask=0640' '--enable-logfile-mask=0640' '--disable-doc' '--with-clippy=/usr/lib/clippy' '--disable-capabilities' '--disable-cumulus' '--disable-datacenter' '--disable-fpm' '--disable-grpc' '--disable-ospfapi' '--disable-ospfclient' '--with-libpam' '--disable-protobuf' '--disable-snmp' '--disable-zeromq' 'build_alias=x86_64-linux' 'host_alias=aarch64-dent-linux' 'target_alias=aarch64-dent-linux' 'AR=aarch64-dent-linux-gcc-ar' 'LD=aarch64-dent-linux-ld --sysroot= ' 'OBJCOPY=aarch64-dent-linux-objcopy' 'OBJDUMP=aarch64-dent-linux-objdump' 'RANLIB=aarch64-dent-linux-gcc-ranlib' 'STRIP=aarch64-dent-linux-strip' 'PKG_CONFIG_PATH=/usr/lib/pkgconfig:/usr/share/pkgconfig' 'PKG_CONFIG_LIBDIR=/usr/lib/pkgconfig' 'CC=aarch64-dent-linux-gcc -mbranch-protection=standard -fstack-protector-strong -O2 -D_FORTIFY_SOURCE=2 -Wformat -Wformat-security -Werror=format-security --sysroot=' 'CPPFLAGS=' 'CPP=aarch64-dent-linux-gcc -E --sysroot= -mbranch-protection=standard -fstack-protector-strong -O2 -D_FORTIFY_SOURCE=2 -Wformat -Wformat-security -Werror=format-security' 'CXX=aarch64-dent-linux-g++ -mbranch-protection=standard -fstack-protector-strong -O2 -D_FORTIFY_SOURCE=2 -Wformat -Wformat-security -Werror=format-security --sysroot=' 'PYTHON=/usr/bin/python3-native/python3' ``` When the issue occurs the zebra zserv.api socket is owned by root instead of frr: ``` # ls -l /var/run/frr/zserv.api srwx------ 1 root frr 0 Aug 28 06:20 /var/run/frr/zserv.api ``` Looking into the code it seems the only case for root to own this socket would be to use a TCP connection but it is not the case for our configuration. I have seen this issue on the latest frr release (10.1) as well.
ToshikiRen commented 1 week ago

I have been using the following script to reproduce the issue. I have modified the original service file to permit faster restarts so we could see the issue faster, the issue appears with the default configuration as well, if anyone wants to reproduce it using the default configuration you can change the sleep to match the service configuration.

#!/bin/bash

while true; do
    zebra_count=$(ps -L -p $(pgrep zebra) | grep zebra_apic | wc -l)

    if [ "$zebra_count" -lt 1 ]; then
        echo "Number of zebra apic thread is less than 1. Exiting..."
        date
    pid=$(ps aux | grep "[j]ournalctl -fu frr" | awk '{print $2}')
    kill -9 $pid
    break
    else
    ps -L -p $(pgrep zebra)
            echo "Restarting FRR..."
        date
    pid=$(ps aux | grep "[j]ournalctl -fu frr" | awk '{print $2}')
    kill -9 $pid
    journalctl -fu frr > frr_log.issue &
    sudo systemctl restart frr
        sleep 11
    fi
done
ton31337 commented 1 week ago

As I understand we should see failed connecting synchronous zclient in frr_log.issue, right? But I can't see it running your script.

ToshikiRen commented 1 week ago

If the threads are missing and the owner of the socket is root, then the issue occured. The missing failed connecting synchronous zclient might be related to the debug settings from frr that you have setup. In case that one is missing you can also look for the:

bgpd[9945]: [TBNSW-XXXBM] sendmsg_zebra_rnh: We have not connected yet, cannot send nexthops
bgpd[9945]: [HXW3G-K1M2A] sendmsg_zebra_rnh: sending cmd ZEBRA_NEXTHOP_REGISTER for 20.20.0.1/32 (vrf VRF default)
bgpd[9945]: [YTHK0-FSPPJ][EC 33554500] sendmsg_nexthop: zclient_send_message() failed
ton31337 commented 1 week ago

Could you test this PR https://github.com/FRRouting/frr/pull/16749?

ToshikiRen commented 1 week ago

I still see the issue where the zserv.api is owned by root instead of the frr user on PR #16749

ToshikiRen commented 1 week ago

I looked with a friend at the code and we think that the issue might be in zserv_start. If something goes wrong here the socket could be bound to user root, but it should not be the case since the socket is an unix socket, right?

~# netstat -an | grep zserv
unix  2      [ ACC ]     STREAM     LISTENING     230024   /var/run/frr/zserv.api

The above netstat -an output is from a device for which the issue occured. When it is working the output is the following:

~# netstat -an | grep zserv
unix  2      [ ACC ]     STREAM     LISTENING     337256   /var/run/frr/zserv.api
unix  3      [ ]         STREAM     CONNECTED     337320   /var/run/frr/zserv.api
ton31337 commented 1 week ago

Could you show the whole /var/run/frr directory?

ToshikiRen commented 1 week ago

This is the entire content of the /run/frr from when the issue occurs:

total 16K
drwxr-xr-x  2 frr  frr    280 Sep  5 14:06 .
drwxr-xr-x 24 root root   920 Sep  5 08:38 ..
-rw-r--r--  1 frr  frr      7 Sep  5 14:06 mgmtd.pid
srwxrwx---  1 frr  frrvty   0 Sep  5 14:06 mgmtd.vty
srwx------  1 frr  frr      0 Sep  5 14:06 mgmtd_be.sock
srwx------  1 frr  frr      0 Sep  5 14:06 mgmtd_fe.sock
-rw-r--r--  1 frr  frr      7 Sep  5 14:06 staticd.pid
srwxrwx---  1 frr  frrvty   0 Sep  5 14:06 staticd.vty
-rw-r--r--  1 root root     7 Sep  5 14:06 watchfrr.pid
-rw-r-----  1 root root     0 Sep  5 14:06 watchfrr.started
srwxrwx---  1 root frrvty   0 Sep  5 14:06 watchfrr.vty
-rw-r--r--  1 frr  frr      7 Sep  5 14:06 zebra.pid
srwxrwx---  1 frr  frrvty   0 Sep  5 14:06 zebra.vty
srwx------  1 root frr      0 Sep  5 14:06 zserv.api
ToshikiRen commented 1 week ago

Can you reopen it since the issue persists? P.S: Please let me know if there is anything else that I may provide in order to help