FRRouting / frr

The FRRouting Protocol Suite
https://frrouting.org/

frr service restart fails with zebra error- zclient_send_message: buffer_write failed to zclient error #17475

Open arjunramu opened 5 days ago

arjunramu commented 5 days ago

Description

frr.log

Our setup:

Site A: Ubuntu 22.04 Linux VM running FRR for BGP
Site B: Catalyst router

FRR is configured with a minimal config, just enough to exchange routes with neighbors.

A BGP session was established between A and B. Running systemctl restart frr failed after the BGP session was established.

Issue: the FRR service restart fails, and the log shows the zclient error zclient_send_message: buffer_write failed to zclient

root@10-1-1-1:~# systemctl restart frr
root@10-1-1-1:~# systemctl status frr
× frr.service - FRRouting
     Loaded: loaded (/lib/systemd/system/frr.service; enabled; vendor preset: enabled)
     Active: failed (Result: start-limit-hit) since Tue 2024-11-12 07:48:55 UTC; 57min ago
       Docs: https://frrouting.readthedocs.io/en/latest/setup.html
    Process: 1494090 ExecStart=/usr/lib/frr/frrinit.sh start (code=exited, status=0/SUCCESS)
    Process: 1495628 ExecStop=/usr/lib/frr/frrinit.sh stop (code=exited, status=0/SUCCESS)
   Main PID: 1494100 (code=exited, status=0/SUCCESS)
        CPU: 1.890s

BGP configuration:

master-10-1-1-1# sh running-config 
Building configuration...

Current configuration:
!
frr version 8.1
frr defaults traditional
hostname master-10-1-1-1
log file /var/log/frr/bgpd.log
log syslog
no ipv6 forwarding
bgp no-rib
service integrated-vtysh-config
username root nopassword
!
router bgp 64512
 bgp router-id 10.1.1.1
 neighbor 10.1.1.2 remote-as 64512
 neighbor 10.1.1.2 update-source 10.1.1.1
 neighbor 10.1.1.2 timers 60 180
 !
 address-family ipv4 unicast
  neighbor 10.1.1.2 next-hop-self
  neighbor 10.1.1.2 route-map DENYALL in
 exit-address-family
exit
!
access-list all seq 5 permit any
!
ip prefix-list denyall seq 5 deny 0.0.0.0/0 le 32
!
route-map DENYALL permit 10
 match ip address prefix-list denyall
exit
!
end

Version

root@10-1-1-1:~# vtysh

Hello, this is FRRouting (version 8.1).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

10-1-1-1# show version
FRRouting 8.1 (master-10-1-1-1).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
configured with:
    '--build=x86_64-linux-gnu' '--prefix=/usr' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-option-checking' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--localstatedir=/var/run/frr' '--sbindir=/usr/lib/frr' '--sysconfdir=/etc/frr' '--with-vtysh-pager=/usr/bin/pager' '--libdir=/usr/lib/x86_64-linux-gnu/frr' '--with-moduledir=/usr/lib/x86_64-linux-gnu/frr/modules' '--disable-dependency-tracking' '--enable-rpki' '--disable-scripting' '--with-libpam' '--enable-doc' '--enable-doc-html' '--enable-snmp' '--enable-fpm' '--disable-protobuf' '--disable-zeromq' '--enable-ospfapi' '--enable-bgp-vnc' '--enable-multipath=256' '--enable-user=frr' '--enable-group=frr' '--enable-vty-group=frrvty' '--enable-configfile-mask=0640' '--enable-logfile-mask=0640' 'build_alias=x86_64-linux-gnu' 'PYTHON=python3'
10-1-1-1#

How to reproduce

Steps to reproduce:

  1. Add the neighbor router configuration and establish a BGP session
  2. Ensure the BGP session is established between A and B
  3. Restart the frr service
  4. The frr service restart fails

The restarts were run in a loop with this script:

root@10-1-1-1:~# cat a.sh
while true ; do sleep 3 ; systemctl restart frr ; systemctl status frr | grep running; if [ $? -eq 1 ]; then     exit 1; fi; done
root@10-1-1-1:~#

root@10-1-1-1:~# cat /tmp/a.log
     Active: active (running) since Thu 2024-11-21 08:54:31 UTC; 5ms ago
     Active: active (running) since Thu 2024-11-21 08:54:39 UTC; 5ms ago
     Active: active (running) since Thu 2024-11-21 08:54:48 UTC; 5ms ago
Job for frr.service failed.
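
The loop in a.sh decides success by grepping the status text for "running". A minimal sketch of the same reproduction loop that relies on exit codes instead, using `systemctl restart` and `systemctl is-active`. The function name `restart_until_failure` and its arguments are hypothetical and not part of the original report:

```shell
#!/bin/sh
# Hypothetical helper: restart a unit repeatedly until a restart fails.
# Usage: restart_until_failure UNIT [MAX_ATTEMPTS] [PAUSE_SECONDS]
restart_until_failure() {
    local unit="$1" max="${2:-20}" pause="${3:-3}" i=1
    while [ "$i" -le "$max" ]; do
        sleep "$pause"
        # Fail if the restart job itself fails, or if the unit is not
        # active once the restart job has returned.
        if ! systemctl restart "$unit" || ! systemctl is-active --quiet "$unit"; then
            echo "attempt $i: $unit failed to restart"
            return 1
        fi
        echo "attempt $i: $unit is active"
        i=$((i + 1))
    done
    return 0
}
```

For example, `restart_until_failure frr 10 3 | tee /tmp/a.log` would record which attempt failed first, which makes it easier to compare against the unit's start limit.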

Expected behavior

Steps -

  1. Add the neighbor router configuration and establish a bgp session
  2. Ensure the bgp session is established between A and B
  3. Restart the frr service
  4. frr restart should be successful

Actual behavior

Steps -

  1. Add the neighbor router configuration and establish a bgp session
  2. Ensure the bgp session is established between A and B
  3. Restart the frr service
  4. frr restart failed

Additional context

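Workaround: stop and then start the frr service instead of restarting it. The log below shows the failed restart, followed by a successful stop and start: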

Nov 21 08:44:25 10-1-1-1 bgpd[42189]: [YAF85-253AP][EC 100663299] buffer_write: write error on fd 15: Broken pipe
Nov 21 08:44:25 10-1-1-1 bgpd[42189]: [X6B3Y-6W42R][EC 100663302] zclient_send_message: buffer_write failed to zclient fd 15, closing
Nov 21 08:44:25 10-1-1-1 zebra[42184]: [QS0NJ-H5QKJ] Zebra final shutdown
Nov 21 08:44:25 10-1-1-1 frrinit.sh[42335]:  * Stopped staticd
Nov 21 08:44:25 10-1-1-1 frrinit.sh[42336]:  * Stopped bgpd
Nov 21 08:44:25 10-1-1-1 frrinit.sh[42337]:  * Stopped zebra
Nov 21 08:44:25 10-1-1-1 systemd[1]: frr.service: Deactivated successfully.
Nov 21 08:44:25 10-1-1-1 systemd[1]: Stopped FRRouting.
Nov 21 08:44:25 10-1-1-1 systemd[1]: frr.service: Start request repeated too quickly.
Nov 21 08:44:25 10-1-1-1 systemd[1]: frr.service: Failed with result 'start-limit-hit'.
Nov 21 08:44:25 10-1-1-1 systemd[1]: Failed to start FRRouting.
Nov 21 08:44:25 10-1-1-1 systemd[1]: frr.service: Triggering OnFailure= dependencies.
Nov 21 08:44:25 10-1-1-1 systemd[1]: frr.service: Failed to enqueue OnFailure= job, ignoring: Unit heartbeat-failed@frr.service not f>
Nov 21 08:44:52 10-1-1-1 systemd[1]: frr.service: Start request repeated too quickly.
Nov 21 08:44:52 10-1-1-1 systemd[1]: frr.service: Failed with result 'start-limit-hit'.
Nov 21 08:44:52 10-1-1-1 systemd[1]: Failed to start FRRouting.
root@10-1-1-1 :~#
root@10-1-1-1 :~#
root@10-1-1-1 :~# systemctl stop frr
root@10-1-1-1 :~# systemctl start frr
root@10-1-1-1 :~# systemctl status frr
● frr.service - FRRouting
     Loaded: loaded (/lib/systemd/system/frr.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2024-11-21 08:50:16 UTC; 2s ago
       Docs: https://frrouting.readthedocs.io/en/latest/setup.html
    Process: 47341 ExecStart=/usr/lib/frr/frrinit.sh start (code=exited, status=0/SUCCESS)
   Main PID: 47350 (watchfrr)
     Status: "FRR Operational"
      Tasks: 13 (limit: 23695)
     Memory: 17.2M
        CPU: 435ms
     CGroup: /system.slice/frr.service
             ├─47350 /usr/lib/frr/watchfrr -d -F traditional zebra bgpd staticd
             ├─47366 /usr/lib/frr/zebra -d -F traditional -A 127.0.0.1 -s 90000000
             ├─47372 /usr/lib/frr/bgpd -d -F traditional --daemon -A 127.0.0.1 -l 10.1.1.1
             └─47379 /usr/lib/frr/staticd -d -F traditional -A 127.0.0.1

Nov 21 08:50:12 10-1-1-1  zebra[47366]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
Nov 21 08:50:12 10-1-1-1  bgpd[47372]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
Nov 21 08:50:12 10-1-1-1  staticd[47379]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
Nov 21 08:50:12 10-1-1-1  watchfrr[47350]: [ZJW5C-1EHNT] restart all process 47351 exited with non-zero status 13
Nov 21 08:50:16 10-1-1-1  watchfrr[47350]: [QDG3Y-BY5TN] bgpd state -> up : connect succeeded
Nov 21 08:50:16 10-1-1-1  watchfrr[47350]: [QDG3Y-BY5TN] zebra state -> up : connect succeeded
Nov 21 08:50:16 10-1-1-1  watchfrr[47350]: [QDG3Y-BY5TN] staticd state -> up : connect succeeded
Nov 21 08:50:16 10-1-1-1  watchfrr[47350]: [KWE5Q-QNGFC] all daemons up, doing startup-complete notify
Nov 21 08:50:16 10-1-1-1 frrinit.sh[47341]:  * Started watchfrr
Nov 21 08:50:16 10-1-1-1 systemd[1]: Started FRRouting.
root@10-1-1-1:~# 
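
The 'start-limit-hit' result in the log above comes from systemd's start rate limiter: the unit was started more often than StartLimitBurst allows within the configured interval, so systemd refused the start. By itself it does not show that a daemon crashed. The actual limits come from the installed unit file and are not shown in this report. A minimal sketch for inspecting them and clearing the counter follows; the function names are hypothetical. It may also be that the later stop and start succeeded simply because the interval had elapsed by then:

```shell
#!/bin/sh
# Hypothetical helper: print the start rate limit settings and the last result.
show_start_limit() {
    systemctl show -p StartLimitBurst -p StartLimitIntervalUSec \
        -p NRestarts -p Result "$1"
}

# Hypothetical helper: reset the failed state (which also resets the start
# rate limit counter), then start the unit again.
clear_start_limit() {
    systemctl reset-failed "$1" && systemctl start "$1"
}
```

Running `show_start_limit frr` before the restart loop and `clear_start_limit frr` after a failure would help separate a rate-limit refusal from a real daemon failure.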


ton31337 commented 3 days ago

Could you enable debug logging (debug bgp updates, debug bgp neighbor) and show us the logs?
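
A minimal sketch of one way to collect what is requested here. It assumes the debugs are entered in configuration mode and saved with write memory so that they survive the frr restart being tested, and that the full spelling of the second command is debug bgp neighbor-events. The function names and the default output path are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: enable BGP debugging and save it to the config, so the
# debugs are still active after the service is restarted.
enable_bgp_debug() {
    vtysh -c 'configure terminal' \
          -c 'debug bgp updates' \
          -c 'debug bgp neighbor-events' \
          -c 'end' \
          -c 'write memory'
}

# Hypothetical helper: save the unit's journal to a file for attaching.
# Note that -e limits the output to the most recent entries.
save_frr_journal() {
    local unit="${1:-frr.service}" out="${2:-/tmp/frr-journal.log}"
    journalctl -xeu "$unit" --no-pager > "$out"
}
```

The intended order is enable_bgp_debug, reproduce the failed restart, then save_frr_journal, attaching the journal together with the file named by the log file line of the configuration.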

arjunramu commented 21 hours ago

I enabled debug logging. The logs are attached:

bgpd.log, frr.log, journalctl_-xeu_frr_service.log