FRRouting / frr

The FRRouting Protocol Suite
https://frrouting.org/
Other
3.33k stars 1.25k forks source link

bfdd crash on startup sometimes #10221

Closed tufeigunchu closed 2 years ago

tufeigunchu commented 2 years ago

I am running docker version of FRR 8.0.1.
enabled: bgpd staticd bfdd startup log:

router bgp 3200 router bgp 3800 bgp router-id 132.15.61.61 bgp router-id 132.15.61.63
2021/12/15 01:07:49 WATCHFRR: [T83RR-8SM5G] watchfrr 8.0.1_git starting: vty@0
2021/12/15 01:07:49 WATCHFRR: [ZCJ3S-SPH5S] zebra state -> down : initial connection attempt failed
2021/12/15 01:07:49 WATCHFRR: [ZCJ3S-SPH5S] bgpd state -> down : initial connection attempt failed
2021/12/15 01:07:49 WATCHFRR: [ZCJ3S-SPH5S] staticd state -> down : initial connection attempt failed
2021/12/15 01:07:49 WATCHFRR: [ZCJ3S-SPH5S] bfdd state -> down : initial connection attempt failed
2021/12/15 01:07:49 WATCHFRR: [YFT0P-5Q5YX] Forked background command [pid 40]: /usr/lib/frr/watchfrr.sh restart all
Cannot stop bfdd: pid file not found
Cannot stop bgpd: pid file not found
Cannot stop zebra: pid file not found
Cannot stop staticd: pid file not found
2021/12/15 01:07:51 ZEBRA: [NNACN-54BDA][EC 4043309110] Disabling MPLS support (no kernel support)
2021/12/15 01:07:54 WATCHFRR: [QDG3Y-BY5TN] zebra state -> up : connect succeeded
2021/12/15 01:07:54 WATCHFRR: [YFT0P-5Q5YX] Forked background command [pid 70]: /usr/lib/frr/watchfrr.sh restart staticd
2021/12/15 01:07:54 WATCHFRR: [YFT0P-5Q5YX] Forked background command [pid 73]: /usr/lib/frr/watchfrr.sh restart bfdd
Cannot stop staticd: pid file not found
Cannot stop bfdd: pid file not found
2021/12/15 01:07:56 STATIC: [NTCMZ-C8X80][EC 100663303] Could not lock pid_file /var/run/frr/staticd.pid (Resource temporarily unavailable), exiting.  Please ensure that the daemon is not already running
2021/12/15 01:07:58 BFD: [JYK95-PZFVY] bind-ip: bind: Address in use
2021/12/15 01:07:58 BFD: bfdd/bfd_packet.c:919: bp_bind_ip(): assertion (!"bind-ip: bind: %s") failed
BFD: Received signal 6 at 1639530478 (si_addr 0x6400000066, PC 0x7faec1c2a3f0); aborting...
BFD: no thread information available
core_handler: showing active allocations in memory group libfrr
core_handler: memstats:  Host config                   :      2 * (variably sized)
core_handler: memstats:  Command Argument Name         :    139 * (variably sized)
core_handler: memstats:  Command Token Help            :    588 * (variably sized)
core_handler: memstats:  Command Token Text            :    588 * (variably sized)
core_handler: memstats:  Command Tokens                :    744 *         72
core_handler: memstats:  Graph Node                    :    896 *         32
core_handler: memstats:  Graph                         :      8 *          8
core_handler: memstats:  Hash Index                    :     13 * (variably sized)
core_handler: memstats:  Hash Bucket                   :    208 *         32
core_handler: memstats:  Hash                          :     26 * (variably sized)
core_handler: memstats:  Link Node                     :     17 *         24
core_handler: memstats:  Link List                     :      5 *         40
core_handler: memstats:  Northbound Configuration      :      2 *         16
core_handler: memstats:  Northbound Node               :    159 *       1192
core_handler: memstats:  Privilege information         :      3 * (variably sized)
core_handler: memstats:  Thread Poll Info              :      2 *    8388608
core_handler: memstats:  Thread master                 :      4 * (variably sized)
core_handler: memstats:  Typed-hash bucket             :      1 *         16
core_handler: memstats:  Vector index                  :   1812 * (variably sized)
core_handler: memstats:  Vector                        :   1812 *         16
core_handler: memstats:  VRF                           :      1 *        216
core_handler: memstats:  YANG module                   :      4 *         48
core_handler: showing active allocations in memory group logging subsystem
core_handler: showing active allocations in memory group Bidirectional Forwarding Detection Daemon
core_handler: memstats:  BFD VRF                       :      1 *         80
/usr/lib/frr/frrcommon.sh: line 170:   102 Aborted                 /usr/lib/frr/bfdd -d -F traditional
Failed to start bfdd!
2021/12/15 01:07:59 WATCHFRR: [QDG3Y-BY5TN] bgpd state -> up : connect succeeded
2021/12/15 01:07:59 WATCHFRR: [QDG3Y-BY5TN] staticd state -> up : connect succeeded
2021/12/15 01:07:59 WATCHFRR: [QDG3Y-BY5TN] bfdd state -> up : connect succeeded
2021/12/15 01:07:59 WATCHFRR: [KWE5Q-QNGFC] all daemons up, doing startup-complete notify
% Can't open configuration file /etc/frr/vtysh.conf due to 'No such file or directory'.
2021/12/15 01:08:09 WATCHFRR: [ZE9RA-19PS5] restart all child process 40 still running after 20 seconds, sending signal 15
2021/12/15 01:08:09 WATCHFRR: [SK7QP-A2GT9] restart all process 40 terminated due to signal 15
2021/12/15 01:08:09 WATCHFRR: [RKHTV-CNGEG] Daemon: staticd: is in Up state but expected it to be in DAEMON_DOWN state
% Can't open configuration file /etc/frr/vtysh.conf due to 'No such file or directory'.
2021/12/15 01:08:11 WATCHFRR: [RKHTV-CNGEG] Daemon: bfdd: is in Up state but expected it to be in DAEMON_DOWN state

frr.conf:

log file /tmp/bgpd.log
debug bgp nht
debug bgp updates in
debug bgp updates out
debug bgp zebra
ip route 16.15.15.132/32 162.15.15.253 table 1
ip route 16.15.16.132/32 162.15.16.253 table 2
ipv6 route 16:15:15::132/128 162:15:15::253 table 1
ipv6 route 16:15:16::132/128 162:15:16::253 table 2
ip route 16.15.15.132/32 162.15.15.253
ip route 16.15.16.132/32 162.15.16.253
ipv6 route 16:15:15::132/128 162:15:15::253
ipv6 route 16:15:16::132/128 162:15:16::253
!
route-map access-in permit 10
 set table 1
!
route-map core-in permit 10
 set table 2
!
route-map access-out permit 10
!
route-map core-out permit 10
!
router bgp 3800
bgp router-id 132.15.61.63
 no bgp ebgp-requires-policy
 no bgp network import-check
 neighbor 16.15.15.132 remote-as 3000
 neighbor 16.15.15.132 ebgp-multihop 255
 neighbor 16.15.15.132 update-source lo-loopback-10
 neighbor 16.15.16.132 remote-as 4000
 neighbor 16.15.16.132 ebgp-multihop 255
 neighbor 16.15.16.132 update-source lo-loopback-11
 neighbor 16:15:15::132 remote-as 3000
 neighbor 16:15:15::132 ebgp-multihop 255
 neighbor 16:15:15::132 update-source lo-loopback-10
 neighbor 16:15:16::132 remote-as 4000
 neighbor 16:15:16::132 ebgp-multihop 255
 neighbor 16:15:16::132 update-source lo-loopback-11
 !
 address-family ipv4 unicast
  network 15.1.0.0/16
  redistribute connected
  neighbor 16.15.15.132 distribute-list access out
  neighbor 16.15.15.132 route-map access-in in
  neighbor 16.15.15.132 route-map access-out out
  neighbor 16.15.16.132 distribute-list core out
  neighbor 16.15.16.132 route-map core-in in
  neighbor 16.15.16.132 route-map core-out out
 exit-address-family
 !
 address-family ipv6 unicast
  network 1534:1111:1111::/64
  redistribute connected
  neighbor 16:15:15::132 activate
  neighbor 16:15:15::132 distribute-list access out
  neighbor 16:15:15::132 route-map access-in in
  neighbor 16:15:15::132 route-map access-out out
  neighbor 16:15:16::132 activate
  neighbor 16:15:16::132 distribute-list core out
  neighbor 16:15:16::132 route-map core-in in
  neighbor 16:15:16::132 route-map core-out out
 exit-address-family
!
access-list access seq 5 permit any
access-list core seq 5 permit any
!
ipv6 access-list access seq 5 permit any
ipv6 access-list core seq 5 permit any
!

Not happen everytime. I can't figure out the cause.

idryzhov commented 2 years ago

The cause is right there in your logs:

2021/12/15 01:07:58 BFD: [JYK95-PZFVY] bind-ip: bind: Address in use

It doesn't crash. It can't start because the address/port is already in use. Looks like you already have running bfdd on your system sometimes.

@frrbot autoclose in 1 day

frrbot[bot] commented 2 years ago

This issue will be automatically closed in the specified period unless there is further activity.

tufeigunchu commented 2 years ago

That's strange. docker container is in different namespace. No exist bfdd can affect it I think

frrbot[bot] commented 2 years ago

This issue will no longer be automatically closed.

idryzhov commented 2 years ago

I'm not sure why this is happening on your system, but this is definitely it. staticd warns about the same thing:

2021/12/15 01:07:56 STATIC: [NTCMZ-C8X80][EC 100663303] Could not lock pid_file /var/run/frr/staticd.pid (Resource temporarily unavailable), exiting.  Please ensure that the daemon is not already running

You need to find the reason why staticd and bfdd are not properly stopped before starting a new instance.

tufeigunchu commented 2 years ago

Since it can restart successfully, I close this issue now