FRRouting / frr

The FRRouting Protocol Suite
https://frrouting.org/

RIP routes marked inactive and not being replaced #5174

Open seanfulton opened 4 years ago

seanfulton commented 4 years ago

We are using the FRR RPM frr-7.0-01.el6.x86_64 on CentOS 6. We used Quagga with no problems until about a month ago, when we upgraded to FRR. Since then, machines have been randomly losing their default route. When I examine the routing table, the default route is marked as a RIP route but inactive.

This seems similar to: https://github.com/FRRouting/frr/issues/4535

About our network: We have two border routers running zebra. Each gets a default route via BGP and advertises it to the network using RIP. We have a static IP (#.#.#.254) that floats from router to router that non-RIP devices can use as a default GW.

When the hang occurs, I see this:

R 0.0.0.0/0 [120/2] via 10.10.2.254 inactive, 06:53:00
R>* 10.0.0.9/32 [120/2] via 10.10.2.2, bond1, 00:42:50
R>* 10.0.0.10/32 [120/2] via 10.10.2.2, bond1, 00:42:50
R>* 10.0.3.0/24 [120/2] via 10.10.2.34, bond1, 00:10:24
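
A stuck entry like the first line above can be spotted mechanically. The following is a minimal sketch, assuming the `show ip route` text has already been captured into a string; the function name `find_inactive_routes` and the regular expression are illustrative and not part of FRR. It only understands IPv4 lines with a `via` next hop, in the shape shown in this report.

```python
import re

# Matches IPv4 lines from "show ip route" such as:
#   R 0.0.0.0/0 [120/2] via 10.10.2.254 inactive, 06:53:00
#   R>* 10.0.0.9/32 [120/2] via 10.10.2.2, bond1, 00:42:50
# Connected routes ("is directly connected") are intentionally not matched.
ROUTE_RE = re.compile(
    r"^(?P<code>[A-Za-z])(?P<flags>[>*]*)\s+"
    r"(?P<prefix>\d+\.\d+\.\d+\.\d+/\d+)\s+"
    r"\[(?P<distance>\d+)/(?P<metric>\d+)\]\s+"
    r"via\s+(?P<nexthop>\d+\.\d+\.\d+\.\d+)"
    r"(?P<inactive>\s+inactive)?"
)


def find_inactive_routes(show_ip_route_text):
    """Return (prefix, nexthop) pairs whose next hop is marked inactive."""
    stuck = []
    for line in show_ip_route_text.splitlines():
        match = ROUTE_RE.match(line.strip())
        if match and match.group("inactive"):
            stuck.append((match.group("prefix"), match.group("nexthop")))
    return stuck


# Sample copied from the output in this report.
sample = """\
R 0.0.0.0/0 [120/2] via 10.10.2.254 inactive, 06:53:00
R>* 10.0.0.9/32 [120/2] via 10.10.2.2, bond1, 00:42:50
R>* 10.0.0.10/32 [120/2] via 10.10.2.2, bond1, 00:42:50
R>* 10.0.3.0/24 [120/2] via 10.10.2.34, bond1, 00:10:24
"""

print(find_inactive_routes(sample))  # [('0.0.0.0/0', '10.10.2.254')]
```

Running this periodically against fresh output would at least flag the condition before a node loses connectivity; it does not explain or fix the underlying behavior.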

If I restart FRR, it immediately picks up a new default via RIP from 10.10.1.1 or 10.10.2.1, depending.

So my theory is that something causes the .254 address to flip over from say router A to router B.

My feeling is that if this .254 address becomes inactive, the route should be flushed from the routing table and a new route learned via RIP from either 10.10.1.1 or 10.10.1.2. Instead, the old route hangs.

Any idea why?

ripd.conf:

log file /var/log/zebra.log
!debug rip events
!debug rip zebra
!debug rip packet

!
interface bond0
ip rip split-horizon
no ip rip authentication mode
!
interface bond1
ip rip split-horizon
no ip rip authentication mode
!

router rip
version 2
timers basic 15 30 30
redistribute kernel 
no redistribute connected
no redistribute static

network 74.201.36.0/22
network 74.201.40.0/22
network 172.81.88.0/22
network 10.0.0.0/8

line vty

zebra.conf:

!
interface bond0
 ip address 10.10.1.25/24
 description "Primary LAN" 
link-detect
! ipv6 nd suppress-ra
!
interface bond1
 ip address 10.10.2.25/24
 description "Backup LAN" 
link-detect
! ipv6 nd suppress-ra
!
interface lo
!

ip forwarding

line vty

seanfulton commented 4 years ago

More info. I found that this 0.0.0.0 -> 10.10.1.254 is not coming from the router but from three of our ubuntu nodes (running FRR 7.1):

nj34.onecount.net> sh ip ro
Codes: K - kernel route, C - connected, S - static, R - RIP, O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP, T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP, F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued route, r - rejected route

K> 0.0.0.0/0 [0/0] via 10.10.1.254, primary-lan, 02w2d00h
R> 10.0.0.9/32 [120/2] via 10.10.2.2, backup-lan, 20:50:45
R> 10.0.0.10/32 [120/2] via 10.10.2.2, backup-lan, 20:50:45
C> 10.10.1.0/24 is directly connected, primary-lan, 02w2d00h
R> 10.10.1.254/32 [120/2] via 10.10.2.1, backup-lan, 01w3d14h
C> 10.10.2.0/24 is directly connected, backup-lan, 02w2d00h
R> 10.10.2.254/32 [120/2] via 10.10.1.1, primary-lan, 02:14:06
R> 10.10.4.1/32 [120/2] via 10.10.1.26, primary-lan, 04:46:44
R> 10.10.4.2/32 [120/2] via 10.10.1.27, primary-lan, 02:29:39
R> 10.10.4.3/32 [120/2] via 10.10.1.26, primary-lan, 04:46:44
R> 10.10.4.4/32 [120/2] via 10.10.1.26, primary-lan, 04:46:44
R> 10.10.4.5/32 [120/2] via 10.10.1.27, primary-lan, 02:29:39
R> 10.10.4.7/32 [120/2] via 10.10.1.25, primary-lan, 10:23:44
R> 10.10.4.8/32 [120/2] via 10.10.1.25, primary-lan, 10:23:44
R> 10.10.4.9/32 [120/2] via 10.10.1.19, primary-lan, 00:43:00
R> 10.10.4.11/32 [120/2] via 10.10.1.31, primary-lan, 1d06h33m

This comes from netplan (default routes added for each LAN segment).

So to sum up, machine 25 is getting a default route via 10.10.1.254 from machine 34 via rip. It is also getting default from 10.10.1.1 and 10.10.1.2 from BGP. Something is happening (I guess to machine 34 now) that is making the route inactive ... so why isn't RIP timing that route out and picking up the default from one of the two routers?

seanfulton commented 4 years ago

I took the default routes out of netplan.yaml on nj34 and ran netplan apply. The kernel routes above stayed in the routing table, so I deleted both with ip route del 0.0.0.0/0.

I then ran netstat -nr | grep 0.0.0.0 several times and watched the default route get acquired from different machines in my network. Until it stopped and there was no more default route. Curious, I logged into zebra and did a sh ip ro, and got the following:

nj34.onecount.net> sh ip ro
Codes: K - kernel route, C - connected, S - static, R - RIP, O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP, T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP, F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued route, r - rejected route

R 0.0.0.0/0 [120/2] via 10.10.2.254 inactive, 00:00:15
R> 10.0.0.9/32 [120/2] via 10.10.1.2, primary-lan, 00:05:09
R> 10.0.0.10/32 [120/2] via 10.10.1.2, primary-lan, 00:05:09
C> 10.10.1.0/24 is directly connected, primary-lan, 00:05:10
R> 10.10.1.254/32 [120/2] via 10.10.2.1, backup-lan, 00:05:09
C> 10.10.2.0/24 is directly connected, backup-lan, 00:05:10
R> 10.10.2.254/32 [120/2] via 10.10.1.1, primary-lan, 00:04:58
R> 10.10.4.1/32 [120/2] via 10.10.2.26, backup-lan, 00:05:09
R> 10.10.4.2/32 [120/2] via 10.10.1.27, primary-lan, 00:05:09
R>* 10.10.4.3/32 [120/2] via 10.10.2.26, backup-lan, 00:05:09

So even after I deleted the route manually, it is being held (long past all timers). I finally restarted frr and it picked up the default from one of the routers.
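
To make "long past all timers" concrete: `timers basic 15 30 30` in the ripd.conf above sets the update, timeout, and garbage-collection timers in seconds, in that order. A route that stops being refreshed should therefore be invalidated after 30 seconds and removed about 30 seconds later. The sketch below does that arithmetic against an age taken from `show ip route`. The helper names are hypothetical, only `hh:mm:ss` ages are handled (not forms like `02w2d00h`), and it assumes the displayed age roughly tracks the time since the route was last changed, which may not match ripd's internal timers exactly.

```python
def parse_timers_basic(line):
    """Parse a ripd 'timers basic UPDATE TIMEOUT GARBAGE' line into seconds."""
    parts = line.split()
    if parts[:2] != ["timers", "basic"] or len(parts) != 5:
        raise ValueError(f"not a 'timers basic' line: {line!r}")
    update, timeout, garbage = (int(p) for p in parts[2:])
    return update, timeout, garbage


def age_to_seconds(age):
    """Convert an hh:mm:ss age from 'show ip route' into seconds."""
    hours, minutes, seconds = (int(p) for p in age.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def outlived_timers(age, timers_line):
    """True if the route age exceeds timeout + garbage-collection time."""
    _update, timeout, garbage = parse_timers_basic(timers_line)
    return age_to_seconds(age) > timeout + garbage


timers = "timers basic 15 30 30"          # from the ripd.conf in this report
print(age_to_seconds("06:53:00"))          # 24780
print(outlived_timers("06:53:00", timers))  # True: far beyond 30 + 30 seconds
print(outlived_timers("00:00:15", timers))  # False: still inside the window
```

With these values, the inactive default route in the first report (age 06:53:00) is several hundred times older than the 60 seconds after which RIP would normally be expected to have dropped it.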

Very odd behavior.

lucize commented 4 years ago

Can you reproduce it? I think it is a Zebra issue. If you can, try the 4.0.1 branch and revert as in https://github.com/FRRouting/frr/issues/5159#issuecomment-542542213. I haven't had time to do the revert and make it work for 5, 6, and 7.

qlyoung commented 4 years ago

@seanfulton can you possibly try to recreate this on a later version of CentOS? We don't really do regression testing for 6 anymore given that it's more or less EOL at this point.

seanfulton commented 4 years ago

I can confirm this is happening on CentOS 7 with FRR 7.2. Exact same behavior:

seanfulton commented 4 years ago

R 0.0.0.0/0 [120/2] via 10.10.1.254 inactive, 01:30:53
R> 10.0.0.9/32 [120/2] via 10.10.2.2, bond1, 00:25:34
R> 10.0.0.10/32 [120/2] via 10.10.2.2, bond1, 00:25:34
R> 10.0.3.0/24 [120/2] via 10.10.2.34, bond1, 04:08:13
C> 10.10.1.0/24 is directly connected, bond0, 04:48:08
R> 10.10.1.254/32 [120/2] via 10.10.2.1, bond1, 04:47:55
C> 10.10.2.0/24 is directly connected, bond1, 04:48:08
R> 10.10.2.254/32 [120/2] via 10.10.1.1, bond0, 00:05:22
R> 10.10.4.1/32 [120/2] via 10.10.1.26, bond0, 04:47:57
R> 10.10.4.2/32 [120/2] via 10.10.1.27, bond0, 04:48:06
R> 10.10.4.3/32 [120/2] via 10.10.1.26, bond0, 04:47:57
R> 10.10.4.4/32 [120/2] via 10.10.1.26, bond0, 04:47:57
R> 10.10.4.5/32 [120/2] via 10.10.1.27, bond0, 04:48:06
K> 10.10.4.7/32 [0/0] is directly connected, venet0, 04:48:08
K> 10.10.4.8/32 [0/0] is directly connected, venet0, 04:48:08
R> 10.10.4.9/32 [120/2] via 10.10.1.19, bond0, 01:45:23
R> 10.10.4.11/32 [120/2] via 10.10.1.31, bond0, 04:48:06
R> 10.10.4.12/32 [120/2] via 10.10.1.4, bond0, 04:48:06
R> 10.10.4.13/32 [120/2] via 10.10.1.30, bond0, 04:47:55
K> 10.10.4.14/32 [0/0] is directly connected, venet0, 04:48:08
R> 10.10.4.15/32 [120/2] via 10.10.1.5, bond0, 04:48:06
R> 10.10.4.16/32 [120/2] via 10.10.1.6, bond0, 04:48:06
R> 10.10.4.17/32 [120/2] via 10.10.2.35, bond1, 03:36:30
R> 10.10.4.20/32 [120/2] via 10.10.1.26, bond0, 04:47:57
K> 10.10.4.21/32 [0/0] is directly connected, venet0, 04:48:08

seanfulton commented 4 years ago

What do you want me to do here? This is becoming very problematic for us. It's happening on CentOS 6, CentOS 7, and Ubuntu 18.04 with the 7.2 versions.

seanfulton commented 4 years ago

Hey guys, this is a serious issue. I'm reverting all of our nodes back to Quagga until someone figures this out. Too risky to continue in production with this. Happy to test anything any time, but this is not getting me where I need to be. This is still a problem.

sean

rzalamena commented 11 months ago

Seems related to #13561