FRRouting / frr

The FRRouting Protocol Suite
https://frrouting.org/
Other
3.35k stars 1.25k forks source link

Routes moved from VRF stanza to default VRF #13604

Open m1waterman88 opened 1 year ago

m1waterman88 commented 1 year ago

Describe the bug

I have been -- and am still trying to -- resolve this myself. In case I can't...

When adding VRFs with ip routes, FRR sometimes moves a route outside of the VRF stanza and into the default VRF. When this happens, typically vrf-policy and vnc defaults are added at the end of the first router bgp stanza just after the last address-family block despite not being in the config.

To show the evolution of the VRF stanzas, here are some debug lines from /var/log/frr/frr-reload.log. I cut the logs down to just show the VRF stanzas.

First we see the /etc/frr/frr.conf file parsed, showing complete and proper contexts.

2023-05-26 02:11:14,666  INFO: Loading Config object from file /etc/frr/frr.conf
2023-05-26 02:11:14,758 DEBUG: LINE vrf vrf2000004                                    : enter context ['vrf vrf2000004']                                
2023-05-26 02:11:14,758 DEBUG: LINE vni 2000004                                       : add to current context ['vrf vrf2000004']                                
2023-05-26 02:11:14,758 DEBUG: LINE ip route 10.6.96.4/32 vpcnode331                  : add to current context ['vrf vrf2000004']                                
2023-05-26 02:11:14,758 DEBUG: LINE ip route 10.6.96.3/32 vpcnode338                  : add to current context ['vrf vrf2000004']                                
2023-05-26 02:11:14,758 DEBUG: LINE exit-vrf                                          : exit context ['vrf vrf2000004']                                        
2023-05-26 02:11:14,759 DEBUG: LINE vrf vrf2000005                                    : enter context ['vrf vrf2000005']                                
2023-05-26 02:11:14,759 DEBUG: LINE vni 2000005                                       : add to current context ['vrf vrf2000005']                                
2023-05-26 02:11:14,759 DEBUG: LINE ip route 10.6.112.4/32 vpcnode332                 : add to current context ['vrf vrf2000005']                                
2023-05-26 02:11:14,759 DEBUG: LINE ip route 10.6.112.3/32 vpcnode348                 : add to current context ['vrf vrf2000005']                                
2023-05-26 02:11:14,759 DEBUG: LINE exit-vrf                                          : exit context ['vrf vrf2000005']

Next is the new FRR conf with the same

2023-05-26 02:11:14,818 DEBUG: New Frr Config
vrf vrf2000004
vni 2000004
ip route 10.6.96.4/32 vpcnode331
ip route 10.6.96.3/32 vpcnode338
exit-vrf
end
!
vrf vrf2000005
vni 2000005
ip route 10.6.112.4/32 vpcnode332
ip route 10.6.112.3/32 vpcnode348
exit-vrf
end
!

Load from vtysh show running where we're first missing ip route 10.6.112.3/32 vpcnode348.

2023-05-26 02:11:14,819  INFO: Loading Config object from vtysh show running
2023-05-26 02:11:14,904 DEBUG: LINE vrf vrf2000004                                    : enter context ['vrf vrf2000004']                                
2023-05-26 02:11:14,904 DEBUG: LINE vni 2000004                                       : add to current context ['vrf vrf2000004']                                
2023-05-26 02:11:14,904 DEBUG: LINE ip route 10.6.96.3/32 vpcnode338                  : add to current context ['vrf vrf2000004']                                
2023-05-26 02:11:14,904 DEBUG: LINE ip route 10.6.96.4/32 vpcnode331                  : add to current context ['vrf vrf2000004']                                
2023-05-26 02:11:14,904 DEBUG: LINE exit-vrf                                          : exit context ['vrf vrf2000004']                                
2023-05-26 02:11:14,904 DEBUG: LINE vrf vrf2000005                                    : enter context ['vrf vrf2000005']                                
2023-05-26 02:11:14,904 DEBUG: LINE vni 2000005                                       : add to current context ['vrf vrf2000005']                                
2023-05-26 02:11:14,904 DEBUG: LINE ip route 10.6.112.4/32 vpcnode332                 : add to current context ['vrf vrf2000005']                                
2023-05-26 02:11:14,905 DEBUG: LINE exit-vrf                                          : exit context ['vrf vrf2000005']

The first pass (i.e., pass 0) with the same route missing

2023-05-26 02:11:14,915 DEBUG: Running Frr Config (Pass #0)
vrf vrf2000004
vni 2000004
ip route 10.6.96.3/32 vpcnode338
ip route 10.6.96.4/32 vpcnode331
exit-vrf
end
!
vrf vrf2000005
vni 2000005
ip route 10.6.112.4/32 vpcnode332
exit-vrf
end
!

Data passed off to a temp file for the second pass -- the missing route!

2023-05-26 02:11:15,039  INFO: /var/run/frr/reload-KQJCRF.txt content
['vrf vrf2000005\n ip route 10.6.112.3/32 vpcnode348\n',
 'vrf vrf2000005\n exit-vrf\n'']

Load from vtysh show running again where we now see the route in its own context. Additionally, vrf-policy and vnc defaults are defined in the first router bgp stanza after the last address-family block.

2023-05-26 02:11:15,113  INFO: Loading Config object from vtysh show running
2023-05-26 02:11:15,199 DEBUG: LINE ip route 10.6.112.3/32 vpcnode348                 : single-line context
2023-05-26 02:11:15,199 DEBUG: LINE vrf vrf2000004                                    : enter context ['vrf vrf2000004']                                
2023-05-26 02:11:15,199 DEBUG: LINE vni 2000004                                       : add to current context ['vrf vrf2000004']                                
2023-05-26 02:11:15,199 DEBUG: LINE ip route 10.6.96.3/32 vpcnode338                  : add to current context ['vrf vrf2000004']                                
2023-05-26 02:11:15,199 DEBUG: LINE ip route 10.6.96.4/32 vpcnode331                  : add to current context ['vrf vrf2000004']                                
2023-05-26 02:11:15,199 DEBUG: LINE exit-vrf                                          : exit context ['vrf vrf2000004']                                
2023-05-26 02:11:15,199 DEBUG: LINE vrf vrf2000005                                    : enter context ['vrf vrf2000005']                                
2023-05-26 02:11:15,200 DEBUG: LINE vni 2000005                                       : add to current context ['vrf vrf2000005']                                
2023-05-26 02:11:15,200 DEBUG: LINE ip route 10.6.112.4/32 vpcnode332                 : add to current context ['vrf vrf2000005']                                
2023-05-26 02:11:15,200 DEBUG: LINE exit-vrf                                          : exit context ['vrf vrf2000005']
...
2023-05-26 02:11:15,202 DEBUG: LINE vrf-policy vrf2000005                             : enter context ['router bgp REDACTED', 'vrf-policy vrf2000005']     
2023-05-26 02:11:15,203 DEBUG: LINE exit-vrf-policy                                   : exit context ['router bgp REDACTED', 'vrf-policy vrf2000005']     
2023-05-26 02:11:15,203 DEBUG: LINE vnc defaults                                      : enter context ['router bgp REDACTED', 'vnc defaults']              
2023-05-26 02:11:15,203 DEBUG: LINE exit-vnc                                          : exit context ['router bgp REDACTED', 'vnc defaults']

The second pass (i.e., pass 1) with the route in the default VRF

ip route 10.6.112.3/32 vpcnode348
!
vrf vrf2000004
vni 2000004
ip route 10.6.96.3/32 vpcnode338
ip route 10.6.96.4/32 vpcnode331
exit-vrf
end
!
vrf vrf2000005
vni 2000005
ip route 10.6.112.4/32 vpcnode332
exit-vrf
end
!

Before continuing, note I added some of my own debugging and found the ip route command is hitting this condition: if x == 1 and ctx_keys[0].startswith("no "):. I'm not sure why, but ctx_keys[0] is showing as no ip route 10.6.112.3/32 vpcnode348 there. That leads us here, where more data is passed to a new file, including the route. (Also note that everything in the array was present twice this time for some reason.)

2023-05-26 02:11:15,211  INFO: /var/run/frr/reload-771H43.txt content
['vrf vrf2000005\n ip route 10.6.112.3/32 vpcnode348\n',
 'vrf vrf2000005\n exit-vrf\n']

To Reproduce

I've been testing a new product with 1-2 VRFs, each with 1-2 interfaces attached. Reproduction can be oddly sporadic, but I find it happens most often when I have two VRFs defined and each has two interfaces attached. When a new interface is added to a VRF, /etc/frr/frr.conf is updated and a reload is performed: systemctl reload frr. When I check the running config with vtysh -c "show run", I can see the route issue. When this doesn't happen, all network activity is as expected; when it does, of course, there are problems.

Expected behavior

When reloading FRR, ip routes should stay in the VRF stanza and not move to the default VRF, and we shouldn't add VRF information into router bgp stanzas.

Screenshots

Versions

Additional context

Likely unrelated, but in my review, I found prior_ctx_key is a local variable in check_for_exit_vrf which is used before it's defined. I initialized it before the loop to None, but it didn't really seem to make a difference for my issue.

m1waterman88 commented 1 year ago

Is there anything needed to help troubleshoot this?

m1waterman88 commented 11 months ago

I originally closed this since I thought it was resolved by something on our side, but I seemingly just found a way to make it more stable. I haven't been able to get a reply in the FRRouting Slack server, so I reopened this issue. Can someone review, please? I can provide more information as needed.

From the FRR Slack message I sent:

We've been having sporadic trouble with FRR reloads since we released a product months ago. Sometimes we don't see it much and other times we see it quite a bit. We've reviewed configs, tested various versions of FRR (including the most recent release, 9.0.1), and we still have reload failures. To be clear, any time we start/restart FRR, the configs appear as expected and all is well, but when we try to reload to keep the service up, many times we get misconfigurations. Usually we see some static routes in the default VRF instead of the particular VRF they belong to or vrf-policy and vnc defaults lines added to a router bgp <asn> stanza despite not using those in any of our configs. The misconfigurations may keep the service up with network connectivity problems after a reload, but a subsequent reload will cause FRR to restart, at which point the configs are correct at the cost of some downtime. Evidence leads us to believe the frr-reload.py script is the problem. Is there anyone who could assist in tracking down the issue?