Open m1waterman88 opened 1 year ago
Is there anything needed to help troubleshoot this?
I originally closed this since I thought it was resolved by something on our side, but I seemingly just found a way to make it more stable. I haven't been able to get a reply in the FRRouting Slack server, so I reopened this issue. Can someone review, please? I can provide more information as needed.
From the FRR Slack message I sent:
We've been having sporadic trouble with FRR reloads since we released a product months ago. Sometimes we don't see it much and other times we see it quite a bit. We've reviewed configs, tested various versions of FRR (including the most recent release, 9.0.1), and we still have reload failures. To be clear, any time we start/restart FRR, the configs appear as expected and all is well, but when we try to reload to keep the service up, many times we get misconfigurations. Usually we see some static routes in the default VRF instead of the particular VRF they belong to, or `vrf-policy` and `vnc defaults` lines added to a `router bgp <asn>` stanza despite not using those in any of our configs. The misconfigurations may keep the service up with network connectivity problems after a reload, but a subsequent reload will cause FRR to restart, at which point the configs are correct at the cost of some downtime. Evidence leads us to believe the `frr-reload.py` script is the problem. Is there anyone who could assist in tracking down the issue?
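The two symptoms described above can be checked mechanically after each reload. The following is a minimal sketch, not part of FRR: the function name `find_reload_anomalies` and the sample config are hypothetical, and it assumes that commands inside a `vrf` or `router bgp` stanza are indented in the running config while default-VRF commands are not.

```python
# Hypothetical helper for spotting the two reload symptoms in this report.
# Assumption: stanza members are indented, top-level commands are not.

SUSPECT_BGP_LINES = ("vrf-policy", "vnc defaults")


def find_reload_anomalies(running_config):
    """Return (default_vrf_routes, unexpected_bgp_lines) found in the text."""
    default_vrf_routes = []
    unexpected_bgp_lines = []
    in_bgp = False

    for raw in running_config.splitlines():
        line = raw.strip()
        if not line or line == "!":
            continue  # skip blank lines and separators

        if raw[0].isspace():
            # Indented line: belongs to the current stanza.
            if in_bgp and line.startswith(SUSPECT_BGP_LINES):
                unexpected_bgp_lines.append(line)
            continue

        # Non-indented line: either opens a stanza or is a top-level command.
        in_bgp = line.startswith("router bgp")
        if line.startswith("ip route "):
            # A static route at the top level lives in the default VRF.
            default_vrf_routes.append(line)

    return default_vrf_routes, unexpected_bgp_lines


# Illustrative input only; the ASN and the second route are made up.
SAMPLE_RUNNING_CONFIG = """\
ip route 10.6.112.3/32 vpcnode348
!
vrf vpcnode348
 ip route 10.6.112.4/32 vpcnode348
exit-vrf
!
router bgp 65000
 address-family ipv4 unicast
 exit-address-family
 !
 vrf-policy
 vnc defaults
"""

routes, bgp_lines = find_reload_anomalies(SAMPLE_RUNNING_CONFIG)
print("routes in default VRF:", routes)
print("unexpected BGP lines:", bgp_lines)
```

Feeding it the output of `vtysh -c "show run"` after each `systemctl reload frr` would flag a bad reload without manual inspection. Since our configs never contain top-level static routes, any hit in the first list counts as a failure for us; a setup that legitimately uses default-VRF routes would need a stricter comparison.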
Describe the bug
I have been, and still am, trying to resolve this myself. In case I can't...
When adding VRFs with `ip route`s, FRR sometimes moves a route outside of the VRF stanza and into the default VRF. When this happens, typically `vrf-policy` and `vnc defaults` are added at the end of the first `router bgp` stanza just after the last `address-family` block despite not being in the config.
To show the evolution of the VRF stanzas, here are some debug lines from `/var/log/frr/frr-reload.log`. I cut the logs down to just show the VRF stanzas.
First we see the `/etc/frr/frr.conf` file parsed, showing complete and proper contexts.
Next is the new FRR conf with the same
Load from `vtysh show running`, where we're first missing `ip route 10.6.112.3/32 vpcnode348`.
The first pass (i.e., pass 0) with the same route missing
Data passed off to a temp file for the second pass -- the missing route!
Load from `vtysh show running` again, where we now see the route in its own context. Additionally, `vrf-policy` and `vnc defaults` are defined in the first `router bgp` stanza after the last `address-family` block.
The second pass (i.e., pass 1) with the route in the default VRF
Before continuing, note I added some of my own debugging and found the `ip route` command is hitting this condition: `if x == 1 and ctx_keys[0].startswith("no "):`. I'm not sure why, but `ctx_keys[0]` is showing as `no ip route 10.6.112.3/32 vpcnode348` there. That leads us here, where more data is passed to a new file, including the route. (Also note that everything in the array was present twice this time for some reason.)
To Reproduce
I've been testing a new product with 1-2 VRFs, each with 1-2 interfaces attached. Reproduction can be oddly sporadic, but I find it happens most often when I have two VRFs defined and each has two interfaces attached. When a new interface is added to a VRF, `/etc/frr/frr.conf` is updated and a reload is performed: `systemctl reload frr`. When I check the running config with `vtysh -c "show run"`, I can see the route issue. When this doesn't happen, all network activity is as expected; when it does, of course, there are problems.
Expected behavior
When reloading FRR, `ip route`s should stay in the VRF stanza and not move to the default VRF, and we shouldn't add VRF information into `router bgp` stanzas.
Screenshots
Versions
`frr-reload.py` file)
Additional context
Likely unrelated, but in my review, I found `prior_ctx_key` is a local variable in `check_for_exit_vrf` which is used before it's defined. I initialized it to `None` before the loop, but it didn't really seem to make a difference for my issue.
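For anyone unfamiliar with the pattern, here is a generic illustration of a loop-local variable that is read before its first assignment, and of the fix of initializing it before the loop. This is not the code from `frr-reload.py`; the function names and data are hypothetical and only mirror the shape of the problem.

```python
# Generic illustration only; not the body of check_for_exit_vrf.

def item_before_buggy(items, target):
    """Return the item preceding `target` (buggy version)."""
    for item in items:
        if item == target:
            # If `target` is the very first item, `previous` has never been
            # assigned, and Python raises UnboundLocalError here.
            return previous
        previous = item
    return None


def item_before_fixed(items, target):
    """Return the item preceding `target`, or None if there is none."""
    previous = None  # initialized before the loop, so the first read is safe
    for item in items:
        if item == target:
            return previous
        previous = item
    return None


keys = ["vrf red", "vrf blue"]  # hypothetical context keys

try:
    item_before_buggy(keys, "vrf red")
except UnboundLocalError as exc:
    print("buggy version failed:", type(exc).__name__)

print("fixed version returned:", item_before_fixed(keys, "vrf red"))
```

The buggy form only fails when the matching branch is reached on the first iteration, so it can sit unnoticed for a long time. That may be why initializing the variable made no visible difference here.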