When a vrf is deleted from the kernel before it is removed from the FRR config, zebra gets to delete the vrf and associated state.
It does so by sending a request to delete the l3vni associated with the vrf, followed by a request to delete the vrf itself.
The zebra client communication is asynchronous, and in about 1 in 5 cases the bgp client processes the two requests in the opposite order:
2023/10/06 06:22:18 BGP: [VP18N-HB5R6] VRF testVRF1001(766) is to be deleted.
2023/10/06 06:22:18 BGP: [RH4KQ-X3CYT] VRF testVRF1001(766) is to be disabled.
2023/10/06 06:22:18 BGP: [X8ZE0-9TS5H] VRF disable testVRF1001 id 766
2023/10/06 06:22:18 BGP: [X67AQ-923PR] Deregistering VRF 766
2023/10/06 06:22:18 BGP: [K52W0-YZ4T8] VRF Deletion: testVRF1001(4294967295)
... and a bit later:
2023/10/06 06:22:18 BGP: [MRXGD-9MHNX] DJERNAES: process L3VNI 1001 DEL
2023/10/06 06:22:18 BGP: [NCEPE-BKB1G][EC 33554467] Cannot process L3VNI 1001 Del - Could not find BGP instance
When the bgp vrf config is removed later, the removal fails on the sanity check that verifies the l3vni has been removed.
The solution is to make bgp clean up the l3vni when a bgp instance is going down.
The fix:
The fix is to add a function in bgp_evpn.c that is responsible for deleting the local l3vni, if needed, and to call that function from bgp_instance_down().
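
The ordering problem and the effect of the cleanup can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the FRR code: struct sim_bgp and every sim_* function are hypothetical names invented for illustration, and the model reduces the bgp instance to a "still up" flag plus an l3vni number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a bgp instance bound to a vrf (not FRR's struct bgp). */
struct sim_bgp {
	bool up;        /* instance can still be found via its vrf */
	uint32_t l3vni; /* 0 means no l3vni is attached */
};

/* Models the l3vni delete request: it can only be handled while the
 * instance can still be looked up. */
static bool sim_l3vni_del(struct sim_bgp *bgp)
{
	assert(bgp != NULL);
	if (!bgp->up)
		return false; /* corresponds to "Could not find BGP instance" */
	bgp->l3vni = 0;
	return true;
}

/* Models the instance going down. With cleanup enabled, a stale l3vni
 * is removed here (assumption: this mirrors the idea of the fix). */
static void sim_instance_down(struct sim_bgp *bgp, bool cleanup)
{
	assert(bgp != NULL);
	if (cleanup && bgp->l3vni != 0)
		bgp->l3vni = 0;
	bgp->up = false;
}

/* Models the later config removal: refused while an l3vni is attached. */
static bool sim_config_remove(const struct sim_bgp *bgp)
{
	assert(bgp != NULL);
	return bgp->l3vni == 0;
}

/* Runs one scenario and reports whether the config removal succeeds.
 * reordered: the vrf delete is processed before the l3vni delete.
 * cleanup:   the instance-down path removes a stale l3vni. */
bool sim_run(bool reordered, bool cleanup)
{
	struct sim_bgp bgp = { .up = true, .l3vni = 1001 };

	if (reordered) {
		sim_instance_down(&bgp, cleanup);
		(void)sim_l3vni_del(&bgp); /* arrives too late, is rejected */
	} else {
		(void)sim_l3vni_del(&bgp);
		sim_instance_down(&bgp, cleanup);
	}
	return sim_config_remove(&bgp);
}
```

In this model, sim_run(true, false) is the only combination where the config removal is refused; with cleanup enabled the outcome no longer depends on the order in which the two requests are processed.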
Testing:
Currently we have issues reproducing this bug. Previously we had a test, run in container lab, that removes the vrf on the host before removing the vrf and the bgp config from frr. Running this test in a loop triggered the problem in 18 of 100 runs. After the fix it did not fail.
To verify the fix, a log message (which is no longer in the code) was emitted whenever we had a stale l3vni and needed to call bgp_evpn_local_l3vni_del() to do the cleanup. This was hit 20 times in 100 test runs.
This is an automatic backport of pull request #16059 done by Mergify.