FRRouting / frr

The FRRouting Protocol Suite
https://frrouting.org/

bgpd: fixed failing to remove VRF if there is a stale l3vni (backport #16059) #16255

Closed mergify[bot] closed 3 weeks ago

mergify[bot] commented 4 weeks ago

Problem statement:

When a vrf is deleted from the kernel before it is removed from the FRR config, zebra gets to delete the vrf and associated state. It does so by sending a request to delete the l3 vni associated with the vrf, followed by a request to delete the vrf itself.

2023/10/06 06:22:18 ZEBRA: [JAESH-BABB8] Send L3_VNI_DEL 1001 VRF testVRF1001 to bgp
2023/10/06 06:22:18 ZEBRA: [XC3P3-1DG4D] MESSAGE: ZEBRA_VRF_DELETE testVRF1001

The zebra client communication is asynchronous, and in about 1 in 5 cases the bgp client processes the two requests in the opposite order.

2023/10/06 06:22:18 BGP: [VP18N-HB5R6] VRF testVRF1001(766) is to be deleted.
2023/10/06 06:22:18 BGP: [RH4KQ-X3CYT] VRF testVRF1001(766) is to be disabled.
2023/10/06 06:22:18 BGP: [X8ZE0-9TS5H] VRF disable testVRF1001 id 766
2023/10/06 06:22:18 BGP: [X67AQ-923PR] Deregistering VRF 766
2023/10/06 06:22:18 BGP: [K52W0-YZ4T8] VRF Deletion: testVRF1001(4294967295)

... and a bit later:

2023/10/06 06:22:18 BGP: [MRXGD-9MHNX] DJERNAES: process L3VNI 1001 DEL
2023/10/06 06:22:18 BGP: [NCEPE-BKB1G][EC 33554467] Cannot process L3VNI 1001 Del - Could not find BGP instance

When the bgp vrf config is removed later, it fails the sanity check that verifies the l3vni has already been removed:

    if (bgp->l3vni) {
        vty_out(vty, "%% Please unconfigure l3vni %u\n",
            bgp->l3vni);
        return CMD_WARNING_CONFIG_FAILED;
    }

Solution:

The solution is to make bgp clean up the l3vni when a bgp instance is going down.

The fix:

The fix is to add a function in bgp_evpn.c that is responsible for deleting the local vni, if needed, and to call that function from bgp_instance_down().
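The reordering described in the problem statement and the effect of the cleanup can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the bgpd code: `struct toy_bgp` and all `toy_*` functions are hypothetical names invented for illustration, and the model reduces the instance to the two fields that matter here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for a per-VRF bgp instance. */
struct toy_bgp {
	uint32_t l3vni;  /* 0 means no l3vni is associated */
	bool vrf_active; /* false once the VRF delete has been processed */
};

/*
 * Models handling of an L3VNI delete request. It only succeeds while
 * the instance can still be looked up through its VRF; otherwise it
 * corresponds to the "Could not find BGP instance" log line.
 */
bool toy_l3vni_del(struct toy_bgp *bgp, uint32_t vni)
{
	assert(bgp != NULL);
	if (!bgp->vrf_active || bgp->l3vni != vni)
		return false;
	bgp->l3vni = 0;
	return true;
}

/*
 * Models the instance going down when the VRF delete is processed.
 * With cleanup_stale_l3vni set, a leftover l3vni is dropped here,
 * which is the idea behind the fix (assumed shape, not the real code).
 */
void toy_instance_down(struct toy_bgp *bgp, bool cleanup_stale_l3vni)
{
	assert(bgp != NULL);
	if (cleanup_stale_l3vni && bgp->l3vni != 0)
		bgp->l3vni = 0;
	bgp->vrf_active = false;
}

/* Models the sanity check: config removal is refused while l3vni is set. */
bool toy_can_remove_config(const struct toy_bgp *bgp)
{
	assert(bgp != NULL);
	return bgp->l3vni == 0;
}

/*
 * Replays the reordered sequence (VRF delete first, L3VNI delete second)
 * and reports whether the vrf config could be removed afterwards.
 */
bool toy_replay_reordered(bool cleanup_stale_l3vni)
{
	struct toy_bgp bgp = { .l3vni = 1001, .vrf_active = true };

	toy_instance_down(&bgp, cleanup_stale_l3vni);
	(void)toy_l3vni_del(&bgp, 1001); /* arrives too late, is rejected */
	return toy_can_remove_config(&bgp);
}
```

In this model, `toy_replay_reordered(false)` leaves the l3vni stale, so the config removal check fails, while `toy_replay_reordered(true)` clears it when the instance goes down, so the check passes regardless of message order.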

Testing:

We currently have trouble reproducing this bug. Previously we had a test, run in container lab, that removed the vrf on the host before removing the vrf and the bgp config from frr. Running this test in a loop triggered the problem in 18 of 100 runs. After the fix, it did not fail.

To verify the fix, a log message (no longer in the code) was emitted whenever we had a stale l3vni and needed to call bgp_evpn_local_l3vni_del() to do the cleanup. This was hit 20 times in 100 test runs.


This is an automatic backport of pull request #16059 done by Mergify.