FRRouting / frr

The FRRouting Protocol Suite
https://frrouting.org/
Other
3.12k stars 1.2k forks source link

zebra: evpn: not coerce VTEP IP to IPv4 in nh_list #16341

Open crosser opened 4 days ago

crosser commented 4 days ago

In L3 BGP-EVPN, if there are both IPv4 and IPv6 routes in the VPN, zebra maintains two instances of struct zebra_neigh object: one with IPv4 address of the nexthop, and another with IPv6 address that is an IPv4 mapped to IPv6, but only one intance of struct zebra_mac object, that contains a list of nexthop addresses that use this mac.

The code in zebra_vxlan module uses the fact that the list is empty as the indication that the zebra_mac object is unused, and needs to be dropped. However, preexisting code used nexthop address converted to IPv4 notation for the element of this list. As a result, when two zebra_neigh objects, one IPv4 and one IPv6-mapped-IPv4 were linked to the zebra_mac object, only one element was added to the list. Consequently, when one of the two zebra_neigh objects was dropped, the only element in the list was removed, making it empty, and zebra_mac object was dropped, and neigbrour cache elements uninstalled from the kernel.

As a result, after the last route in one family was removed from a remote vtep, all remaining routes in the other family became unreachable, because RMAC of the vtep was removed.

This commit makes zebra_mac use uncoerced IP address of the zebra_neigh object for the entries in the nh_list. This way, zebra_mac object no longer loses track of zebra_neigh objects that need it.

Bug-ID: 16340

crosser commented 3 days ago

I don't understand what does this failure mean

 Jobs Failure: 1/49

    TopoTests Ubuntu 22.04 amd64 Part 9 -> https://ci1.netdef.org/browse/FRR-PULLREQ3-TOPO9U2204AMD64-4041
    🚫 ospf_metric_propagation.test_ospf_metric_propagation test_link_1_4_down

It is just on one architecture? Is it related to the change?...

ton31337 commented 3 days ago

I don't understand what does this failure mean

 Jobs Failure: 1/49

    TopoTests Ubuntu 22.04 amd64 Part 9 -> https://ci1.netdef.org/browse/FRR-PULLREQ3-TOPO9U2204AMD64-4041
    🚫 ospf_metric_propagation.test_ospf_metric_propagation test_link_1_4_down

It is just on one architecture? Is it related to the change?...

Ignore it, I rerun failed tests only.

ton31337 commented 3 days ago

But still, could you fix frrbot check (styling)?

crosser commented 3 days ago

But still, could you fix frrbot check (styling)?

Done.

I did not do it intially because it asked to change some adjacent code that my change did not touch. Did now.

crosser commented 2 days ago

LGTM, but IMO this should also be covered by a topotest, what do you think?

I agree that it would be useful, but as I am not acquainted with the testing framework that this project is using, I am not sure that I will be able to do it in reasonable time...

leonshaw commented 1 day ago

Is this the same case encountered in #14808?

ton31337 commented 15 hours ago

Is this the same case encountered in #14808?

Looks like, and there is a topotest already. Anyway, it would be great if @chiragshah6 could help here reviewing.

crosser commented 11 hours ago

Is this the same case encountered in #14808?

Looks like so! And such a shame that I did not find it and had to redo all the work myself.

But even bigger shame that your PR was not merged back then. Would have saved us quite some trouble that we had in production...

edit: @leonshaw , do you think that my change looks a bit cleaner? But, I am no judge, you guys decide which to take! The main thing is to have the bug fixed.

leonshaw commented 3 hours ago

edit: @leonshaw , do you think that my change looks a bit cleaner?

I think you are correct. It's cleaner to have both IPv4/IPv6 nexthops explicitly. BTW, you can cherry-pick the topotest part if you like.