FRRouting / frr

The FRRouting Protocol Suite
https://frrouting.org/

frr 5.0: evpn and vrf not working anymore #2460

Closed aderumier closed 3 years ago

aderumier commented 6 years ago

Hi, I can't exchange EVPN routes in a VRF anymore since 5.0. It was working fine last month with 4.1-dev (I don't remember exactly when).

with this simple config:

vrf vrf1
 vni 4001
!
router bgp 1234
 bgp router-id 10.59.100.231
 no bgp default ipv4-unicast
 coalesce-time 1000
 neighbor 10.59.100.232 remote-as 1234
 !
 address-family l2vpn evpn
  neighbor 10.59.100.232 activate
  advertise-all-vni
 exit-address-family
!
router bgp 1234 vrf vrf1
!
 bgp router-id 10.59.100.231
 !
 address-family ipv4 unicast
  redistribute connected
 exit-address-family
 !
 address-family l2vpn evpn
  advertise ipv4 unicast
 exit-address-family
!

frr 4.1-dev (around last month)

# show bgp l2vpn evpn summary
BGP router identifier 10.59.100.231, local AS number 1234 vrf-id 0
BGP table version 0
RIB entries 11, using 1672 bytes of memory
Peers 1, using 20 KiB of memory

Neighbor        V         AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd
10.59.100.232   4       1234      14      14        0    0    0 00:06:19            8

Total number of neighbors 1

frr5.0 (stable branch or 5.0 tag)

# show bgp l2vpn evpn summary
BGP router identifier 10.59.100.231, local AS number 1234 vrf-id 0
BGP table version 0
RIB entries 11, using 1672 bytes of memory
Peers 1, using 20 KiB of memory

Neighbor        V         AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd
10.59.100.232   4       1234      10      10        0    0    0 00:00:14       Active

Total number of neighbors 1

"show bgp evpn route" only display local routes, but don't see routes from neighbor.

This only happens inside a VRF; without a VRF it works fine.

vjardin commented 6 years ago

Alexandre: can you enable BGP debug logs and report them? Can you run the same commands without summary?

Any other information is welcome too: zebra, kernel, etc., even if it does not really matter here.

aderumier commented 6 years ago

Here are the debug logs:

frr 4.1.log: https://gist.github.com/aderumier/f916b21c803f0de5c0283ed0d3375b56

frr 5.0.log: https://gist.github.com/aderumier/dbf3d0760fccb9b661057869086ab739

5.0

# show bgp l2vpn evpn
Route Distinguisher: ip 10.59.100.231:2

*> [5]:[0]:[24]:[172.16.0.0]
                    10.59.100.231            0         32768 ?
*> [5]:[0]:[24]:[192.168.0.0]
                    10.59.100.231            0         32768 ?
*> [5]:[0]:[24]:[192.168.1.0]
                    10.59.100.231            0         32768 ?
Route Distinguisher: ip 10.59.100.231:3

*> [2]:[0]:[48]:[b2:80:29:35:0e:1c]
                    10.59.100.231                      32768 i
*> [2]:[0]:[48]:[b2:80:29:35:0e:1c]:[32]:[192.168.0.10]
                    10.59.100.231                      32768 i
*> [2]:[0]:[48]:[b2:80:29:35:0e:1c]:[128]:[fe80::b080:29ff:fe35:e1c]
                    10.59.100.231                      32768 i
*> [3]:[0]:[32]:[10.59.100.231]
                    10.59.100.231                      32768 i
Route Distinguisher: ip 10.59.100.231:4

*> [3]:[0]:[32]:[10.59.100.231]
                    10.59.100.231                      32768 i

Displayed 8 out of 8 total prefixes

4.1

# show bgp l2vpn evpn
Route Distinguisher: ip 10.59.100.231:2

*> [5]:[0]:[24]:[172.16.0.0]
                    10.59.100.231            0         32768 ?
*> [5]:[0]:[24]:[192.168.0.0]
                    10.59.100.231            0         32768 ?
*> [5]:[0]:[24]:[192.168.1.0]
                    10.59.100.231            0         32768 ?
Route Distinguisher: ip 10.59.100.231:3

*> [2]:[0]:[48]:[b2:80:29:35:0e:1c]
                    10.59.100.231                      32768 i
*> [2]:[0]:[48]:[b2:80:29:35:0e:1c]:[32]:[192.168.0.10]
                    10.59.100.231                      32768 i
*> [2]:[0]:[48]:[b2:80:29:35:0e:1c]:[128]:[fe80::b080:29ff:fe35:e1c]
                    10.59.100.231                      32768 i
*> [3]:[0]:[32]:[10.59.100.231]
                    10.59.100.231                      32768 i
Route Distinguisher: ip 10.59.100.231:4

*> [3]:[0]:[32]:[10.59.100.231]
                    10.59.100.231                      32768 i
Route Distinguisher: ip 10.59.100.232:2

*>i[5]:[0]:[24]:[172.16.0.0]
                    10.59.100.232            0    100      0 ?
*>i[5]:[0]:[24]:[192.168.0.0]
                    10.59.100.232            0    100      0 ?
*>i[5]:[0]:[24]:[192.168.1.0]
                    10.59.100.232            0    100      0 ?
Route Distinguisher: ip 10.59.100.232:3

*>i[3]:[0]:[32]:[10.59.100.232]
                    10.59.100.232                 100      0 i
Route Distinguisher: ip 10.59.100.232:4

*>i[2]:[0]:[48]:[b2:66:43:60:b7:50]
                    10.59.100.232                 100      0 i
*>i[2]:[0]:[48]:[b2:66:43:60:b7:50]:[32]:[192.168.1.11]
                    10.59.100.232                 100      0 i
*>i[2]:[0]:[48]:[b2:66:43:60:b7:50]:[128]:[fe80::b066:43ff:fe60:b750]
                    10.59.100.232                 100      0 i
*>i[3]:[0]:[32]:[10.59.100.232]
                    10.59.100.232                 100      0 i

Displayed 16 out of 16 total prefixes

The OS is Debian 9.0 with kernel 4.15.

sysctl tuning:

net.ipv4.tcp_l3mdev_accept=1
net.ipv4.conf.default.rp_filter=0
net.ipv4.conf.all.rp_filter=0
net.ipv4.ip_forward=1
net.ipv6.conf.all.forwarding=1
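When comparing two hosts, it can help to confirm that these values are really in effect. The following is a minimal sketch, assuming the standard mapping of dotted sysctl keys to files under /proc/sys; the helper names `sysctl_path` and `mismatches` are made up for this illustration.

```python
from pathlib import Path

# Values copied from the sysctl tuning listed in the post above.
EXPECTED = {
    "net.ipv4.tcp_l3mdev_accept": "1",
    "net.ipv4.conf.default.rp_filter": "0",
    "net.ipv4.conf.all.rp_filter": "0",
    "net.ipv4.ip_forward": "1",
    "net.ipv6.conf.all.forwarding": "1",
}


def sysctl_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file under /proc/sys.

    Assumes no path component contains a dot itself. That holds for the
    keys above, but not for per-interface keys of VLAN sub-interfaces.
    """
    return Path(root).joinpath(*key.split("."))


def mismatches(expected, root="/proc/sys"):
    """Return {key: (wanted, found)} for each key that differs or is missing."""
    result = {}
    for key, wanted in expected.items():
        path = sysctl_path(key, root)
        found = path.read_text().strip() if path.exists() else None
        if found != wanted:
            result[key] = (wanted, found)
    return result


if __name__ == "__main__":
    for key, (wanted, found) in mismatches(EXPECTED).items():
        print(f"{key}: wanted {wanted}, found {found}")
```

Running it on both hosts and comparing the output rules out a sysctl difference before looking at bgpd itself.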

testing with 2 hosts peering together

host1 : /etc/network/interfaces

auto eno1.100
iface eno1.100
        address  10.59.100.231
        netmask  255.255.255.0
        gateway  10.59.100.1

auto eno2.100
iface eno2.100
        address  172.16.0.1
        netmask  255.255.255.0
        vrf vrf1

auto vmbr2
iface vmbr2
        address  192.168.0.1/24
        bridge_ports vxlan2
        bridge_stp off
        bridge_fd 0
        hwaddress 44:39:39:FF:40:94
        vrf vrf1

auto vxlan3
iface vxlan3 inet manual
        vxlan-id 3
        vxlan-local-tunnelip 10.59.100.231
        bridge-learning off
        bridge-arp-nd-suppress on
        bridge-unicast-flood off
        bridge-multicast-flood off

auto vmbr3
iface vmbr3
        address  192.168.1.1/24
        bridge_ports vxlan3
        bridge_stp off
        bridge_fd 0
        hwaddress 44:39:39:FF:40:94
        vrf vrf1

#interconnect vxlan-vrf l3vni
auto vxlan4001
iface vxlan4001
        vxlan-id 4001
        vxlan-local-tunnelip 10.59.100.231
        bridge-learning off
        bridge-arp-nd-suppress on
        bridge-unicast-flood off
        bridge-multicast-flood off

auto vmbr4001
iface vmbr4001
        bridge_ports vxlan4001
        bridge_stp off
        bridge_fd 0
        hwaddress 44:39:39:FF:40:90
        vrf vrf1

auto vrf1
iface vrf1
    vrf-table auto

host2 : /etc/network/interfaces

auto eno1.100
iface eno1.100
        address  10.59.100.232
        netmask  255.255.255.0
        gateway  10.59.100.1

auto eno2.100
iface eno2.100
        address  172.16.0.2
        netmask  255.255.255.0
        vrf vrf1

auto vxlan2
iface vxlan2 inet manual
        vxlan-id 2
        vxlan-local-tunnelip 10.59.100.232
        bridge-learning off
        bridge-arp-nd-suppress on
        bridge-unicast-flood off
        bridge-multicast-flood off

auto vmbr2
iface vmbr2
        address  192.168.0.1/24
        bridge_ports vxlan2
        bridge_stp off
        bridge_fd 0
        hwaddress 44:39:39:FF:40:94
        vrf vrf1

auto vxlan3
iface vxlan3 inet manual
        vxlan-id 3
        vxlan-local-tunnelip 10.59.100.232
        bridge-learning off
        bridge-arp-nd-suppress on
        bridge-unicast-flood off
        bridge-multicast-flood off

auto vmbr3
iface vmbr3
        address  192.168.1.1/24
        bridge_ports vxlan3
        bridge_stp off
        bridge_fd 0
        hwaddress 44:39:39:FF:40:94
        vrf vrf1

#interconnect vxlan-vrf l3vni
auto vxlan4001
iface vxlan4001
        vxlan-id 4001
        vxlan-local-tunnelip 10.59.100.232
        bridge-learning off
        bridge-arp-nd-suppress on
        bridge-unicast-flood off
        bridge-multicast-flood off

auto vmbr4001
iface vmbr4001
        bridge_ports vxlan4001
        bridge_stp off
        bridge_fd 0
        hwaddress 44:39:39:FF:40:91
        vrf vrf1

auto vrf1
iface vrf1
    vrf-table auto

aderumier commented 6 years ago

I have tested the frr-5.0-dev branch, and it's working fine. I'll try to bisect, but the regression seems to be recent.

aderumier commented 6 years ago

I have found the commit: https://github.com/FRRouting/frr/commit/7e0c80ea1c526903d4b67dabddc9430c3aab8d65

from this pull request: https://github.com/FRRouting/frr/commit/f89270226297ec1f1a8290481d1dc7fb66d71422

Since this commit, it doesn't work anymore.

louberger commented 6 years ago

Please try setting: net.ipv4.tcp_l3mdev_accept=0

Also what kernel rev are you running?

louberger commented 6 years ago

@rwestphal didn't you find that there was a kernel version that didn't work with this change? What happened with that?

aderumier commented 6 years ago

@louberger

net.ipv4.tcp_l3mdev_accept=0 -> doesn't help

I'm using 4.15.17 kernel. (I can test other kernels if you want)

rwestphal commented 6 years ago

@louberger yes, I had the exact same problem in the past week. bgpd is having issues after commit 7e0c80e, but only when using recent Linux kernels (apparently v4.14+).

I made this topology to illustrate the problem: https://gist.github.com/rwestphal/545473123cd967f73dc52872ed37c2dc

Please see the output below:

# vtysh -c "show ip bgp vrf all summary"

Instance Default:

IPv4 Unicast Summary:
BGP router identifier 10.0.0.1, local AS number 1 vrf-id 0
BGP table version 0
RIB entries 0, using 0 bytes of memory
Peers 1, using 21 KiB of memory

Neighbor        V         AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd
10.0.0.2        4          1       0       0        0    0    0    never       Active

Total number of neighbors 1

Instance rt1-RED:

IPv4 Unicast Summary:
BGP router identifier 10.0.1.1, local AS number 1 vrf-id 2
BGP table version 0
RIB entries 0, using 0 bytes of memory
Peers 1, using 21 KiB of memory

Neighbor        V         AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd
10.0.1.2        4          1      20      20        0    0    0 00:17:21            0

Total number of neighbors 1

Instance rt1-BLUE:

IPv4 Unicast Summary:
BGP router identifier 10.0.2.1, local AS number 1 vrf-id 3
BGP table version 0
RIB entries 0, using 0 bytes of memory
Peers 1, using 21 KiB of memory

Neighbor        V         AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd
10.0.2.2        4          1      20      20        0    0    0 00:17:21            0

Total number of neighbors 1
# sysctl net.ipv4.tcp_l3mdev_accept
net.ipv4.tcp_l3mdev_accept = 0

In short, when BGP is enabled in one or more VRFs, the BGP instance running on the default VRF is affected and can't establish a TCP connection to the remote peer anymore.

As we can see below, bgpd opens the expected TCP sockets normally, but the kernel for some reason is sending TCP RSTs after completing the TCP handshake:

# ss -tan4 | grep 179
LISTEN     0      128    *%rt1-BLUE:179                      *:*
LISTEN     0      128    *%rt1-RED:179                      *:*
LISTEN     0      128          *:179                      *:*
SYN-RECV   0      0      10.0.0.1%rt1-BLUE:179                10.0.0.2:42230
ESTAB      0      0      10.0.2.1%rt1-BLUE:52138              10.0.2.2:179
ESTAB      0      0      10.0.1.1%rt1-RED:38158              10.0.1.2:179
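To make the device scoping in this output easier to inspect, here is a minimal sketch that splits the local address column of `ss -tan4` into address, device and port. The names `Sock`, `parse_ss_line` and `by_device` are illustrative and not part of any tool mentioned in this thread, and the parser only handles the IPv4 column layout shown above.

```python
import re
from collections import defaultdict
from typing import NamedTuple, Optional


class Sock(NamedTuple):
    state: str
    local_ip: str
    device: Optional[str]  # name after '%', or None for an unbound socket
    local_port: str
    peer: str


# Local address column, e.g. "10.0.0.1%rt1-BLUE:179", "*%rt1-RED:179" or "*:179".
LOCAL_RE = re.compile(r"^(?P<ip>[^%:]+)(?:%(?P<dev>[^:]+))?:(?P<port>\d+|\*)$")


def parse_ss_line(line):
    """Parse one 'ss -tan4' row; return None for headers or unrecognised rows."""
    fields = line.split()
    if len(fields) < 5:
        return None
    state, _recv_q, _send_q, local, peer = fields[:5]
    match = LOCAL_RE.match(local)
    if match is None:
        return None
    return Sock(state, match["ip"], match["dev"], match["port"], peer)


def by_device(lines):
    """Group parsed sockets by the device they are scoped to (None = unbound)."""
    groups = defaultdict(list)
    for line in lines:
        sock = parse_ss_line(line)
        if sock is not None:
            groups[sock.device].append(sock)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        "LISTEN     0      128    *%rt1-BLUE:179                      *:*",
        "LISTEN     0      128          *:179                      *:*",
        "SYN-RECV   0      0      10.0.0.1%rt1-BLUE:179                10.0.0.2:42230",
    ]
    for device, socks in by_device(sample).items():
        print(device, [s.state for s in socks])
```

Applied to the listing above, the SYN-RECV entry for peer 10.0.0.2 (the neighbor of the default instance in the summary) is grouped under rt1-BLUE rather than under the unbound listener.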


The bgpd log file shows lots of this:

2018/06/17 12:43:12 BGP: 10.0.1.2 [FSM] Timer (keepalive timer expire)
2018/06/17 12:43:12 BGP: 10.0.2.2 [FSM] Timer (keepalive timer expire)
2018/06/17 12:43:13 BGP: 10.0.0.2 [FSM] Timer (connect timer expire)
2018/06/17 12:43:13 BGP: 10.0.0.2 [FSM] ConnectRetry_timer_expired (Active->Connect), fd -1
2018/06/17 12:43:13 BGP: 10.0.0.2 [Event] Connect start to 10.0.0.2 fd 28
2018/06/17 12:43:13 BGP: 10.0.0.2 [FSM] Non blocking connect waiting result, fd 28
2018/06/17 12:43:13 BGP: 10.0.0.2 went from Active to Connect
2018/06/17 12:43:13 BGP: 10.0.0.2 [Event] Connect failed 104(Connection reset by peer)
2018/06/17 12:43:13 BGP: 10.0.0.2 [FSM] TCP_connection_open_failed (Connect->Active), fd 28
2018/06/17 12:43:13 BGP: 10.0.0.2 went from Connect to Active
2018/06/17 12:44:12 BGP: 10.0.1.2 [FSM] Timer (keepalive timer expire)
2018/06/17 12:44:12 BGP: 10.0.2.2 [FSM] Timer (keepalive timer expire)

Using kernel v4.12, the topology above works normally, so I'm afraid this might be a bug introduced recently in the Linux kernel. Once I have some time I'll try to do a git bisect and find the offending commit. For now the workaround is to either a) use an older kernel or b) revert commit 7e0c80e.

aderumier commented 6 years ago

I have tested with kernel 4.13.16 and it's working fine, so it must be the same bug.

louberger commented 6 years ago

I have a speculative workaround in mind. Are you willing to try it?



donaldsharp commented 6 years ago

If we are having kernel version issues, we should get a kernel person involved to make sure nothing more serious is going on.

aderumier commented 6 years ago

Tested kernel 4.14.0: it doesn't work.

Edit: 4.14-rc1 doesn't work. 4.17 doesn't work either.

@louberger : I have time to test tomorrow if needed.

louberger commented 6 years ago

please see if #2475 fixes your issue (with net.ipv4.tcp_l3mdev_accept=1)

aderumier commented 6 years ago

@louberger

Thanks, #2475 fixes it for me (kernel >= 4.14 + net.ipv4.tcp_l3mdev_accept=1).

It also works on the 4.13 kernel, with or without net.ipv4.tcp_l3mdev_accept=1.

louberger commented 6 years ago

Thank you for the test results! While this is a good change to have for the long term, we should also get with the kernel folks to understand what happened in 4.14...



aderumier commented 6 years ago

@louberger maybe

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-4.14.y&id=19a2afbea89f93d0e4ac09f8c4720c8afcfb5e6a

or https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-4.14.y&id=9bcb5a572fd6aed8fd1974ea24830f8a657cbfa2

?

rwestphal commented 6 years ago

@aderumier those seem to be good guesses.

So, I've managed to reproduce the same problem using a simple netcat-like program, which confirms this is a kernel issue.

One interesting thing is this: if the TCP socket for the default VRF is the last one to be opened, then everything works perfectly.

However, if you change your config from this:

router bgp 1234
 [snip]
!
router bgp 1234 vrf vrf1
 [snip]
!

To this:

router bgp 1234 vrf vrf1
 [snip]
!
router bgp 1234
 [snip]
!

Nothing will change because the VRF sockets are created only after bgpd establishes a connection to zebra. So the workaround would be to configure the main BGP instance using vtysh or telnet after the BGP VRF instances are configured.
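The socket layout under discussion (one listener scoped to each VRF plus one unbound listener, opened in a chosen order) can be sketched as follows. This is not the netcat-like program mentioned above, only a rough stand-in for experiments. It assumes Linux, root privileges (or CAP_NET_RAW) for SO_BINDTODEVICE, an existing VRF device named vrf1 as in the original config, and an arbitrary test port 1790. Setting SO_REUSEPORT so that the listeners can share a port is also an assumption of this sketch.

```python
import socket

# Linux-specific option; fall back to its numeric value if the constant is missing.
SO_BINDTODEVICE = getattr(socket, "SO_BINDTODEVICE", 25)


def open_listener(port, device=None, backlog=128):
    """Open an IPv4 TCP listener on all addresses, optionally scoped to a device."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        # Assumption: lets VRF-scoped and unbound listeners coexist on one port.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        if device is not None:
            # Fails with OSError if the device is missing or privileges are lacking.
            sock.setsockopt(
                socket.SOL_SOCKET, SO_BINDTODEVICE, device.encode() + b"\0"
            )
        sock.bind(("0.0.0.0", port))
        sock.listen(backlog)
    except OSError:
        sock.close()
        raise
    return sock


def open_in_order(port, devices):
    """Open listeners in the given order; None means the unbound socket."""
    socks = []
    try:
        for device in devices:
            socks.append(open_listener(port, device))
    except OSError:
        for sock in socks:
            sock.close()
        raise
    return socks


if __name__ == "__main__":
    # VRF socket first, unbound (default VRF) socket last. Swap the order to
    # compare behaviour. "vrf1" and port 1790 are example values.
    listeners = open_in_order(1790, ["vrf1", None])
    input("listening on port 1790; press Enter to exit ")
    for listener in listeners:
        listener.close()
```

Connecting from a peer in the default VRF with each ordering, and watching `ss -tan4` at the same time, is one way to check whether a given kernel shows the order dependence described above.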

@donaldsharp wrote: "If we are having kernel version issues, we should get a kernel person involved to make sure nothing more serious is going on."

Definitely a good idea :)

dsahern commented 6 years ago

said kernel person is here .... I am missing something about the problem: are you saying bgpd has per-VRF sockets (a socket bound to each VRF bgp is configured to use) AND a global (not bound to anything) socket?

donaldsharp commented 6 years ago

After discussions on Slack: this is a kernel issue introduced in 4.14, and a fix has been put forward for it. We now need to get it backported (in progress).

The workaround while we are waiting is to use #2475 (with net.ipv4.tcp_l3mdev_accept=1).

aderumier commented 6 years ago

@donaldsharp @louberger

Hi, do we have any news from the kernel developers about this bug? Is there a reference for the kernel bug?

louberger commented 6 years ago

It should be in a forthcoming version of the kernel

https://patchwork.ozlabs.org/patch/931179/

4.14.57 https://lkml.org/lkml/2018/7/20/544

4.17.9 https://lore.kernel.org/patchwork/patch/965438/
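The version information collected in this thread can be folded into a small check. This is a rough sketch based only on what was reported here: 4.13.16 worked, 4.14-rc1 and later failed, and the fix is listed for 4.14.57 and 4.17.9. Other series, and distribution kernels with backports, are reported as "unknown". The function names are made up for this example.

```python
import platform
import re

# Stable releases named in this thread as carrying the fix.
FIXED_IN = {(4, 14): 57, (4, 17): 9}
# First series reported as failing (4.13.16 worked, 4.14-rc1 did not).
FIRST_BAD = (4, 14)


def parse_release(release):
    """Extract (major, minor, patch) from a string like '4.15.17' or '4.14rc1'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def classify(release):
    """Return 'not affected', 'affected', 'fixed' or 'unknown' for a release."""
    major, minor, patch = parse_release(release)
    series = (major, minor)
    if series < FIRST_BAD:
        return "not affected"
    if series in FIXED_IN:
        return "fixed" if patch >= FIXED_IN[series] else "affected"
    # No fixed release was named for this series in the thread (for example,
    # 4.15.17 was reported as failing), and vendor kernels may carry backports.
    return "unknown"


if __name__ == "__main__":
    running = platform.release()
    print(running, "->", classify(running))
```

A result of "unknown" means the thread does not say; testing the setup directly is the only reliable answer there.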

Lou


pacovn commented 5 years ago

@aderumier Hi, we had a similar issue with an L3VPN setup using Docker containers: BGP sessions were not established, even though the neighbors did the bind/listen/accept correctly, because the VRF was not forwarding. In our case, using Ubuntu 16.04 in a test environment, the bug was present up to kernel 4.15.0-45 and was fixed in kernel 4.15.0-46 (4.15.0-46 changelog). Without containers it worked OK (with 4.15.0-45 we also tried "privileged" and "super privileged" containers, no luck either). So maybe there were many VRF corner cases affecting different things. Hope it helps :-)