FRRouting / frr

The FRRouting Protocol Suite
https://frrouting.org/

MTU issue on GRE tunnels when bgpd daemon is running #5201

Closed jdrouvroy closed 4 years ago

jdrouvroy commented 5 years ago

Describe the bug

First of all, I'll explain my topology. I have 3 sites with 2 routers per site (in fact we have 9 sites, but only 3 are needed to understand the problem). The orange router is the main router for each site (the default gateway for servers), and the grey one is the backup router (via VRRP).

In the figure, each green link is an IPsec VPN tunnel set up with strongSwan. Inside each of those tunnels there is a GRE interface, which forms a GRE tunnel between the nodes on each side of the IPsec VPN (tunnel and GRE IPs are in black in the figure). Each node has a loopback address (in blue) and a LAN IP (in green). Each node also has a BGP configuration to announce the routable networks.

I have a weird MTU issue when 2 servers need to communicate through this infrastructure. When I ping server 10.55.0.69 from 172.16.97.2 (the black servers in the figure), everything works correctly and I can see traffic entering and leaving the routers.

ping 10.55.0.69
PING 10.55.0.69 (10.55.0.69) 56(84) bytes of data.
64 bytes from 10.55.0.69: icmp_seq=1 ttl=60 time=13.6 ms
64 bytes from 10.55.0.69: icmp_seq=2 ttl=60 time=13.9 ms
64 bytes from 10.55.0.69: icmp_seq=3 ttl=60 time=13.9 ms
64 bytes from 10.55.0.69: icmp_seq=4 ttl=60 time=14.1 ms
^C
--- 10.55.0.69 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3002ms
rtt min/avg/max/mdev = 13.675/13.947/14.169/0.228 ms

On the orange router on site 8 (closest to the destination server) I took this capture (everything is OK):

tcpdump -ni ens160 icmp and host 10.55.0.69
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens160, link-type EN10MB (Ethernet), capture size 262144 bytes
08:25:48.562284 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 37119, seq 1, length 64
08:25:48.562389 IP 10.55.0.69 > 172.16.97.2: ICMP echo reply, id 37119, seq 1, length 64
08:25:49.563329 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 37119, seq 2, length 64
08:25:49.563449 IP 10.55.0.69 > 172.16.97.2: ICMP echo reply, id 37119, seq 2, length 64
08:25:50.564448 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 37119, seq 3, length 64
08:25:50.564593 IP 10.55.0.69 > 172.16.97.2: ICMP echo reply, id 37119, seq 3, length 64
08:25:51.565591 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 37119, seq 4, length 64
08:25:51.565736 IP 10.55.0.69 > 172.16.97.2: ICMP echo reply, id 37119, seq 4, length 64
^C
8 packets captured
8 packets received by filter
0 packets dropped by kernel

Now I send the same ping from the 172.16.97.2 server, with the don't-fragment flag set and a 1300-byte payload:

ping 10.55.0.69 -s 1300 -M do
PING 10.55.0.69 (10.55.0.69) 1300(1328) bytes of data.
From 10.255.0.73 icmp_seq=1 Frag needed and DF set (mtu = 894)
ping: local error: Message too long, mtu=894
ping: local error: Message too long, mtu=894
ping: local error: Message too long, mtu=894
ping: local error: Message too long, mtu=894
ping: local error: Message too long, mtu=894
ping: local error: Message too long, mtu=894
ping: local error: Message too long, mtu=894
^C
--- 10.55.0.69 ping statistics ---
8 packets transmitted, 0 received, +8 errors, 100% packet loss, time 7040ms
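For readers following the numbers: the `1300(1328)` printed by ping is the payload plus the ICMP and IPv4 headers. The sketch below is only illustrative arithmetic (the function names are made up for this note, and it assumes an IPv4 header without options). It shows how a `-s` value relates to the on-wire packet size and to a reported path MTU.

```shell
#!/bin/sh
# Assumption: IPv4 header without options (20 bytes) and an
# ICMP echo header (8 bytes).
IP_HDR=20
ICMP_HDR=8

# Total IP packet size produced by `ping -s <payload>`.
ping_packet_size() {
    echo $(( $1 + IP_HDR + ICMP_HDR ))
}

# Largest `-s` value that fits a given path MTU without fragmentation.
max_ping_payload() {
    echo $(( $1 - IP_HDR - ICMP_HDR ))
}

ping_packet_size 1300   # matches the "1300(1328)" shown by ping
max_ping_payload 894    # largest payload that would fit mtu = 894
```

With a reported MTU of 894, any `-s` value above 866 combined with `-M do` would be expected to fail locally, which matches the "Message too long" errors above.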

Network capture on orange router on site 7

tcpdump -ni any host 172.16.97.2 and icmp tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes 08:32:27.586718 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 37658, seq 1, length 1308 08:32:27.586751 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 37658, seq 1, length 1308 08:32:27.598692 IP 10.255.0.73 > 172.16.97.2: ICMP 10.55.0.69 unreachable - need to frag (mtu 894), length 556
08:32:27.598705 IP 10.255.0.73 > 172.16.97.2: ICMP 10.55.0.69 unreachable - need to frag (mtu 894), length 556
^C 4 packets captured 4 packets received by filter 0 packets dropped by kernel

Network capture on orange router on site 2

tcpdump -ni any host 172.16.97.2 and icmp tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes 08:32:27.596503 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 37658, seq 1, length 1308 08:32:27.596527 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 37658, seq 1, length 1308 08:32:27.596548 IP 10.255.0.73 > 172.16.97.2: ICMP 10.55.0.69 unreachable - need to frag (mtu 894), length 556
^C 3 packets captured 3 packets received by filter 0 packets dropped by kernel

Get route and MTU on orange router on Site 7

ip route get to 10.55.0.69 10.55.0.69 via 10.255.0.73 dev tunnel18 src 10.255.0.74 uid 0
cache expires 309sec mtu 1398

Get route and MTU on orange router on Site 2

ip route get to 10.55.0.69 10.55.0.69 via 10.255.0.97 dev tunnel24 src 10.255.0.98 uid 0 cache expires 329sec mtu 894

Let's try to reduce the ping size to 800 bytes (because of the cache expires 329sec mtu 894 in previous command)

ping 10.55.0.69 -s 800 -M do PING 10.55.0.69 (10.55.0.69) 800(828) bytes of data. From 10.255.0.73 icmp_seq=1 Frag needed and DF set (mtu = 606) ping: local error: Message too long, mtu=606 ping: local error: Message too long, mtu=606 ping: local error: Message too long, mtu=606 ping: local error: Message too long, mtu=606 ^C --- 10.55.0.69 ping statistics --- 5 packets transmitted, 0 received, +5 errors, 100% packet loss, time 4024ms

Network capture on orange router on site 7

tcpdump -ni any host 172.16.97.2 and icmp tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes 08:39:42.735130 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 38450, seq 1, length 808 08:39:42.735143 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 38450, seq 1, length 808 08:39:42.746899 IP 10.255.0.73 > 172.16.97.2: ICMP 10.55.0.69 unreachable - need to frag (mtu 606), length 556
08:39:42.746911 IP 10.255.0.73 > 172.16.97.2: ICMP 10.55.0.69 unreachable - need to frag (mtu 606), length 556
^C 4 packets captured 4 packets received by filter 0 packets dropped by kernel

ip route get to 10.55.0.69 10.55.0.69 via 10.255.0.73 dev tunnel18 src 10.255.0.74 uid 0 cache expires 582sec mtu 1398

Network capture on orange router on site 2

tcpdump -ni any host 172.16.97.2 and icmp tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes 08:39:42.745058 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 38450, seq 1, length 808 08:39:42.745126 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 38450, seq 1, length 808 08:39:42.745160 IP 10.255.0.73 > 172.16.97.2: ICMP 10.55.0.69 unreachable - need to frag (mtu 606), length 556
^C 3 packets captured 3 packets received by filter 0 packets dropped by kernel

ip route get to 10.55.0.69 10.55.0.69 via 10.255.0.97 dev tunnel24 src 10.255.0.98 uid 0 cache expires 582sec mtu 606

Strangely, MTU decreased from 894 to 606 bytes

Also, during another test I saw something weird (it seems the MTU was updated dynamically while the ping was running):

ping 10.55.0.69 -s 1010 -M do PING 10.55.0.69 (10.55.0.69) 1010(1038) bytes of data. 1018 bytes from 10.55.0.69: icmp_seq=1 ttl=60 time=13.7 ms 1018 bytes from 10.55.0.69: icmp_seq=2 ttl=60 time=13.8 ms 1018 bytes from 10.55.0.69: icmp_seq=3 ttl=60 time=14.1 ms 1018 bytes from 10.55.0.69: icmp_seq=4 ttl=60 time=14.2 ms 1018 bytes from 10.55.0.69: icmp_seq=5 ttl=60 time=13.9 ms From 10.255.0.73 icmp_seq=6 Frag needed and DF set (mtu = 966) ping: local error: Message too long, mtu=966 ping: local error: Message too long, mtu=966 ping: local error: Message too long, mtu=966 ping: local error: Message too long, mtu=966 ping: local error: Message too long, mtu=966 ^C --- 10.55.0.69 ping statistics --- 11 packets transmitted, 5 received, +6 errors, 54% packet loss, time 10040ms rtt min/avg/max/mdev = 13.759/14.000/14.297/0.226 ms

Here is FRR configuration of orange router on SITE 2:

frr version 7.1 frr defaults traditional hostname $orange_router_site_2
log file /var/log/frr/bgpd.log informational log stdout log syslog informational service password-encryption no ipv6 forwarding service integrated-vtysh-config ! password *** enable password **** ! ip route 10.255.2.11/32 10.255.0.1 ip route 10.255.2.71/32 10.255.0.74 ip route 10.255.2.81/32 10.255.0.97 ip route 10.255.2.91/32 10.255.0.113 ip route 10.255.2.31/32 10.255.0.14 ip route 10.255.2.41/32 10.255.0.18 ip route 10.255.1.0/24 10.255.255.2 ip route 10.255.2.22/32 10.255.255.2 ! interface eth0 description Ethernet interface ip address 10.58.95.4/28 ! interface lo description Loopback interface
ip address 10.255.2.21/32 ! interface tunnel0 description BGPoGREoIPSEC ip address 10.255.0.2/30 ! interface tunnel1000 description GRE interface for router onsite ip address 10.255.255.1/30 ! interface tunnel18 description BGPoGREoIPSEC *
ip address 10.255.0.73/30 ! interface tunnel24 description BGPoGREoIPSEC ** ip address 10.255.0.98/30 ! interface tunnel28 description BGPoGREoIPSEC **** ip address 10.255.0.114/30 ! interface tunnel3 description BGPoGREoIPSEC ***** ip address 10.255.0.13/30 ! interface tunnel4 description BGPoGREoIPSEC ** ip address 10.255.0.17/30 ! router bgp 65002 bgp router-id 10.255.2.21 neighbor CLOUD-INTERCONNECT peer-group neighbor CLOUD-INTERCONNECT ebgp-multihop 5 neighbor CLOUD-INTERCONNECT update-source lo neighbor CLOUD-INTERCONNECT timers 1 3 ! router bgp 65002 neighbor 10.255.2.11 remote-as 65001 neighbor 10.255.2.11 peer-group CLOUD-INTERCONNECT neighbor 10.255.2.11 description eBGP Session with ! router bgp 65002 neighbor 10.255.2.71 remote-as 65007 neighbor 10.255.2.71 peer-group CLOUD-INTERCONNECT neighbor 10.255.2.71 description eBGP Session with ! router bgp 65002 neighbor 10.255.2.81 remote-as 65008 neighbor 10.255.2.81 peer-group CLOUD-INTERCONNECT neighbor 10.255.2.81 description eBGP Session with ! router bgp 65002 neighbor 10.255.2.91 remote-as 65009 neighbor 10.255.2.91 peer-group CLOUD-INTERCONNECT neighbor 10.255.2.91 description eBGP Session with ! router bgp 65002 neighbor 10.255.2.31 remote-as 65003 neighbor 10.255.2.31 peer-group CLOUD-INTERCONNECT neighbor 10.255.2.31 description eBGP Session with ! router bgp 65002 neighbor 10.255.2.41 remote-as 65004 neighbor 10.255.2.41 peer-group CLOUD-INTERCONNECT neighbor 10.255.2.41 description eBGP Session with ! router bgp 65002 neighbor 10.255.2.22 remote-as 65002 neighbor 10.255.2.22 description iBGP Session with router on same site neighbor 10.255.2.22 update-source lo ! address-family ipv4 unicast neighbor 10.255.2.11 route-map US-EU-FILTER-IN in exit-address-family ! address-family ipv4 unicast neighbor 10.255.2.31 route-map EU-ASIA-FILTER-IN in exit-address-family ! address-family ipv4 unicast neighbor 10.255.2.22 next-hop-self neighbor 10.255.2.22 soft-reconfiguration inbound exit-address-family ! 
address-family ipv4 unicast network 10.58.64.0/19 neighbor CLOUD-INTERCONNECT soft-reconfiguration inbound neighbor CLOUD-INTERCONNECT maximum-prefix 10000 exit-address-family ! route-map EU-ASIA-FILTER-IN permit 10 set as-path prepend last-as 3 ! route-map US-ASIA-FILTER-IN permit 10 set as-path prepend last-as 2 ! route-map US-EU-FILTER-IN permit 10 set as-path prepend last-as 1 ! line vty !

Here is FRR configuration of orange router on SITE 7:

frr version 7.1 frr defaults traditional hostname $orange_router_site_2 log file /var/log/frr/bgpd.log informational log stdout log syslog informational service password-encryption no ipv6 forwarding service integrated-vtysh-config ! password **** enable password * ! ip route 10.255.2.21/32 10.255.0.73 ip route 10.255.2.41/32 10.255.0.77 ip route 10.255.1.0/24 10.255.255.2 ip route 10.255.2.72/32 10.255.255.2 ip route 172.16.97.2/32 10.56.255.14 ! interface ens160 description Ethernet interface ip address 10.56.255.4/28 ! interface lo description Loopback interface
ip address 10.255.2.71/32 ! interface tunnel1000 description GRE interface for router onsite ip address 10.255.255.1/30 ! interface tunnel18 description BGPoGREoIPSEC ***
ip address 10.255.0.74/30 ! interface tunnel19 description BGPoGREoIPSEC **** ip address 10.255.0.78/30 ! router bgp 65007 bgp router-id 10.255.2.71 neighbor CLOUD-INTERCONNECT peer-group neighbor CLOUD-INTERCONNECT ebgp-multihop 5 neighbor CLOUD-INTERCONNECT update-source lo neighbor CLOUD-INTERCONNECT timers 1 3 ! router bgp 65007 neighbor 10.255.2.21 remote-as 65002 neighbor 10.255.2.21 peer-group CLOUD-INTERCONNECT neighbor 10.255.2.21 description eBGP Session with ! router bgp 65007 neighbor 10.255.2.41 remote-as 65004 neighbor 10.255.2.41 peer-group CLOUD-INTERCONNECT neighbor 10.255.2.41 description eBGP Session with ! router bgp 65007 neighbor 10.255.2.72 remote-as 65007 neighbor 10.255.2.72 description iBGP Session with router on same site neighbor 10.255.2.72 update-source lo ! address-family ipv4 unicast neighbor 10.255.2.72 next-hop-self neighbor 10.255.2.72 soft-reconfiguration inbound exit-address-family ! address-family ipv4 unicast network 10.56.0.0/19 network 172.16.97.2/32 neighbor CLOUD-INTERCONNECT soft-reconfiguration inbound neighbor CLOUD-INTERCONNECT maximum-prefix 10000 exit-address-family ! route-map EU-ASIA-FILTER-IN permit 10 set as-path prepend last-as 3 ! route-map US-ASIA-FILTER-IN permit 10 set as-path prepend last-as 2 ! route-map US-EU-FILTER-IN permit 10 set as-path prepend last-as 1 ! line vty !

My question is: why is the MTU dynamically decreased like that? I'm aware that GRE encapsulation adds overhead, but that doesn't explain why it decreases so much. I hope I've provided enough information to troubleshoot my issue :)
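To put a number on the GRE overhead mentioned here, the following is a rough sketch rather than a statement about this exact setup. It assumes GRE over IPv4 with no key, checksum or sequence options (a 20-byte outer IPv4 header plus a 4-byte GRE header) and a 1500-byte underlying link. The function names are hypothetical.

```shell
#!/bin/sh
# Assumption: plain GRE over IPv4, no optional GRE fields.
OUTER_IPV4_HDR=20
GRE_BASE_HDR=4

# MTU left inside the GRE tunnel for a given underlying MTU.
gre_tunnel_mtu() {
    echo $(( $1 - OUTER_IPV4_HDR - GRE_BASE_HDR ))
}

# Reverse direction: underlying path MTU implied by an MTU seen
# inside the tunnel (for example a cached route MTU).
implied_underlay_mtu() {
    echo $(( $1 + OUTER_IPV4_HDR + GRE_BASE_HDR ))
}

gre_tunnel_mtu 1500         # tunnel MTU over a 1500-byte link
implied_underlay_mtu 1398   # underlay MTU implied by "mtu 1398"
```

Under these assumptions GRE alone accounts for 24 bytes. The cached value of 1398 would imply roughly 1422 bytes available to the GRE packet, and the remaining gap to 1500 is plausibly IPsec overhead, although that depends on the cipher and mode and is not confirmed in this thread. Neither figure comes close to explaining values such as 894 or 606.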

Thanks in advance

(put "x" in "[ ]" if you already tried following) [x] Did you check if this is a duplicate issue? [ ] Did you test it on the latest FRRouting/frr master branch?

Versions

jdrouvroy commented 5 years ago

image

qlyoung commented 5 years ago

@jdrouvroy we're not quite understanding how this is an FRR bug - as FRR does not create GRE tunnels, just uses existing ones, this sounds like a misconfig in your tunnel setup and not something related to FRR per se.

louberger commented 5 years ago

FRR doesn't control interface MTU -- check that your interface and tunnel MTUs are set properly at the OS level. You may find additional useful info at https://www.google.com/search?q=gre+over+ipsec+set+tunnel+mtu

bisdhdh commented 4 years ago

@jdrouvroy Please verify if path MTU discovery is enabled on that tunnel interface

Current value: sysctl net.ipv4.ip_no_pmtu_disc
Set it to 1: sysctl -w net.ipv4.ip_no_pmtu_disc=1
Check: sysctl net.ipv4.ip_no_pmtu_disc

jdrouvroy commented 4 years ago

Hi @bisdhdh,

Thank you for your reply. I disabled path MTU discovery on each router, but I still have the same issue :(

louberger commented 4 years ago

What's the MTU on the GRE interface? Does sending up to that size work without fragmentation?



jdrouvroy commented 4 years ago

Hi @louberger,

I'm sorry, but I do not understand your question ^^

jdrouvroy commented 4 years ago

up @louberger ;)

qlyoung commented 4 years ago

@jdrouvroy Lou updated his question

jdrouvroy commented 4 years ago

Sorry @qlyoung, I didn't see the edited post.

So, before pinging, I ran these commands on the site 7 orange router (the MTU is set to 1476 for this GRE interface):

ip route get to 10.55.0.69 10.55.0.69 via 10.255.0.73 dev tunnel18 src 10.255.0.74 uid 0 cache

ip a | grep -A 3 tunnel18 7: tunnel18@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1476 qdisc noqueue state UNKNOWN group default qlen 1000 link/gre 10.56.255.4 peer 10.58.95.4 inet 10.255.0.74/30 brd 10.255.0.75 scope global tunnel18 valid_lft forever preferred_lft forever inet6 fe80::5efe:a38:ff04/64 scope link valid_lft forever preferred_lft forever

Same commands on site 2 orange router (MTU is also 1476):

ip route get to 10.55.0.69 10.55.0.69 via 10.255.0.97 dev tunnel24 src 10.255.0.98 uid 4001 cache

ip a | grep -A 3 tunnel24 9: tunnel24@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1476 qdisc noqueue state UNKNOWN group default qlen 1000 link/gre 10.58.95.4 peer 10.55.0.4 inet 10.255.0.98/30 brd 10.255.0.99 scope global tunnel24 valid_lft forever preferred_lft forever inet6 fe80::5efe:a3a:5f04/64 scope link valid_lft forever preferred_lft forever

Then, I ping (from 172.16.97.2), and everything is OK:

ping 10.55.0.69 PING 10.55.0.69 (10.55.0.69) 56(84) bytes of data. 64 bytes from 10.55.0.69: icmp_seq=1 ttl=60 time=13.7 ms 64 bytes from 10.55.0.69: icmp_seq=2 ttl=60 time=13.7 ms 64 bytes from 10.55.0.69: icmp_seq=3 ttl=60 time=13.7 ms 64 bytes from 10.55.0.69: icmp_seq=4 ttl=60 time=13.5 ms 64 bytes from 10.55.0.69: icmp_seq=5 ttl=60 time=13.6 ms 64 bytes from 10.55.0.69: icmp_seq=6 ttl=60 time=13.7 ms ^C --- 10.55.0.69 ping statistics --- 6 packets transmitted, 6 received, 0% packet loss, time 5007ms rtt min/avg/max/mdev = 13.584/13.707/13.762/0.162 ms

Next, I ping with a 1476-byte payload (no response):

ping 10.55.0.69 -s 1476 PING 10.55.0.69 (10.55.0.69) 1476(1504) bytes of data. ^C --- 10.55.0.69 ping statistics --- 10 packets transmitted, 0 received, 100% packet loss, time 9063ms

And finally, I ping with a smaller payload (no response):

ping 10.55.0.69 -s 1000 PING 10.55.0.69 (10.55.0.69) 1000(1028) bytes of data. ^C --- 10.55.0.69 ping statistics --- 5 packets transmitted, 0 received, 100% packet loss, time 4024ms

let's capture on Site 7 orange router :

tcpdump -ni any host 10.55.0.69 tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
19:50:51.399026 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 7302, seq 1, length 1480
19:50:51.399037 IP 172.16.97.2 > 10.55.0.69: ip-proto-1 19:50:51.399058 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 7302, seq 1, length 1456
19:50:51.399074 IP 172.16.97.2 > 10.55.0.69: ip-proto-1 19:50:52.398905 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 7302, seq 2, length 1480
19:50:52.398917 IP 172.16.97.2 > 10.55.0.69: ip-proto-1 19:50:52.398938 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 7302, seq 2, length 1456
19:50:52.398957 IP 172.16.97.2 > 10.55.0.69: ip-proto-1 19:50:53.398752 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 7302, seq 3, length 1480 19:50:53.398763 IP 172.16.97.2 > 10.55.0.69: ip-proto-1 19:50:53.398782 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 7302, seq 3, length 1456 19:50:53.398801 IP 172.16.97.2 > 10.55.0.69: ip-proto-1 19:50:54.398792 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 7302, seq 4, length 1480 19:50:54.398802 IP 172.16.97.2 > 10.55.0.69: ip-proto-1 19:50:54.398818 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 7302, seq 4, length 1456 19:50:54.398837 IP 172.16.97.2 > 10.55.0.69: ip-proto-1 ^C 16 packets captured 16 packets received by filter 0 packets dropped by kernel

0 ✓ bgp01p ~# ip route get to 10.55.0.69 10.55.0.69 via 10.255.0.73 dev tunnel18 src 10.255.0.74 uid 0 cache expires 578sec mtu 1398

0 ✓ bgp01p ~# ip route get to 10.55.0.69 10.55.0.69 via 10.255.0.73 dev tunnel18 src 10.255.0.74 uid 0 cache expires 568sec mtu 1398

0 ✓ bgp01p ~# tcpdump -ni any host 10.55.0.69 tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes 19:51:47.771882 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 7351, seq 1, length 1008 19:51:47.771899 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 7351, seq 1, length 1008 19:51:48.780808 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 7351, seq 2, length 1008 19:51:48.780835 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 7351, seq 2, length 1008 ... 10 packets captured 10 packets received by filter 0 packets dropped by kernel 0 ✓ bgp01p ~# ip route get to 10.55.0.69 10.55.0.69 via 10.255.0.73 dev tunnel18 src 10.255.0.74 uid 0 cache expires 525sec mtu 1398

let's capture on Site 2 orange router :

0 ✓ bgp01p ~# ip route get to 10.55.0.69 10.55.0.69 via 10.255.0.97 dev tunnel24 src 10.255.0.98 uid 0 cache

0 ✓ bgp01p ~# tcpdump -ni any host 10.55.0.69 tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
19:50:51.407561 IP 172.16.97.2 > 10.55.0.69: ip-proto-1 19:50:52.407384 IP 172.16.97.2 > 10.55.0.69: ip-proto-1 19:50:53.407293 IP 172.16.97.2 > 10.55.0.69: ip-proto-1 19:50:54.407270 IP 172.16.97.2 > 10.55.0.69: ip-proto-1 ^C 4 packets captured 4 packets received by filter 0 packets dropped by kernel

0 ✓ bgp01p ~# ip route get to 10.55.0.69 10.55.0.69 via 10.255.0.97 dev tunnel24 src 10.255.0.98 uid 0 cache

0 ✓ bgp01p ~# tcpdump -ni any host 10.55.0.69 tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
19:51:47.780406 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 7351, seq 1, length 1008
19:51:47.780420 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 7351, seq 1, length 1008
19:51:48.789382 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 7351, seq 2, length 1008
19:51:48.789393 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 7351, seq 2, length 528
19:51:48.789407 IP 172.16.97.2 > 10.55.0.69: ip-proto-1 19:51:49.797294 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 7351, seq 3, length 1008
19:51:49.797308 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 7351, seq 3, length 528
19:51:49.797325 IP 172.16.97.2 > 10.55.0.69: ip-proto-1 19:51:50.797111 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 7351, seq 4, length 1008
19:51:50.797146 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 7351, seq 4, length 528
19:51:50.797177 IP 172.16.97.2 > 10.55.0.69: ip-proto-1 19:51:51.805354 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 7351, seq 5, length 1008
19:51:51.805368 IP 172.16.97.2 > 10.55.0.69: ICMP echo request, id 7351, seq 5, length 528
19:51:51.805384 IP 172.16.97.2 > 10.55.0.69: ip-proto-1 ^C 14 packets captured 14 packets received by filter 0 packets dropped by kernel

0 ✓ bgp01p ~# ip route get to 10.55.0.69 10.55.0.69 via 10.255.0.97 dev tunnel24 src 10.255.0.98 uid 0 cache expires 581sec mtu lock 552

And a capture on the site 8 router:

tcpdump -ni any host 10.55.0.69 tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes 20:08:47.759564 IP 172.16.97.2 > 10.55.0.69: ip-proto-1 20:08:48.759505 IP 172.16.97.2 > 10.55.0.69: ip-proto-1 20:08:49.759729 IP 172.16.97.2 > 10.55.0.69: ip-proto-1 20:08:50.759803 IP 172.16.97.2 > 10.55.0.69: ip-proto-1 20:08:51.759760 IP 172.16.97.2 > 10.55.0.69: ip-proto-1 20:08:52.759729 IP 172.16.97.2 > 10.55.0.69: ip-proto-1 20:08:53.759644 IP 172.16.97.2 > 10.55.0.69: ip-proto-1 20:08:54.759501 IP 172.16.97.2 > 10.55.0.69: ip-proto-1 ^C 8 packets captured 8 packets received by filter 0 packets dropped by kernel

I don't understand why the router on site 2 decreases the MTU to 552 for this route ...

jdrouvroy commented 4 years ago

Hello,

We ran more tests this morning and found some interesting things.

To make the problem easier to understand, let's consider this case: we want to ping the remote GRE interface through the GRE tunnel between site 2 and site 8.

Reminder:
10.255.0.98 is the site 2 router's GRE IP towards site 8
10.255.0.97 is the site 8 router's GRE IP towards site 2
The GRE tunnel is encapsulated in the IPsec VPN tunnel.

Configuration for tunnel24 on site 2 side(gre tunnel between nodes)

ifconfig tunnel24 tunnel24: flags=209<UP,POINTOPOINT,RUNNING,NOARP> mtu 1476 inet 10.255.0.98 netmask 255.255.255.252 destination 10.255.0.98 inet6 fe80::5efe:a3a:5f04 prefixlen 64 scopeid 0x20 unspec 0A-3A-5F-04-6E-6F-00-00-00-00-00-00-00-00-00-00 txqueuelen 1000 (UNSPEC) RX packets 1119125 bytes 249852565 (249.8 MB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 1082029 bytes 62239223 (62.2 MB) TX errors 825 dropped 13218 overruns 0 carrier 0 collisions 0

Config file for GRE interface :

cat /etc/network/interfaces.d/tunnel24.cfg
auto tunnel24
iface tunnel24 inet static
    address 10.255.0.98
    netmask 255.255.255.252
    broadcast 10.255.0.99
    up ifconfig tunnel24
    pre-up ip tunnel add tunnel24 mode gre remote 10.55.0.4 local 10.58.95.4 ttl 64

Flush route cache to reset cached MTU

ip route flush cache

Cache is empty

ip route get to 10.255.0.97 10.255.0.97 dev tunnel24 src 10.255.0.98 uid 0 cache

Try to ping remote GRE interface:

ping 10.255.0.97 PING 10.255.0.97 (10.255.0.97) 56(84) bytes of data. 64 bytes from 10.255.0.97: icmp_seq=1 ttl=64 time=2.15 ms 64 bytes from 10.255.0.97: icmp_seq=2 ttl=64 time=2.27 ms 64 bytes from 10.255.0.97: icmp_seq=3 ttl=64 time=1.94 ms 64 bytes from 10.255.0.97: icmp_seq=4 ttl=64 time=2.04 ms 64 bytes from 10.255.0.97: icmp_seq=5 ttl=64 time=1.95 ms 64 bytes from 10.255.0.97: icmp_seq=6 ttl=64 time=1.94 ms 64 bytes from 10.255.0.97: icmp_seq=7 ttl=64 time=2.07 ms 64 bytes from 10.255.0.97: icmp_seq=8 ttl=64 time=2.11 ms 64 bytes from 10.255.0.97: icmp_seq=9 ttl=64 time=2.01 ms ^C --- 10.255.0.97 ping statistics --- 9 packets transmitted, 9 received, 0% packet loss, time 8009ms rtt min/avg/max/mdev = 1.945/2.058/2.275/0.110 ms

Get route cache for this remote host

ip route get to 10.255.0.97 10.255.0.97 dev tunnel24 src 10.255.0.98 uid 0 cache expires 573sec mtu lock 552

We can see that the cached MTU for this route was automatically updated to 552, and is marked as locked.
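Since the same `ip route get` check is repeated many times in this thread, a small helper can make the comparison less error-prone. This is a sketch: `cached_pmtu` is a hypothetical name, and it only parses text in the format shown above. As background, 552 is the usual Linux default for `net.ipv4.route.min_pmtu`, and `mtu lock` generally appears when the kernel clamps a learned PMTU to that floor. That can happen because the advertised value was lower, or (per the `ip_no_pmtu_disc` documentation quoted in this thread) because mode 1 sets the PMTU to min_pmtu when a fragmentation-required ICMP arrives. Whether that is what happens here is an assumption.

```shell
#!/bin/sh
# Reads `ip route get` output on stdin and prints the cached PMTU,
# followed by "locked" when the kernel reports `mtu lock N`.
# Prints "none" when no cached MTU is present.
cached_pmtu() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i == "mtu") {
                if ($(i + 1) == "lock") { print $(i + 2), "locked" }
                else                    { print $(i + 1) }
                found = 1
            }
        }
    }
    END { if (!found) print "none" }'
}

# Example with the output captured above:
echo "10.255.0.97 dev tunnel24 src 10.255.0.98 uid 0 cache expires 573sec mtu lock 552" | cached_pmtu
```

On a live router the same function could be fed directly, for example `ip route get to 10.255.0.97 | cached_pmtu`, before and after each ping.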

Try to ping with 1000 bytes payload

ping 10.255.0.97 -M do -s 1000 PING 10.255.0.97 (10.255.0.97) 1000(1028) bytes of data. ping: local error: Message too long, mtu=552 ping: local error: Message too long, mtu=552 ping: local error: Message too long, mtu=552 ^C --- 10.255.0.97 ping statistics --- 3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2023ms

Flush route cache

ip route flush cache

Retry to ping:

ping 10.255.0.97 -M do -s 1000 PING 10.255.0.97 (10.255.0.97) 1000(1028) bytes of data. 1008 bytes from 10.255.0.97: icmp_seq=1 ttl=64 time=2.23 ms 1008 bytes from 10.255.0.97: icmp_seq=2 ttl=64 time=2.25 ms 1008 bytes from 10.255.0.97: icmp_seq=3 ttl=64 time=2.14 ms ping: local error: Message too long, mtu=942 ping: local error: Message too long, mtu=942 ping: local error: Message too long, mtu=942 ^C --- 10.255.0.97 ping statistics --- 7 packets transmitted, 3 received, +3 errors, 57% packet loss, time 6033ms rtt min/avg/max/mdev = 2.142/2.210/2.254/0.048 ms

The ping seems to work, but it stops after a few packets.

Get route cache

0 ✓ [VSBB2]bgp01p ~# ip route get to 10.255.0.97 10.255.0.97 dev tunnel24 src 10.255.0.98 uid 0 cache expires 588sec mtu 942

The MTU was automatically set to 942.

Stop BGPD daemon and restart FRR

grep bgp /etc/frr/daemons bgpd=yes bgpd_options=" -A 127.0.0.1"

sed -i 's/bgpd=yes/bgpd=no/' /etc/frr/daemons

grep bgp /etc/frr/daemons bgpd=no bgpd_options=" -A 127.0.0.1"

systemctl restart frr

Flush route cache

ip route get to 10.255.0.97 10.255.0.97 dev tunnel24 src 10.255.0.98 uid 0 cache expires 436sec mtu 942

ip route flush cache

ip route get to 10.255.0.97 10.255.0.97 dev tunnel24 src 10.255.0.98 uid 0 cache

Retry to ping with 1000 bytes payload

ping 10.255.0.97 -M do -s 1000 PING 10.255.0.97 (10.255.0.97) 1000(1028) bytes of data. 1008 bytes from 10.255.0.97: icmp_seq=1 ttl=64 time=2.31 ms 1008 bytes from 10.255.0.97: icmp_seq=2 ttl=64 time=2.13 ms 1008 bytes from 10.255.0.97: icmp_seq=3 ttl=64 time=2.43 ms 1008 bytes from 10.255.0.97: icmp_seq=4 ttl=64 time=2.61 ms 1008 bytes from 10.255.0.97: icmp_seq=5 ttl=64 time=2.20 ms 1008 bytes from 10.255.0.97: icmp_seq=6 ttl=64 time=2.37 ms 1008 bytes from 10.255.0.97: icmp_seq=7 ttl=64 time=2.56 ms 1008 bytes from 10.255.0.97: icmp_seq=8 ttl=64 time=2.16 ms 1008 bytes from 10.255.0.97: icmp_seq=9 ttl=64 time=2.20 ms 1008 bytes from 10.255.0.97: icmp_seq=10 ttl=64 time=2.04 ms 1008 bytes from 10.255.0.97: icmp_seq=11 ttl=64 time=2.15 ms 1008 bytes from 10.255.0.97: icmp_seq=12 ttl=64 time=2.12 ms 1008 bytes from 10.255.0.97: icmp_seq=13 ttl=64 time=2.22 ms ^C --- 10.255.0.97 ping statistics --- 13 packets transmitted, 13 received, 0% packet loss, time 12015ms rtt min/avg/max/mdev = 2.044/2.273/2.613/0.179 ms

Get route cache information

ip route get to 10.255.0.97 10.255.0.97 dev tunnel24 src 10.255.0.98 uid 0 cache expires 582sec mtu 1398

The MTU now looks normal: 1398 bytes

Let's reactivate BGPD

sed -i 's/bgpd=no/bgpd=yes/' /etc/frr/daemons && grep bgpd /etc/frr/daemons && systemctl restart frr bgpd=yes bgpd_options=" -A 127.0.0.1"

Flush route cache

ip route flush cache

ip route get to 10.255.0.97 10.255.0.97 dev tunnel24 src 10.255.0.98 uid 0 cache

ping 10.255.0.97 -M do -s 1000 PING 10.255.0.97 (10.255.0.97) 1000(1028) bytes of data. ping: local error: Message too long, mtu=552 ping: local error: Message too long, mtu=552 ^C --- 10.255.0.97 ping statistics --- 3 packets transmitted, 0 received, +2 errors, 100% packet loss, time 2040ms

Check route cache

ip route get to 10.255.0.97 10.255.0.97 dev tunnel24 src 10.255.0.98 uid 0 cache expires 593sec mtu lock 552

Another weird thing:

Sometimes the host receives, apparently from itself, what looks like a forged packet: icmp_seq=1 Frag needed and DF set (mtu = 0)

ping -M do -s 501 10.255.0.97 PING 10.255.0.97 (10.255.0.97) 501(529) bytes of data. From 10.255.0.98 icmp_seq=1 Frag needed and DF set (mtu = 0) ^C --- 10.255.0.97 ping statistics --- 1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

To sum up, the MTU issue appears when the bgpd daemon is enabled. Any ideas?

Thanks for your help

jdrouvroy commented 4 years ago

Hi,

Does anyone have an idea about this? We want to put this setup into production, but this problem prevents us from doing so.

Thanks for help ;)

jdrouvroy commented 4 years ago

Hello,

I ran tests with Quagga and it works in my case. Thanks to everyone who helped on this issue.

qlyoung commented 4 years ago

Yeah, still no idea what the problem is here. The only thing we can think of is that your BGP updates might be triggering (kernel) PMTUD, which then adjusts your link MTU. FRR generally produces larger BGP updates than Quagga, as a consequence of improved advertisement efficiency, so that might explain why it works for you under Quagga. I'm not aware of any channels in BGP that could directly (i.e., as a protocol mechanism) affect interface MTU, though.

You could try disabling PMTUD and see if you notice a difference.

https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

ip_no_pmtu_disc - INTEGER

Disable Path MTU Discovery. If enabled in mode 1 and a
fragmentation-required ICMP is received, the PMTU to this
destination will be set to min_pmtu (see below). You will need
to raise min_pmtu to the smallest interface MTU on your system
manually if you want to avoid locally generated fragments.
In mode 2 incoming Path MTU Discovery messages will be
discarded. Outgoing frames are handled the same as in mode 1,
implicitly setting IP_PMTUDISC_DONT on every created socket.
Mode 3 is a hardened pmtu discover mode. The kernel will only
accept fragmentation-needed errors if the underlying protocol
can verify them besides a plain socket lookup. Current
protocols for which pmtu events will be honored are TCP, SCTP
and DCCP as they verify e.g. the sequence number or the
association. This mode should not be enabled globally but is
only intended to secure e.g. name servers in namespaces where
TCP path mtu must still work but path MTU information of other
protocols should be discarded. If enabled globally this mode
could break other protocols.
Possible values: 0-3
Default: FALSE
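As a reading aid for the excerpt above, the modes can be summarised in a small lookup. This is a sketch based only on the quoted documentation. `describe_pmtu_mode` is a hypothetical helper, and the wording for mode 0 assumes that the default (FALSE) means path MTU discovery is enabled.

```shell
#!/bin/sh
# Maps a net.ipv4.ip_no_pmtu_disc value to a short description,
# paraphrasing the kernel documentation quoted above.
describe_pmtu_mode() {
    case "$1" in
        0) echo "PMTU discovery enabled" ;;
        1) echo "disabled: PMTU set to min_pmtu when a fragmentation-required ICMP arrives" ;;
        2) echo "disabled: incoming PMTU discovery messages discarded" ;;
        3) echo "hardened: only verifiable fragmentation-needed errors accepted" ;;
        *) echo "unknown value: $1" ;;
    esac
}

describe_pmtu_mode 1
# On a live host:
# describe_pmtu_mode "$(sysctl -n net.ipv4.ip_no_pmtu_disc)"
```

Note that, going by the quoted text, mode 1 does not ignore fragmentation-required ICMP messages; it reacts to them by dropping the PMTU to min_pmtu. Only mode 2 discards them.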