FWIW, IPv6 packets that have nothing to do with Jool (forwarded or locally originated) will also be impacted. From a practical standpoint, this issue means that an operator cannot co-locate a Jool instance on a host that also serves as a traditional router/firewall, unless the IPv6 network Jool will translate to/from is 100% under the operator's control (so he can ascertain that there are no IPv6 MTUs lower than 1500).
I am trying to deploy Jool to an Ubuntu 16.04.2 VM on Google Compute Engine (GCE). GCE's MTU is 1460 and I am having performance problems related to fragmentation. The command I am trying to run is:
# curl -SL --retry 5 https://github.com/containernetworking/plugins/releases/download/v0.6.0-rc1/cni-plugins-amd64-v0.6.0-rc1.tgz > y
When I use the IPv4 version of the curl command, performance is 1000x faster. This is what I see on the VM when I tcpdump the IPv6 version of the curl command:
<SNIP>
docker-user@k8s-dind:~$ sudo tcpdump -n -i ens4 icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens4, link-type EN10MB (Ethernet), capture size 262144 bytes
23:05:58.451493 IP 10.138.0.2 > 192.30.255.112: ICMP 10.138.0.2 unreachable - need to frag (mtu 1420), length 556
23:05:58.451515 IP 10.138.0.2 > 192.30.255.112: ICMP 10.138.0.2 unreachable - need to frag (mtu 1420), length 556
23:05:58.806472 IP 10.138.0.2 > 192.30.255.112: ICMP 10.138.0.2 unreachable - need to frag (mtu 1420), length 556
23:06:04.749526 IP 10.138.0.2 > 52.216.96.91: ICMP 10.138.0.2 unreachable - need to frag (mtu 1420), length 556
<SNIP>
$ sudo tcpdump -n -i ens4 port 443
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens4, link-type EN10MB (Ethernet), capture size 262144 bytes
23:06:53.101915 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 263445155:263445667, ack 1584654857, win 66, length 512
23:06:53.102024 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 512, win 1402, length 0
23:06:53.165651 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [P.], seq 1536:2560, ack 1, win 66, length 1024
23:06:53.165775 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 512, win 1402, options [nop,nop,sack 1 {1536:2560}], length 0
23:06:53.229433 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 512:1536, ack 1, win 66, length 1024
23:06:53.229493 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 2560:3072, ack 1, win 66, length 512
23:06:53.229566 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 2560, win 1395, length 0
23:06:53.229574 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 3072, win 1391, length 0
23:06:53.293455 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 3072:4608, ack 1, win 66, length 1536
23:06:53.562034 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 3072:3584, ack 1, win 66, length 512
23:06:53.562155 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 3584, win 1402, length 0
23:06:53.625957 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 4608:5632, ack 1, win 66, length 1024
23:06:53.626057 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 3584, win 1402, options [nop,nop,sack 1 {4608:5632}], length 0
23:06:53.689930 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 3584:4608, ack 1, win 66, length 1024
23:06:53.689994 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [P.], seq 5632:6144, ack 1, win 66, length 512
23:06:53.690089 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 5632, win 1395, length 0
23:06:53.690098 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 6144, win 1391, length 0
23:06:53.754089 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 6144:7680, ack 1, win 66, length 1536
23:06:54.022030 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 6144:6656, ack 1, win 66, length 512
23:06:54.022158 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 6656, win 1402, length 0
23:06:54.085834 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 7680:8704, ack 1, win 66, length 1024
23:06:54.085961 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 6656, win 1402, options [nop,nop,sack 1 {7680:8704}], length 0
23:06:54.149588 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 6656:7680, ack 1, win 66, length 1024
23:06:54.149649 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [P.], seq 8704:9216, ack 1, win 66, length 512
23:06:54.149722 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 8704, win 1395, length 0
23:06:54.149729 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 9216, win 1391, length 0
23:06:54.213405 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 9216:10752, ack 1, win 66, length 1536
23:06:54.482063 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 9216:9728, ack 1, win 66, length 512
23:06:54.482197 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 9728, win 1402, length 0
23:06:54.546001 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 10752:11776, ack 1, win 66, length 10
I have tried adding the 1460 MTU to the plateaus list: --mtu-plateaus 65535,32000,17914,8166,4352,2002,1492,1460,1006,508,296,68
I have changed the MTU size of the Docker bridge, the veths and the container's eth0 interface, but nothing is helping.
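For reference, a rough sketch of how those knobs get set (the bridge name comes from later in this thread, the jool invocation assumes the Jool 3.x userspace --mtu-plateaus syntax, and the Docker network name is only an example):
# Add GCE's 1460 value to Jool's path-MTU-discovery plateaus (Jool 3.x syntax assumed):
jool --mtu-plateaus "65535,32000,17914,8166,4352,2002,1492,1460,1006,508,296,68"
# Lower the Docker bridge MTU (bridge name taken from this thread's setup):
sudo ip link set dev br-2d4dca08dbf8 mtu 1460
# New Docker networks can be created with a matching MTU (example network name):
docker network create --opt com.docker.network.driver.mtu=1460 nat64-test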
cc @diverdane @pmichali
When I use the IPv4 version of the curl command, performance is 1000x faster.
I don't think that IP fragmentation should induce this level of catastrophe. Can it?
Quick check: Are you positive that there is no offloading going on? Because, at least on a quick look, this does seem like a typical case of GRO/LRO-induced black-holing.
If you're running Jool in a guest virtual machine, something important to keep in mind is that you might also (or instead) have to disable offloads on the VM host's uplink interface.
This might be relevant since you're running Jool in a virtualized, containerized environment.
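For completeness, disabling the relevant offloads from inside the guest looks roughly like this (ens4 is the interface name used in this thread; inside the container the same would apply to its eth0):
# Inspect the current offload settings:
sudo ethtool -k ens4
# Turn off the offloads that merge many segments into one oversized packet:
sudo ethtool -K ens4 gro off gso off tso off
sudo ethtool -K ens4 lro off   # may report "fixed" (unchangeable) on virtual NICs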
@ydahhrk thank you for your response. I disabled offloads on the Ubuntu VM and in the container according to this doc but still no luck. As you mentioned, offloads should be disabled on the host's uplink ports, but my Ubuntu VM is running on a GCE host so that is not possible.
I have tested hosting the same tarball on an nginx container and fetching it with curl -6 from the test client container. I see a 1/100 difference in transfer speed between a v6 GUA and a v6 synthesized address (sending traffic to Jool). Is that expected?
root@8fb1069ad999:/# curl -6SLO http://[fd00:dead:beef::3]/cni-plugins-amd64-v0.6.0-rc1.tgz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 14.6M 100 14.6M 0 0 916M 0 --:--:-- --:--:-- --:--:-- 977M
root@8fb1069ad999:/# curl -6SLO http://[64:ff9b::172.18.0.4]/cni-plugins-amd64-v0.6.0-rc1.tgz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 14.6M 100 14.6M 0 0 705k 0 0:00:21 0:00:21 --:--:-- 958k
It appears that this fragmentation/performance issue only comes into play when you’re using Jool with only 1 interface for both (translated) IPv4 and (synthesized-IP) IPv6 traffic. When I use Jool with 1 interface, docker pulls are slow due to fragmentation, but when I use Jool with separate interfaces for IPv4 vs. IPv6, docker pulls go at normal speed. I believe that the issue is that when Jool is being used with a single interface, it may be applying the wrong “effective” MTU size for path MTU discovery. For example, if the interface MTU size is 1500, it represents that as its MTU for path MTU discovery… so a sender might send IPv4 packets up to 1480 in payload size (accounting for a 20-octet IPv4 header). However, when a 1480-octet packet gets translated by NAT64 to IPv6, the resulting packet size would be 1520 (1480 payload + 40 octet header), and this would violate the interface MTU, so NAT64 must respond back with a “needs fragmentation”, since it can’t fragment IPv6 packets. When Jool is used with 2 separate interfaces for IPv4 vs. IPv6, this issue is somehow avoided.
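The arithmetic in that theory can be checked directly; the sketch below just spells it out and shows the 20-byte MTU offset a two-interface setup would give the IPv6-facing side (the interface names and the ability to raise an MTU to 1520 are assumptions, not part of the original setup):
# A full-sized IPv4 packet on a 1500-MTU link:
#   1500 total = 20 (IPv4 header) + 1480 (payload)
# After NAT64 translation the same payload becomes:
#   40 (IPv6 header) + 1480 (payload) = 1520  -> too big for a 1500-byte MTU
#
# With two interfaces, the IPv6-facing one can be sized 20 bytes larger,
# provided the NIC/driver allows it (hypothetical interface names):
sudo ip link set dev eth-v4 mtu 1500
sudo ip link set dev eth-v6 mtu 1520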
We are using two interfaces in the lab and still see the issue, but we are wondering if it is what @leblancd is saying, where the MTU calculation is not accounting for the differences in header sizes. If I set the MTU of the host interface and Docker to 9000, so that no fragmenting occurs, it works. Granted, if packets exceed 8920, I suspect it will fail.
Jool continues to send ICMP unreachables even when the interface MTUs are configured properly. The ens4 interface connects to the GCE IPv4 network and br-2d4dca08dbf8 is the Docker bridge interface that goes to my test client container. I set the br-2d4dca08dbf8 MTU to 1480 to account for the 20-byte increase of the IPv6 header.
docker-user@k8s-dind:~$ sudo ip link list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 42:01:0a:8a:00:02 brd ff:ff:ff:ff:ff:ff
17: br-2d4dca08dbf8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:e5:b1:90:46 brd ff:ff:ff:ff:ff:ff
However, Jool continues to send ICMP unreachables to the s3-1-w.amazonaws.com source. I do not see these messages for a working download, but they're continuous for the problematic download through Jool:
18:18:53.332380 IP k8s-dind.c.level-scheme-173421.internal > s3-1-w.amazonaws.com: ICMP k8s-dind.c.level-scheme-173421.internal unreachable - need to frag (mtu 1460), length 556
I see the same icmp errors with or without receive offloads:
$ sudo ethtool --show-offload ens4 | grep receive-offload
generic-receive-offload: off
large-receive-offload: off [fixed]
I can not disable offloads on the host since this is GCE.
I see a 1/100 difference in transfer speed between a v6 GUA and a v6 synthesized address (sending traffic to Jool). Is that expected?
No. Indeed, I think there's something fishy going on here.
I believe that the issue is that when Jool is being used with a single interface, it may be applying the wrong “effective” MTU size for path MTU discovery. For example, if the interface MTU size is 1500, it represents that as its MTU for path MTU discovery… so a sender might send IPv4 packets up to 1480 in payload size (accounting for a 20-octet IPv4 header). However, when a 1480-octet packet gets translated by NAT64 to IPv6, the resulting packet size would be 1520 (1480 payload + 40 octet header), and this would violate the interface MTU, so NAT64 must respond back with a “needs fragmentation”, since it can’t fragment IPv6 packets.
Ok, I'm looking into this.
By the way: if anyone else wants to experiment, compiling Jool with debugging messages enabled makes it print some numbers that can help us figure out what it's trying to do.
For example, if the MTU of Jool's outgoing interface is too small, it prints this:
NAT64 Jool: Packet is too big (len: 1428, mtu: 1300)
When, on the other hand, it's translating an ICMP Fragmentation Needed error, it will print something like this:
NAT64 Jool: Packet MTU: 1300
NAT64 Jool: In dev MTU: 1500
NAT64 Jool: Out dev MTU: 1500
NAT64 Jool: Resulting MTU: 1280
("Packet MTU" is the MTU contained in the incoming ICMP error, "Resulting MTU" should be the number it prints in the resulting ICMP error.)
@danehans: Can you give me these numbers? They should reveal where Jool is messing up.
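For anyone gathering these numbers: they land in the kernel log, so something like the following should collect them (the log prefix is taken from the samples above):
# Follow the kernel ring buffer and keep only Jool's debug lines:
sudo dmesg -wT | grep 'NAT64 Jool'
# Alternatively, on systemd-based distros:
sudo journalctl -kf | grep 'NAT64 Jool'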
but are wondering if it is what @leblancd is saying where the MTU calc is not accounting for the differences in header sizes.
It should be accounting for these sizes. If I recall correctly, we have unit tests specifically intended to enforce this.
@danehans You should try to figure out what size the packet that is causing Jool to emit the ICMPv4 need to frag message is. If it's >1500 it is in all likelihood offloading that is causing your problem.
I can not disable offloads on the host since this is GCE.
That, I suspect, might be the root cause. If your VM gets oversized packets from the GCE host, I don't really see how to make it work.
@danehans You should try to figure out what size the packet that is causing Jool to emit the ICMPv4 need to frag message is. If it's >1500 it is in all likelihood offloading that is causing your problem.
Yep. The first number in the following output will give you a rough estimate of the size of the packet Jool received:
NAT64 Jool: Packet is too big (len: 1428, mtu: 1300)
(It's the size of the outgoing packet, not the incoming one. That's why I say "rough estimate".)
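One quick way to spot such oversized (offload-merged) packets on the wire is a length filter; ens4 and the 1500-byte threshold come from this thread's setup:
# Capture only frames of more than 1500 bytes; on a 1460/1500-MTU link these
# can only be the result of GRO/LRO merging inside the host or NIC:
sudo tcpdump -n -i ens4 'greater 1501'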
I can not disable offloads on the host since this is GCE.
That, I suspect, might be the root cause. If your VM gets oversized packets from the GCE host, I don't really see how to make it work.
+1.
You know what? I think I see how to make it work.
Currently, the following happens in order:
1. Offloading (GRO/LRO) merges several packets into one oversized packet.
2. The kernel reassembles IP fragments.
3. Jool translates whatever comes out of the previous steps.
The reason Jool cannot handle offloads is that they require special treatment, but on some configurations offloaded packets look no different from IP-reassembled fragments. So Jool cannot handle them differently.
So how about creating a module that undoes offloading before the IP reassembly comes into play? This way, Jool will never have to fear mistaking an offload for a fragment.
Granted, it's a vile hack. It's not perfect because it requires an additional module hook, and there's a lot of unnecessary packet mangling along the way, but it's certainly much faster than suffering the black hole and can help when the NIC cannot be tweaked.
(Also, I need to check whether it's actually viable from code. I'm thinking of a particular Netfilter quirk that could get in the way.)
In my host-based config, I disabled GRO (LRO was already disabled), and it appears to be working!
I'll try the suggestions you have regarding Jool in a container, and if that works, we can disable GRO on the VM for GRE and maybe things will work for that case too.
Fingers crossed…
I am unable to compile the jool kernel module according to the provided document. Here are the details.
I am unable to compile the jool kernel module according to the provided document. Here are the details.
Wait. How so? I don't see any errors in the output.
Wait. Actually, I don't see the make install command. (Or make modules_install && depmod.)
@ydahhrk I am getting "module not found" when I try the modprobe step:
docker-user@k8s-dind:~/Jool-3.5.4/mod$ sudo modprobe -r jool
modprobe: FATAL: Module jool not found.
docker-user@k8s-dind:~/Jool-3.5.4/mod$ sudo modprobe -r jool_siit
modprobe: FATAL: Module jool_siit not found.
sudo make modules_install && sudo depmod
;p
@ydahhrk, sudo make modules_install && sudo depmod helped. Thanks.
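For anyone else tripping over this, the complete Jool 3.x module build-and-install sequence is roughly the following (the directory name matches the Jool-3.5.4 tarball from this thread; the pool6 prefix is only an example, as used earlier in the thread):
cd Jool-3.5.4/mod
make                                   # build the jool (NAT64) and jool_siit (SIIT) modules
sudo make modules_install              # install them under /lib/modules/$(uname -r)
sudo depmod                            # rebuild module dependencies so modprobe can find them
sudo modprobe jool pool6=64:ff9b::/96  # load NAT64 (example pool6 prefix)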
2018-11-25 Update
Hello. If you came here from the survey, you'll notice that this thread seems to come out of nowhere. Please read this to get some context.
Progress: There has been no progress on this feature. I'm not even sure if it's possible to fix given the packet API that the kernel exports to us.
Original post
Our solution to #121 sucks.
Plagiarized from one of Tore Anderson's e-mails:
For this issue to be closed, Jool needs to behave as follows:
Because 3.3 doesn't have --minimum-ipv6-mtu, what Jool 3.3 currently does is this: (Jool 3.2 used to do something different, which was also wrong.)
The reasoning is, we're asking the user to set nexthop MTU = --minimum-ipv6-mtu. While this doesn't actually break anything, it introduces needless fragmentation and artificially small MTUs.
nexthop MTU and --minimum-ipv6-mtu need to be separate variables because some packets should be affected by the former but not the latter: packets that exceed --minimum-ipv6-mtu but don't cross any --minimum-ipv6-mtu-MTU'd links along their way.
Thanks to Tore Anderson for reporting this.
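As far as I can tell, "nexthop MTU" here refers to the MTU of the route the translated packet leaves through, which an operator can pin by hand; a purely illustrative example (the gateway, interface and 1280 value are made up):
# Pin the MTU on the IPv6 default route (illustrative values only):
sudo ip -6 route replace default via fe80::1 dev eth0 mtu 1280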