NICMx / Jool

SIIT and NAT64 for Linux
GNU General Public License v2.0

Setting the interface MTU with "ip link" isn't a full replacement for a correct implementation of --minMTU6 #136

Closed ydahhrk closed 4 years ago

ydahhrk commented 9 years ago

2018-11-25 Update

Hello. If you came here from the survey, you'll notice that this thread seems to come out of nowhere. Please read this to get some context.

Progress: There has been no progress on this feature. I'm not even sure if it's possible to fix given the packet API that the kernel exports to us.

Original post

Our solution to #121 sucks.

Plagiarized from one of Tore Anderson's e-mails:

The key point is that --minMTU6 functionality is only supposed to apply when DF=0 (Jool 3.2 is buggy in this regard though, cf. #121), while changing the interface MTU would apply to all traffic, regardless of the DF setting.

Consider the use case where you put a Stateful NAT64 in front of an IPv4-only data centre in order to make the IPv4-only web servers available to IPv6 clients on the Internet.

In this case, the IPv6 network is the entire Internet. You will now have to set --minMTU to 1280, because if the IPv4 server doesn't perform/support Path MTU Discovery (i.e., it uses DF=0), that's the only way it can reliably communicate through a stateless translator with IPv6 destinations who are behind links/tunnels with MTU=1280 (and there's certainly a nonzero amount of those on the IPv6 internet). Larger packets would cause a PMTU blackhole because the server would just keep retransmitting the too large packets, and ignore any inbound ICMPv4 FNs [Fragmentation Neededs].

However, that doesn't mean that all outgoing packets need to be restricted to 1280 bytes. If another server is using PMTUD (DF=1), and is sending packets to an IPv6 destination whose PMTU happens to be 1500 bytes (most clients with native IPv6 will have this), then the large packets will make it there just fine - there will be no ICMPv4 FN/ICMPv6 PTB because the PMTU is large enough.

However, if the PMTUD-using server is sending packets to an IPv6 destination whose PMTU happens to be 1280 (using a 6in4 tunnel, for example), then PMTUD works across the translator - an IPv6 router will return a PTB [Packet too Big], which will be translated to an ICMPv4 FN, which will reach the server, which will then reduce its packet size to match the PMTU.

In summary, when you change the IPv6 interface MTU of the Jool server to 1280, you're essentially forcing the server's IPv4 PMTUD to result in 1260 for every single destination, even though the actual PMTU might very well be much higher. So it's not an ideal replacement for the [correctly implemented, i.e., only applying to packets with DF=0] --minMTU setting.

For this issue to be closed, Jool needs to behave as follows:

if incoming (IPv4) packet's DF=0,
    if outgoing (IPv6) packet's length > --minimum-ipv6-mtu,
        fragment the IPv6 packet so that every fragment's length is --minimum-ipv6-mtu or less.
        forward every fragment.
    else
        forward the IPv6 packet.
else
    if outgoing (IPv6) packet's length > nexthop MTU,
        reply ICMPv4 error "Fragmentation Needed"
    else
        forward the IPv6 packet.

Because 3.3 doesn't have --minimum-ipv6-mtu, what Jool 3.3 currently does is this:

if incoming (IPv4) packet's DF=0,
    forward the IPv6 packet.
    (linux might fragment according to the nexthop MTU.)
else
    if outgoing (IPv6) packet's length > nexthop MTU,
        reply ICMPv4 error "Fragmentation Needed"
    else
        forward the IPv6 packet.
        (linux does not fragment.)

(Jool 3.2 used to do something different, which was also wrong.)

The reasoning is that we're asking the user to set nexthop MTU = --minimum-ipv6-mtu. While this doesn't actually break anything, it introduces needless fragmentation and artificially small MTUs. nexthop MTU and --minimum-ipv6-mtu need to be separate variables because some packets should be affected by the former but not the latter.
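For reference, the interface-MTU workaround this issue's title refers to looks roughly like this ("eth1" is just a placeholder for whatever interface Jool transmits IPv6 through):

# Clamp the IPv6-facing interface so the kernel fragments DF=0 traffic into 1280-byte pieces.
# Note this also caps DF=1 and native IPv6 traffic, which is exactly the problem described above.
sudo ip link set dev eth1 mtu 1280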

Thanks to Tore Anderson for reporting this.

toreanderson commented 9 years ago

FWIW, IPv6 packets that have nothing to do with Jool (forwarded or locally originated) will also be impacted. From a practical standpoint, this issue means that an operator cannot co-locate a Jool instance on a host that also serves as a traditional router/firewall, unless the IPv6 network Jool will translate to/from is 100% under the operator's control (so he can ascertain that there are no IPv6 MTUs lower than 1500).

danehans commented 7 years ago

I am trying to deploy Jool to an Ubuntu 16.04.2 VM on Google Compute Engine (GCE). GCE's MTU is 1460 and I am having performance problems related to fragmentation. I am trying to run # curl -SL --retry 5 https://github.com/containernetworking/plugins/releases/download/v0.6.0-rc1/cni-plugins-amd64-v0.6.0-rc1.tgz > y.

When I use the IPv4 version of the curl command, performance is 1000x faster. This is what I see on the VM when I tcpdump the IPv6 version of the curl command:

<SNIP>
docker-user@k8s-dind:~$ sudo tcpdump -n -i ens4 icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens4, link-type EN10MB (Ethernet), capture size 262144 bytes
23:05:58.451493 IP 10.138.0.2 > 192.30.255.112: ICMP 10.138.0.2 unreachable - need to frag (mtu 1420), length 556
23:05:58.451515 IP 10.138.0.2 > 192.30.255.112: ICMP 10.138.0.2 unreachable - need to frag (mtu 1420), length 556
23:05:58.806472 IP 10.138.0.2 > 192.30.255.112: ICMP 10.138.0.2 unreachable - need to frag (mtu 1420), length 556
23:06:04.749526 IP 10.138.0.2 > 52.216.96.91: ICMP 10.138.0.2 unreachable - need to frag (mtu 1420), length 556
<SNIP>

$ sudo tcpdump -n -i ens4 port 443
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens4, link-type EN10MB (Ethernet), capture size 262144 bytes
23:06:53.101915 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 263445155:263445667, ack 1584654857, win 66, length 512
23:06:53.102024 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 512, win 1402, length 0
23:06:53.165651 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [P.], seq 1536:2560, ack 1, win 66, length 1024
23:06:53.165775 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 512, win 1402, options [nop,nop,sack 1 {1536:2560}], length 0
23:06:53.229433 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 512:1536, ack 1, win 66, length 1024
23:06:53.229493 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 2560:3072, ack 1, win 66, length 512
23:06:53.229566 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 2560, win 1395, length 0
23:06:53.229574 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 3072, win 1391, length 0
23:06:53.293455 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 3072:4608, ack 1, win 66, length 1536
23:06:53.562034 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 3072:3584, ack 1, win 66, length 512
23:06:53.562155 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 3584, win 1402, length 0
23:06:53.625957 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 4608:5632, ack 1, win 66, length 1024
23:06:53.626057 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 3584, win 1402, options [nop,nop,sack 1 {4608:5632}], length 0
23:06:53.689930 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 3584:4608, ack 1, win 66, length 1024
23:06:53.689994 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [P.], seq 5632:6144, ack 1, win 66, length 512
23:06:53.690089 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 5632, win 1395, length 0
23:06:53.690098 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 6144, win 1391, length 0
23:06:53.754089 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 6144:7680, ack 1, win 66, length 1536
23:06:54.022030 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 6144:6656, ack 1, win 66, length 512
23:06:54.022158 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 6656, win 1402, length 0
23:06:54.085834 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 7680:8704, ack 1, win 66, length 1024
23:06:54.085961 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 6656, win 1402, options [nop,nop,sack 1 {7680:8704}], length 0
23:06:54.149588 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 6656:7680, ack 1, win 66, length 1024
23:06:54.149649 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [P.], seq 8704:9216, ack 1, win 66, length 512
23:06:54.149722 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 8704, win 1395, length 0
23:06:54.149729 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 9216, win 1391, length 0
23:06:54.213405 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 9216:10752, ack 1, win 66, length 1536
23:06:54.482063 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 9216:9728, ack 1, win 66, length 512
23:06:54.482197 IP 10.138.0.2.7203 > 52.216.96.91.443: Flags [.], ack 9728, win 1402, length 0
23:06:54.546001 IP 52.216.96.91.443 > 10.138.0.2.7203: Flags [.], seq 10752:11776, ack 1, win 66, length 10

I have tried adding the 1460 MTU to the plateaus list: --mtu-plateaus 65535,32000,17914,8166,4352,2002,1492,1460,1006,508,296,68

I have changed the mtu size of the Docker bridge, veth's and the container's eth0 interface but nothing is helping.

cc @diverdane @pmichali

ydahhrk commented 7 years ago

When I use the IPv4 version of the curl command, performance is 1000x faster.

I don't think that IP fragmentation should induce this level of catastrophe. Can it?

Quick check: Are you positive that there is no offloading going on? Because, at least on a quick look, this does seem like a typical case of GRO/LRO-induced black-holing.

If you're running Jool in a guest virtual machine, something important to keep in mind is that you might instead (or also) have to disable offloads on the VM host's uplink interface.

This might be relevant since you're running Jool in a containerized environment.
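For reference, a minimal way to check (and, where the NIC allows it, disable) those offloads is ethtool; assuming ens4 is the interface Jool receives traffic on, as in your captures:

# Show the current GRO/LRO state.
sudo ethtool --show-offload ens4 | grep -E 'generic-receive-offload|large-receive-offload'
# Turn both off so the kernel hands Jool wire-sized packets.
sudo ethtool -K ens4 gro off lro off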

danehans commented 7 years ago

@ydahhrk thank you for your response. I disabled offloads on the Ubuntu VM and in the container according to this doc but still no luck. As you mentioned, offloads should be disabled on the host's uplink ports, but my Ubuntu VM is running on a GCE host so that is not possible.

I have tested hosting the same tarball on a nginx container and curl -6 the tarball from the test client container. I see a 1/100 difference in transfer speed between a v6 GUA and a v6 synthesized address (sending traffic to Jool). Is that expected?


root@8fb1069ad999:/# curl -6SLO http://[fd00:dead:beef::3]/cni-plugins-amd64-v0.6.0-rc1.tgz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 14.6M  100 14.6M    0     0   916M      0 --:--:-- --:--:-- --:--:--  977M

root@8fb1069ad999:/# curl -6SLO http://[64:ff9b::172.18.0.4]/cni-plugins-amd64-v0.6.0-rc1.tgz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 14.6M  100 14.6M    0     0   705k      0  0:00:21  0:00:21 --:--:--  958k
leblancd commented 7 years ago

It appears that this fragmentation/performance issue only comes into play when you're using Jool with only one interface for both (translated) IPv4 and (synthesized-IP) IPv6 traffic. When I use Jool with one interface, docker pulls are slow due to fragmentation, but when I use Jool with separate interfaces for IPv4 vs. IPv6, docker pulls go at normal speed.

I believe that the issue is that when Jool is being used with a single interface, it may be applying the wrong "effective" MTU size for path MTU discovery. For example, if the interface MTU size is 1500, it advertises that as its MTU for path MTU discovery, so a sender might send IPv4 packets with up to 1480 octets of payload (accounting for a 20-octet IPv4 header). However, when such a 1480-octet-payload packet gets translated by NAT64 to IPv6, the resulting packet size would be 1520 (1480 payload + 40-octet header). This violates the interface MTU, so NAT64 must respond with a "needs fragmentation", since it can't fragment IPv6 packets. When Jool is used with two separate interfaces for IPv4 vs. IPv6, this issue is somehow avoided.

pmichali commented 7 years ago

We are using two interfaces in the lab and see the issue, but are wondering if it is what @leblancd is saying, i.e. that the MTU calculation is not accounting for the difference in header sizes. If I set the MTU of the host interface and Docker to 9000, no fragmenting occurs and it works. Granted, if packets exceed 8920, I suspect it will fail.

danehans commented 7 years ago

Jool continues to send ICMP unreachables even when the interface MTUs are configured properly. The ens4 interface connects to the GCE IPv4 network and br-2d4dca08dbf8 is the Docker bridge interface that goes to my test client container. I set the br-2d4dca08dbf8 MTU to 1480 to account for the 20-byte size increase of the IPv6 header.

docker-user@k8s-dind:~$ sudo ip link list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 42:01:0a:8a:00:02 brd ff:ff:ff:ff:ff:ff
17: br-2d4dca08dbf8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc noqueue state UP mode DEFAULT group default 
    link/ether 02:42:e5:b1:90:46 brd ff:ff:ff:ff:ff:ff

However, Jool continues to send ICMP unreachables to the s3-1-w.amazonaws.com source. I do not see these messages for a working download, but they're continuous for the problematic download through Jool:

18:18:53.332380 IP k8s-dind.c.level-scheme-173421.internal > s3-1-w.amazonaws.com: ICMP k8s-dind.c.level-scheme-173421.internal unreachable - need to frag (mtu 1460), length 556

I see the same icmp errors with or without receive offloads:

$ sudo ethtool --show-offload ens4 | grep receive-offload
generic-receive-offload: off
large-receive-offload: off [fixed]

I can not disable offloads on the host since this is GCE.

ydahhrk commented 7 years ago

I see a 1/100 difference in transfer speed between a v6 GUA and a v6 synthesized address (sending traffic to Jool). Is that expected?

No. Indeed, I think there's something fishy going on here.

I believe that the issue is that when Jool is being used with a single interface, it may be applying the wrong “effective” MTU size for path MTU discovery. For example, if the interface MTU size is 1500, it represents that as its MTU for path MTU discovery… so a sender might send IPv4 packets up to 1480 in payload size (accounting for a 20-octet IPv4 header). However, when a 1480-octet packet gets translated by NAT64 to IPv6, the resulting packet size would be 1520 (1480 payload + 40 octet header), and this would violate the interface MTU, so NAT64 must respond back with a “needs fragmentation”, since it can’t fragment IPv6 packets.

Ok, I'm looking into this.

By the way: if anyone else wants to experiment, compiling Jool with debugging messages enabled will make it print some numbers that can help us figure out what it's trying to do.

For example, if the outgoing packet is too big for Jool's attached interface's MTU, it prints this:

NAT64 Jool: Packet is too big (len: 1428, mtu: 1300)

On the other hand, while it's translating an ICMP Fragmentation Needed error, it will print something like this:

NAT64 Jool: Packet MTU: 1300
NAT64 Jool: In dev MTU: 1500
NAT64 Jool: Out dev MTU: 1500
NAT64 Jool: Resulting MTU: 1280

("Packet MTU" is the MTU contained in the incoming ICMP error, "Resulting MTU" should be the number it prints in the resulting ICMP error.)

@danehans: Can you give me these numbers? They should reveal where Jool is messing up.

but are wondering if it is what @leblancd is saying where the MTU calc is not accounting for the differences in header sizes.

It should be accounting for these sizes. If I recall correctly, we have unit tests specifically intended to enforce this.

toreanderson commented 7 years ago

@danehans You should try to figure out what size the packet that is causing Jool to emit the ICMPv4 need to frag message is. If it's >1500 it is in all likelihood offloading that is causing your problem.
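A quick way to check, assuming the capture interface is ens4 as in the earlier traces:

# "greater N" matches packets whose length is >= N, so this shows anything larger than 1500 bytes.
# Any hits on a 1460/1500-MTU link strongly suggest GRO/LRO aggregation.
sudo tcpdump -n -i ens4 greater 1501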

I can not disable offloads on the host since this is GCE.

That, I suspect, might be the root cause. If your VM gets oversized packets from the GCE host, I don't really see how to make it work.

ydahhrk commented 7 years ago

@danehans You should try to figure out what size the packet that is causing Jool to emit the ICMPv4 need to frag message is. If it's >1500 it is in all likelihood offloading that is causing your problem.

Yep. The first number in the following output will give you a rough estimate of the size of the packet Jool received:

NAT64 Jool: Packet is too big (len: 1428, mtu: 1300)

(It's the size of the outgoing packet, not the incoming one. That's why I say "rough estimate".)

I can not disable offloads on the host since this is GCE.

That, I suspect, might be the root cause. If your VM gets oversized packets from the GCE host, I don't really see how to make it work.

+1.

You know what? I think I see how to make it work.

Currently, the following happens in order:

  1. NIC offloads packet.
  2. defrag reassembles IP fragments.
  3. Jool translates.

The reason Jool cannot handle offloads is that they require special treatment, but on some configurations offloaded packets look no different from IP-reassembled fragments, so Jool cannot tell them apart and handle them differently.

So how about creating a module that undoes offloading before the IP reassembly comes into play? This way, Jool will never have to fear mistaking an offload for a fragment.

  1. NIC offloads packet.
  2. Special Netfilter module (that kicks in before defrag) undoes the offload.
  3. defrag reassembles IP fragments.
  4. Jool translates.

Granted, it's a vile hack. It's not perfect because it requires an additional module hook, and there's a lot of unnecessary packet mangling along the way, but it's certainly much faster than suffering the black hole and can help when the NIC cannot be tweaked.

(Also, I need to check whether it's actually viable from code. I'm thinking of a particular Netfilter quirk that could get in the way.)

pmichali commented 7 years ago

In my host-based config, I disabled GRO (LRO was already disabled), and it appears to be working!

I’ll try the suggestions you have regarding Jool in container, and if that works, we can disable GRO on the VM for GRE and maybe things will work for that case.

Fingers crossed…

danehans commented 7 years ago

I am unable to compile the jool kernel module according to the provided document. Here are the details.

ydahhrk commented 7 years ago

I am unable to compile the jool kernel module according to the provided document. Here are the details.

Wait. How so? I don't see any errors in the output.

ydahhrk commented 7 years ago

Wait. Actually, I don't see the make install command. (Or make modules_install && depmod)

danehans commented 7 years ago

@ydahhrk I am getting "module not found" when I try the modprobe step:

docker-user@k8s-dind:~/Jool-3.5.4/mod$ sudo modprobe -r jool
modprobe: FATAL: Module jool not found.
docker-user@k8s-dind:~/Jool-3.5.4/mod$ sudo modprobe -r jool_siit
modprobe: FATAL: Module jool_siit not found.
ydahhrk commented 7 years ago

sudo make modules_install && sudo depmod ;p
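That is, roughly the full sequence, assuming the 3.5.4 tarball layout from your prompt:

cd ~/Jool-3.5.4/mod
make
sudo make modules_install
sudo depmod
sudo modprobe jool    # or jool_siit, plus whatever module arguments your setup needs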

danehans commented 7 years ago

@ydahhrk $ sudo make modules_install && sudo depmod helped. Thanks.