Igalia / snabb

Snabb Switch: Fast open source packet processing
Apache License 2.0

[wip] passthru-interface support #1176

Open mwiget opened 5 years ago

mwiget commented 5 years ago

I'm sharing this brain dump to collect feedback while working on a prototype.

Snabbvmx offered a unique way to pass control traffic to an attached VM (e.g. Juniper vMX), which handles all network control traffic: ARP, IPv6 NDP, BFD, LLDP and routing protocols such as IS-IS, OSPF and BGP. Snabbvmx also offered next-hop resolution for processed packets by periodically sending a copy of a packet to the VM via the VhostUser interface. vmx-docker-lwaftr uses snabbvmx.

lwAFTR has evolved dramatically since snabbvmx was introduced. A single YANG configuration can now describe more than one instance, each serving a network interface, and more and more basic network functions such as ARP and ICMP have been implemented. This allows for alternate deployment options, where multiple interfaces are served as slaves "on a stick" from a master snabb instance. Availability and routing of traffic towards each snabb instance can be handled via an "out-of-band" BGP session between the connecting L2/L3 switch and a route server, e.g. ExaBGP or Junos RR. (See Video of crr-snabb-lwaftr prototype).

[image: crr-snabb-lwaftr]

But this design requires an additional interface for the BGP session between the switch and the BGP process, and something must monitor the health of each lwAFTR instance and update the routing table. Snabbvmx solved this by "fate sharing" of data and control plane thru each interface.

This brings me to the main idea: Pass control traffic (like BGP) thru one or more interfaces of lwAFTR to an auxiliary TAP interface. LwAFTR runs in hairpinning mode.

[image: lwaftr-passthru-example]

These auxiliary interfaces pass traffic thru between the L2/L3 switch and an attached route reflector. The L2/L3 switch, lwAFTR and the route reflector all have unique IPv4 and IPv6 addresses and MAC addresses per link. IMHO lwAFTR doesn't need to know the route reflector's IP address.

Discussion: It is probably sufficient to know its MAC address. It is tempting to ignore the MAC address and simply pass BUM (broadcast, unknown unicast and multicast) traffic thru, together with packets on the same subnet, but that would make it impossible to use loopback IP addresses. I'll most likely stick to telling lwAFTR the MAC address of the connected route reflector via configuration.

Configuration

The YANG softwire-config already has containers for internal-interface and external-interface. This shall be extended with a passthru-interface container. The following example illustrates the idea:

    softwire-config {
      instance {
        device "00:05.0";
        queue {
          id 0;
          external-interface {
            ip 10.10.10.10;
            mac 12:12:12:12:12:12;
            // vlan-tag 42;
            next-hop {
              ip 1.2.3.4;
            }
          }
          internal-interface {
            ip 8:9:a:b:c:d:e:f;
            mac 22:22:22:22:22:22;
            // vlan-tag 64;
            next-hop {
              ip 7:8:9:a:b:c:d:e;
            }
          }
          passthru-interface {
            name "xe0";
            mac 44:44:44:44:44:44;
            // vlan-tag 64;
            // promiscuous false;
            // sample-rate 1000;
          }
        }
      }
    }

There is no need to specify an IP address for the passthru-interface; forwarding decisions towards the TAP interface are made at L2, based on the MAC address and the optional VLAN. The optional leaves promiscuous and sample-rate are an idea for using the same tap interface for monitoring purposes, which merits further investigation.
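
To make the L2 decision concrete, here is a minimal sketch of the match described above, written as a shell function purely for illustration. The real classification would happen inside Snabb; the function names and the "passthru"/"lwaftr" labels are made up for this example, and it assumes an exact unicast match on the configured MAC (no BUM handling).

```shell
#!/bin/sh
# Sketch only: names and labels are hypothetical, not part of Snabb.

# Lower-case a MAC address so comparisons are case-insensitive.
normalize_mac() {
  printf '%s' "$1" | tr 'A-F' 'a-f'
}

# classify_frame <dst-mac> <frame-vlan> <cfg-mac> <cfg-vlan>
# Pass "" for an untagged frame or when no vlan-tag is configured.
# Prints "passthru" or "lwaftr".
classify_frame() {
  dst=$(normalize_mac "$1")
  frame_vlan=$2
  cfg_mac=$(normalize_mac "$3")
  cfg_vlan=$4

  # With a configured vlan-tag, only frames carrying that tag qualify.
  if [ -n "$cfg_vlan" ] && [ "$frame_vlan" != "$cfg_vlan" ]; then
    echo lwaftr
    return 0
  fi

  # Otherwise the destination MAC alone decides.
  if [ "$dst" = "$cfg_mac" ]; then
    echo passthru
  else
    echo lwaftr
  fi
}

classify_frame 44:44:44:44:44:44 "" 44:44:44:44:44:44 ""   # prints passthru
classify_frame 12:12:12:12:12:12 "" 44:44:44:44:44:44 ""   # prints lwaftr
```

Note that no IP address appears anywhere in the decision, which is why the passthru-interface container gets by without an ip leaf.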

Next-hop resolution: snabbvmx uses the nh_fwd app to periodically pass a copy of an egress packet to the connected vMX VM. IMHO it is sufficient to specify the IPv4 and IPv6 next-hop address in the lwAFTR config and have the route reflector interact with Snabb config to learn about the active binding table entries and combine them with the internal- and external-interface IP addresses to create routing entries. Having multiple instances, each connected to a separate interface of the L2/L3 switch, offers redundancy.
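
A rough sketch of the "combine" step, under stated assumptions: the input format (one softwire per line, public IPv4 followed by the br-address) is hypothetical, and so is the choice to point IPv4 host routes at the external-interface IP and br-address routes at the internal-interface IP. How the route reflector would actually query Snabb config is not covered here.

```shell
#!/bin/sh
# build_routes <external-interface-ip> <internal-interface-ip>
# stdin: one softwire per line, "<public-ipv4> <br-address>" (hypothetical format).
# stdout: one route per unique address, "<prefix> via <next-hop>".
build_routes() {
  awk -v ext_nh="$1" -v int_nh="$2" '
    NF >= 2 { v4[$1] = 1; v6[$2] = 1 }   # de-duplicate shared addresses
    END {
      for (a in v4) print a "/32 via " ext_nh
      for (a in v6) print a "/128 via " int_nh
    }
  ' | LC_ALL=C sort
}

# Example with addresses borrowed from the configs in this thread.
printf '%s\n' \
  '193.5.1.100 fc00::100' \
  '193.5.1.101 fc00::100' |
  build_routes 172.20.0.100 2001:db8::100
```

Each lwAFTR instance would yield its own set of routes with its own next hops, which is where the redundancy across switch interfaces comes from.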

mwiget commented 5 years ago

First attempt turned out to be easier than expected: https://github.com/mwiget/snabb/tree/passthru

Running lwaftr on-a-stick over a 10GE physical loopback link to packetblaster (also enhanced with passthru), while running iperf between both tap interfaces, works.

The tap interface xe0 in the network namespace lwaftr, where snabb lwaftr runs, is configured with an IP address. Its MAC address is configured in the passthru-interface section. Note: This can also be the MAC address of a connecting device behind xe0, which is closer to the snabbvmx use case:

$ sudo ip netns exec lwaftr ifconfig
xe0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 9000
        inet 10.9.9.2  netmask 255.255.255.0  broadcast 10.9.9.255
        inet6 fe80::10fe:11ff:fef1:b41d  prefixlen 64  scopeid 0x20<link>
        ether 12:fe:11:f1:b4:1d  txqueuelen 1000  (Ethernet)
        RX packets 50  bytes 5606 (5.6 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 33  bytes 2462 (2.4 KB)
        TX errors 0  dropped 1 overruns 0  carrier 0  collisions 0

The relevant snabb lwaftr configuration section:

  instance {
    device "03:00.0";
    queue {
      id 0;
      external-interface {
        ip 172.20.0.100;
        mac 02:22:22:22:22:22;
        next-hop {
          mac 04:44:44:44:44:44;
        }
      }
      internal-interface {
        ip 2001:db8::100;
        mac 02:22:22:22:22:22;
        next-hop {
          mac 04:44:44:44:44:44;
        }
      }
      passthru-interface {
        device xe0;
        mac 12:fe:11:f1:b4:1d;
        mtu 9000;
      }
    }
  }

Pinging thru the interface while serving 3MPPS IMIX traffic:

$ sudo ip netns exec lwaftr tcpdump -n -e -i xe0
16:44:55.387944 12:fe:11:f1:b4:1d > 42:18:e3:a8:fa:80, ethertype ARP (0x0806), length 42: Request who-has 10.9.9.1 tell 10.9.9.2, length 28
16:44:55.388153 42:18:e3:a8:fa:80 > 12:fe:11:f1:b4:1d, ethertype ARP (0x0806), length 60: Request who-has 10.9.9.2 tell 10.9.9.1, length 46
16:44:55.388160 12:fe:11:f1:b4:1d > 42:18:e3:a8:fa:80, ethertype ARP (0x0806), length 42: Reply 10.9.9.2 is-at 12:fe:11:f1:b4:1d, length 28
16:44:55.388708 42:18:e3:a8:fa:80 > 12:fe:11:f1:b4:1d, ethertype ARP (0x0806), length 60: Reply 10.9.9.1 is-at 42:18:e3:a8:fa:80, length 46
16:44:56.316105 42:18:e3:a8:fa:80 > 12:fe:11:f1:b4:1d, ethertype IPv4 (0x0800), length 98: 10.9.9.1 > 10.9.9.2: ICMP echo request, id 1278, seq 7, length 64
16:44:56.316121 12:fe:11:f1:b4:1d > 42:18:e3:a8:fa:80, ethertype IPv4 (0x0800), length 98: 10.9.9.2 > 10.9.9.1: ICMP echo reply, id 1278, seq 7, length 64
16:44:57.340088 42:18:e3:a8:fa:80 > 12:fe:11:f1:b4:1d, ethertype IPv4 (0x0800), length 98: 10.9.9.1 > 10.9.9.2: ICMP echo request, id 1278, seq 8, length 64
16:44:57.340104 12:fe:11:f1:b4:1d > 42:18:e3:a8:fa:80, ethertype IPv4 (0x0800), length 98: 10.9.9.2 > 10.9.9.1: ICMP echo reply, id 1278, seq 8, length 64
v6+v4: 1.500+1.500 = 2.999946 MPPS, 4.462+3.982 = 8.443490 Gbps, lost 0.000%
v6+v4: 1.500+1.500 = 2.999663 MPPS, 4.461+3.981 = 8.442705 Gbps, lost 0.005%
v6+v4: 1.500+1.500 = 2.999976 MPPS, 4.462+3.982 = 8.443567 Gbps, lost 0.000%
v6+v4: 1.500+1.500 = 3.000023 MPPS, 4.462+3.982 = 8.443695 Gbps, lost 0.000%
v6+v4: 1.500+1.500 = 3.000009 MPPS, 4.462+3.982 = 8.443676 Gbps, lost 0.000%
v6+v4: 1.500+1.500 = 2.999997 MPPS, 4.462+3.982 = 8.443622 Gbps, lost 0.000%
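
As a sanity check on the counters above, the average packet size per direction can be derived from the Gbps and MPPS figures. This is a small sketch with a made-up helper name (avg_bytes); whether packetblaster's Gbps figure includes Ethernet framing overhead is not stated in this thread, so treat the absolute sizes as approximate.

```shell
#!/bin/sh
# avg_bytes <gbps> <mpps>: average bytes per packet implied by the two counters.
avg_bytes() {
  awk -v gbps="$1" -v mpps="$2" \
    'BEGIN { printf "%.1f\n", gbps * 1e9 / (mpps * 1e6) / 8 }'
}

v6=$(avg_bytes 4.462 1.500)
v4=$(avg_bytes 3.982 1.500)
echo "IPv6 side: $v6 bytes/packet"
echo "IPv4 side: $v4 bytes/packet"
awk -v a="$v6" -v b="$v4" 'BEGIN { printf "difference: %.1f bytes\n", a - b }'
```

The difference between the two sides comes out at 40 bytes, the size of a fixed IPv6 header, which is what one would expect when each IPv4 packet is carried inside IPv6.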

Script to create the network namespace and Tap interfaces:

$ cat bench.sh
#!/bin/bash
NS=lwaftr

# Create the namespace only if it does not exist yet.
if ! ip netns list | grep -qw "$NS"; then
   echo "creating netns $NS ..."
   sudo ip netns add "$NS"
fi

ip netns list

echo "creating xe0 in netns $NS and default ..."
# Tap xe0 inside the namespace (lwaftr side) ...
sudo ip netns exec "$NS" ip tuntap add dev xe0 mode tap
sudo ip netns exec "$NS" ifconfig xe0 mtu 9000 10.9.9.2/24 up
# ... and tap xe0 in the default namespace (packetblaster side).
sudo ip tuntap add dev xe0 mode tap
sudo ifconfig xe0 mtu 9000 10.9.9.1/24 up

# Export both MAC addresses so envsubst can see them.
export MACPB=$(ip link show xe0|grep ether|awk '{print $2}')
export MACLW=$(sudo ip netns exec "$NS" ip link show xe0|grep ether|awk '{print $2}')
echo macpb=$MACPB maclw=$MACLW
envsubst < lwaftr.conf.template > lwaftr1.conf

The last command replaces the variable $MACLW in the lwaftr config template with the actual MAC address of the xe0 interface within the namespace.
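
For readers who haven't used envsubst, here is a minimal sketch of the substitution. The template fragment is hypothetical (the real lwaftr.conf.template isn't shown in this thread), and render_template is a helper name invented for the example. Unlike bench.sh, it passes a SHELL-FORMAT argument so that only ${MACLW} is replaced.

```shell
#!/bin/sh
# render_template <template-file> <mac>
# Substitutes only ${MACLW}; any other "$" in the template is left untouched.
render_template() {
  MACLW=$2 envsubst '${MACLW}' < "$1"
}

# Hypothetical template fragment, written to a temporary file.
tmpl=$(mktemp)
cat > "$tmpl" <<'EOF'
passthru-interface {
  device xe0;
  mac ${MACLW};
  mtu 9000;
}
EOF

render_template "$tmpl" 12:fe:11:f1:b4:1d
rm -f "$tmpl"
```

Restricting the substitution is optional; it just avoids surprises if the config ever contains another "$".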

To run lwaftr:

$ cat run-lwaftr1.sh
#!/bin/bash
NS=lwaftr
echo "launching iperf server in netns $NS"
sudo ip netns exec $NS iperf -s &
echo "launching lwaftr in netns $NS ..."
sudo ip netns exec $NS snabb/src/snabb lwaftr run --conf lwaftr1.conf -n lwaftr1

To run packetblaster:

$ cat run-packetblaster.sh
#!/bin/bash
MACPB=$(ip link show xe0|grep ether|awk '{print $2}')
sudo snabb/src/snabb packetblaster lwaftr \
   --src_mac 04:44:44:44:44:44 \
   --dst_mac 02:22:22:22:22:22 \
   --pci 04:00.0 \
   --pass_tap xe0 \
   --pass_mac $MACPB \
   --ipv4 172.20.9.100 \
   --b4 2001:db8::100,193.5.1.100,1024 \
   --aftr fc00::100 \
   --count 63000 \
   --rate 3

Once lwaftr and packetblaster are launched, one can ping xe0 behind lwaftr from the host:

$ ping 10.9.9.2
PING 10.9.9.2 (10.9.9.2) 56(84) bytes of data.
64 bytes from 10.9.9.2: icmp_seq=1 ttl=64 time=0.442 ms
64 bytes from 10.9.9.2: icmp_seq=2 ttl=64 time=0.213 ms
^C
--- 10.9.9.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1005ms
rtt min/avg/max/mdev = 0.213/0.327/0.442/0.115 ms

mwiget commented 5 years ago

Changed from tap to RawSocket, allowing access to veth link endpoints (and Linux network interfaces):

[image: screenshot 2018-10-22 09 01 10]

The vMX is attached via veth link endpoints to the LwAftr passthru interfaces. This not only allows the vMX to ping the lwaftr endpoints, it also carries BFD and BGP sessions thru snabb between the L2L3 switch on the left and the vMX.
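
For anyone wanting to reproduce this, a minimal sketch of creating such veth pairs follows. The Snabb-side names int0 and int1 match the passthru-interface config in this comment; the peer names vmx-ge0 and vmx-ge1 are hypothetical, and how the vMX is wired to the peer end is not covered. The helper prints the commands rather than running them, since they need root.

```shell
#!/bin/sh
# veth_commands <snabb-end> <peer-end> <mtu>
# Prints the commands instead of running them, so they can be reviewed first.
veth_commands() {
  echo "ip link add $1 type veth peer name $2"
  echo "ip link set dev $1 mtu $3 up"
  echo "ip link set dev $2 mtu $3 up"
}

veth_commands int0 vmx-ge0 9000
veth_commands int1 vmx-ge1 9000
```

Once reviewed, the output can be piped to `sudo sh` to actually create the links.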

lab@vmx> ping 172.20.0.100
PING 172.20.0.100 (172.20.0.100): 56 data bytes
64 bytes from 172.20.0.100: icmp_seq=0 ttl=64 time=6.432 ms
^C
--- 172.20.0.100 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max/stddev = 6.432/6.432/6.432/0.000 ms

lab@vmx> ping 172.20.1.101
PING 172.20.1.101 (172.20.1.101): 56 data bytes
64 bytes from 172.20.1.101: icmp_seq=0 ttl=64 time=2.636 ms
64 bytes from 172.20.1.101: icmp_seq=1 ttl=64 time=6.520 ms
64 bytes from 172.20.1.101: icmp_seq=2 ttl=64 time=1.582 ms
^C
--- 172.20.1.101 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.582/3.579/6.520/2.123 ms

lab@vmx> show arp
MAC Address       Address         Name                      Interface               Flags
4a:39:f0:00:58:93 128.0.0.16      fpc0                      em1.0                   none
02:42:70:d7:96:ea 172.17.0.1      172.17.0.1                fxp0.0                  none
02:42:ac:11:00:04 172.17.0.4      172.17.0.4                fxp0.0                  none
cc:2d:e0:d4:2d:aa 172.20.0.1      172.20.0.1                ge-0/0/0.0              none
02:22:22:22:22:00 172.20.0.100    172.20.0.100              ge-0/0/0.0              none
cc:2d:e0:d4:2d:aa 172.20.1.1      172.20.1.1                ge-0/0/1.0              none
02:22:22:22:22:01 172.20.1.101    172.20.1.101              ge-0/0/1.0              none
Total entries: 7

lab@vmx> show bfd session
                                                  Detect   Transmit
Address                  State     Interface      Time     Interval  Multiplier
172.20.0.1               Up        ge-0/0/0.0     1.000     0.200        3
172.20.1.1               Up        ge-0/0/1.0     1.000     0.200        3
2001:db8::1              Up        ge-0/0/0.0     1.000     0.200        3
2001:db8:1::1            Up        ge-0/0/1.0     1.000     0.200        3

4 sessions, 4 clients
Cumulative transmit rate 20.0 pps, cumulative receive rate 20.0 pps

lab@vmx> show bgp summary
Groups: 1 Peers: 4 Down peers: 0
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0
                       6          1          0          0          0          0
inet6.0
                       4          0          0          0          0          0
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
172.20.0.1            65530        316        297       0       6     2:15:33 Establ
  inet.0: 1/3/3/0
172.20.1.1            65530        314        298       0       5     2:15:37 Establ
  inet.0: 0/3/3/0
2001:db8::1           65530        318        298       0       1     2:15:41 Establ
  inet6.0: 0/2/2/0
2001:db8:1::1         65530        316        298       0       1     2:15:45 Establ
  inet6.0: 0/2/

lab@vmx> show ipv6 neighbors
IPv6 Address                  Linklayer Address  State       Exp   Rtr  Secure  Interface
2001:db8::1                   cc:2d:e0:d4:2d:aa  reachable   32    yes  no      ge-0/0/0.0
2001:db8::100                 02:22:22:22:22:00  reachable   27    yes  no      ge-0/0/0.0
2001:db8:1::1                 cc:2d:e0:d4:2d:aa  reachable   25    yes  no      ge-0/0/1.0
2001:db8:1::101               02:22:22:22:22:01  stale       302   yes  no      ge-0/0/1.0
fe80::8cf:3c3d:c32a:6684      88:e9:fe:4f:5c:01  stale       322   no   no      ge-0/0/0.0
fe80::8cf:3c3d:c32a:6684      88:e9:fe:4f:5c:01  stale       359   no   no      ge-0/0/1.0
fe80::2cbe:a6ff:fe1e:b67f     2e:be:a6:1e:b6:7f  stale       458   no   no      ge-0/0/1.0
fe80::3a10:d5ff:fe69:b323     38:10:d5:69:b3:23  stale       1002  yes  no      ge-0/0/0.0
fe80::3a10:d5ff:fe69:b323     38:10:d5:69:b3:23  stale       191   yes  no      ge-0/0/1.0
fe80::4ce6:34ff:feb3:46cf     4e:e6:34:b3:46:cf  stale       67    no   no      ge-0/0/0.0
fe80::ce2d:e0ff:fed4:2daa     cc:2d:e0:d4:2d:aa  stale       298   yes  no      ge-0/0/1.0
fe80::ce2d:e0ff:fed4:2daa     cc:2d:e0:d4:2d:aa  stale       547   yes  no      ge-0/0/0.0
Total entries: 12

The relevant interface config for lwaftr:

  instance {
    device "01:00.0";
    queue {
      id 0;
      external-interface {
        ip 172.20.0.100;
        mac 02:22:22:22:22:00;
        next-hop {
          ip 172.20.0.1;
        }
      }
      internal-interface {
        ip 2001:db8::100;
        mac 02:22:22:22:22:00;
        next-hop {
          ip 2001:db8::1;
        }
      }
      passthru-interface {
        device int0;
        mac 6a:e3:09:2c:06:10;
        mtu 9000;
      }
    }
  }
  instance {
    device "01:00.1";
    queue {
      id 0;
      external-interface {
        ip 172.20.1.101;
        mac 02:22:22:22:22:01;
        next-hop {
          ip 172.20.1.1;
        }
      }
      internal-interface {
        ip 2001:db8:1::101;
        mac 02:22:22:22:22:01;
        next-hop {
          ip 2001:db8:1::1;
        }
      }
      passthru-interface {
        device int1;
        mac 12:36:28:fd:01:1f;
        mtu 9000;
      }
    }
  }

You can see the vMX learning the provisioned IPv4 and IPv6 addresses of the LwAftr interfaces. ICMP pings travel from the vMX thru Snabb and via the L2L3 switch back into Snabb.