Open mwiget opened 5 years ago
First attempt turned out to be easier than expected: https://github.com/mwiget/snabb/tree/passthru
Running lwaftr on-a-stick over a 10GE physical loopback link to packetblaster (also enhanced with passthru) works, with iperf running between both tap interfaces at the same time.
The tap interface xe0 in the network namespace lwaftr, where snabb lwaftr will run, is configured with an IP address. Its MAC address is configured in the passthru-interface section. Note: This can also be the MAC address of a connecting device behind xe0, which is closer to the snabbvmx use case:
$ sudo ip netns exec lwaftr ifconfig
xe0: flags=4099<UP,BROADCAST,MULTICAST> mtu 9000
        inet 10.9.9.2 netmask 255.255.255.0 broadcast 10.9.9.255
        inet6 fe80::10fe:11ff:fef1:b41d prefixlen 64 scopeid 0x20<link>
        ether 12:fe:11:f1:b4:1d txqueuelen 1000 (Ethernet)
        RX packets 50 bytes 5606 (5.6 KB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 33 bytes 2462 (2.4 KB)
        TX errors 0 dropped 1 overruns 0 carrier 0 collisions 0
The relevant snabb lwaftr configuration section:
instance {
  device "03:00.0";
  queue {
    id 0;
    external-interface {
      ip 172.20.0.100;
      mac 02:22:22:22:22:22;
      next-hop {
        mac 04:44:44:44:44:44;
      }
    }
    internal-interface {
      ip 2001:db8::100;
      mac 02:22:22:22:22:22;
      next-hop {
        mac 04:44:44:44:44:44;
      }
    }
    passthru-interface {
      device xe0;
      mac 12:fe:11:f1:b4:1d;
      mtu 9000;
    }
  }
}
Pinging thru the interface while serving 3MPPS IMIX traffic:
$ sudo ip netns exec lwaftr tcpdump -n -e -i xe0
16:44:55.387944 12:fe:11:f1:b4:1d > 42:18:e3:a8:fa:80, ethertype ARP (0x0806), length 42: Request who-has 10.9.9.1 tell 10.9.9.2, length 28
16:44:55.388153 42:18:e3:a8:fa:80 > 12:fe:11:f1:b4:1d, ethertype ARP (0x0806), length 60: Request who-has 10.9.9.2 tell 10.9.9.1, length 46
16:44:55.388160 12:fe:11:f1:b4:1d > 42:18:e3:a8:fa:80, ethertype ARP (0x0806), length 42: Reply 10.9.9.2 is-at 12:fe:11:f1:b4:1d, length 28
16:44:55.388708 42:18:e3:a8:fa:80 > 12:fe:11:f1:b4:1d, ethertype ARP (0x0806), length 60: Reply 10.9.9.1 is-at 42:18:e3:a8:fa:80, length 46
16:44:56.316105 42:18:e3:a8:fa:80 > 12:fe:11:f1:b4:1d, ethertype IPv4 (0x0800), length 98: 10.9.9.1 > 10.9.9.2: ICMP echo request, id 1278, seq 7, length 64
16:44:56.316121 12:fe:11:f1:b4:1d > 42:18:e3:a8:fa:80, ethertype IPv4 (0x0800), length 98: 10.9.9.2 > 10.9.9.1: ICMP echo reply, id 1278, seq 7, length 64
16:44:57.340088 42:18:e3:a8:fa:80 > 12:fe:11:f1:b4:1d, ethertype IPv4 (0x0800), length 98: 10.9.9.1 > 10.9.9.2: ICMP echo request, id 1278, seq 8, length 64
16:44:57.340104 12:fe:11:f1:b4:1d > 42:18:e3:a8:fa:80, ethertype IPv4 (0x0800), length 98: 10.9.9.2 > 10.9.9.1: ICMP echo reply, id 1278, seq 8, length 64
v6+v4: 1.500+1.500 = 2.999946 MPPS, 4.462+3.982 = 8.443490 Gbps, lost 0.000%
v6+v4: 1.500+1.500 = 2.999663 MPPS, 4.461+3.981 = 8.442705 Gbps, lost 0.005%
v6+v4: 1.500+1.500 = 2.999976 MPPS, 4.462+3.982 = 8.443567 Gbps, lost 0.000%
v6+v4: 1.500+1.500 = 3.000023 MPPS, 4.462+3.982 = 8.443695 Gbps, lost 0.000%
v6+v4: 1.500+1.500 = 3.000009 MPPS, 4.462+3.982 = 8.443676 Gbps, lost 0.000%
v6+v4: 1.500+1.500 = 2.999997 MPPS, 4.462+3.982 = 8.443622 Gbps, lost 0.000%
Script to create the network namespace and tap interfaces:
$ cat bench.sh
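#!/bin/bash
# Create the lwaftr network namespace (if missing) and a tap interface xe0
# both inside that namespace and in the default namespace.
NS=lwaftr
if ! ip netns list | grep -qw "$NS"; then
  echo "creating netns $NS ..."
  sudo ip netns add "$NS"
fi
ip netns list
echo "creating xe0 in netns $NS and default ..."
# xe0 inside the namespace: the end used by snabb lwaftr
sudo ip netns exec "$NS" ip tuntap add dev xe0 mode tap
sudo ip netns exec "$NS" ifconfig xe0 mtu 9000 10.9.9.2/24 up
# xe0 in the default namespace: the end used by packetblaster
sudo ip tuntap add dev xe0 mode tap
sudo ifconfig xe0 mtu 9000 10.9.9.1/24 up
# Export both MAC addresses so that envsubst can see them
export MACPB=$(ip link show xe0 | awk '/ether/ {print $2}')
export MACLW=$(sudo ip netns exec "$NS" ip link show xe0 | awk '/ether/ {print $2}')
echo "macpb=$MACPB maclw=$MACLW"
envsubst < lwaftr.conf.template > lwaftr1.conf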
The last command replaces the variable $MACLW in the lwaftr config template with the actual MAC address of the xe0 interface within the namespace.
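As an illustration of that substitution step, the sketch below renders a small template fragment. The fragment is hypothetical (the actual lwaftr.conf.template is not shown in this thread), and the render_template helper is an assumption added here. Passing '$MACLW' to envsubst limits substitution to that one variable; the sed branch is only a fallback for systems where envsubst is not installed.

```shell
#!/bin/bash
# Hypothetical template fragment; the real lwaftr.conf.template is not shown here.
template='passthru-interface {
  device xe0;
  mac $MACLW;
  mtu 9000;
}'

# Substitute only $MACLW, leaving any other "$..." text in the template untouched.
render_template() {
  if command -v envsubst >/dev/null 2>&1; then
    envsubst '$MACLW'
  else
    # Fallback (assumption): plain sed replacement when envsubst is missing.
    sed 's|\$MACLW|'"$MACLW"'|g'
  fi
}

# Example value taken from the ifconfig output above.
export MACLW=12:fe:11:f1:b4:1d
rendered=$(printf '%s\n' "$template" | render_template)
printf '%s\n' "$rendered"
```

Restricting envsubst to named variables is a defensive choice: without the argument, every exported variable referenced in the template would be replaced, including ones that happen to be unset.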
To run lwaftr:
$ cat run-lwaftr1.sh
#!/bin/bash
NS=lwaftr
echo "launching iperf server in netns $NS"
sudo ip netns exec $NS iperf -s &
echo "launching lwaftr in netns $NS ..."
sudo ip netns exec $NS snabb/src/snabb lwaftr run --conf lwaftr1.conf -n lwaftr1
To run packetblaster:
$ cat run-packetblaster.sh
#!/bin/bash
MACPB=$(ip link show xe0|grep ether|awk '{print $2}')
sudo snabb/src/snabb packetblaster lwaftr \
  --src_mac 04:44:44:44:44:44 \
  --dst_mac 02:22:22:22:22:22 \
  --pci 04:00.0 \
  --pass_tap xe0 \
  --pass_mac "$MACPB" \
  --ipv4 172.20.9.100 \
  --b4 2001:db8::100,193.5.1.100,1024 \
  --aftr fc00::100 \
  --count 63000 \
  --rate 3
Once lwaftr and packetblaster are launched, one can ping xe0 behind lwaftr from the host:
ping 10.9.9.2
PING 10.9.9.2 (10.9.9.2) 56(84) bytes of data.
64 bytes from 10.9.9.2: icmp_seq=1 ttl=64 time=0.442 ms
64 bytes from 10.9.9.2: icmp_seq=2 ttl=64 time=0.213 ms
^C
--- 10.9.9.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1005ms
rtt min/avg/max/mdev = 0.213/0.327/0.442/0.115 ms
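When the test is finished, the namespace and both tap interfaces created by bench.sh can be removed again. The following is a minimal sketch and not part of the original scripts: the run helper and the DRY_RUN switch are assumptions added for illustration, and by default the script only prints the commands it would execute.

```shell
#!/bin/bash
# Teardown sketch for the bench.sh setup (hypothetical companion script).
NS=lwaftr
DRY_RUN=${DRY_RUN:-1}   # 1 = only print commands; set to 0 to really run them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

teardown() {
  # Remove the tap inside the namespace, then the one in the default namespace.
  run sudo ip netns exec "$NS" ip tuntap del dev xe0 mode tap
  run sudo ip tuntap del dev xe0 mode tap
  # Finally remove the namespace itself.
  run sudo ip netns del "$NS"
}

teardown
```

Running it with DRY_RUN=0 executes the three commands instead of printing them.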
Changed from tap to RawSocket, allowing access to veth link endpoints (and Linux network interfaces):
The vMX is attached via veth link endpoints to the lwAFTR passthru interfaces. This not only lets the vMX ping the lwAFTR endpoints, it also passes BFD and BGP sessions thru Snabb between the L2/L3 switch on the left and the vMX.
lab@vmx> ping 172.20.0.100
PING 172.20.0.100 (172.20.0.100): 56 data bytes
64 bytes from 172.20.0.100: icmp_seq=0 ttl=64 time=6.432 ms
^C
--- 172.20.0.100 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max/stddev = 6.432/6.432/6.432/0.000 ms
lab@vmx> ping 172.20.1.101
PING 172.20.1.101 (172.20.1.101): 56 data bytes
64 bytes from 172.20.1.101: icmp_seq=0 ttl=64 time=2.636 ms
64 bytes from 172.20.1.101: icmp_seq=1 ttl=64 time=6.520 ms
64 bytes from 172.20.1.101: icmp_seq=2 ttl=64 time=1.582 ms
^C
--- 172.20.1.101 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.582/3.579/6.520/2.123 ms
lab@vmx> show arp
MAC Address       Address         Name            Interface     Flags
4a:39:f0:00:58:93 128.0.0.16      fpc0            em1.0         none
02:42:70:d7:96:ea 172.17.0.1      172.17.0.1      fxp0.0        none
02:42:ac:11:00:04 172.17.0.4      172.17.0.4      fxp0.0        none
cc:2d:e0:d4:2d:aa 172.20.0.1      172.20.0.1      ge-0/0/0.0    none
02:22:22:22:22:00 172.20.0.100    172.20.0.100    ge-0/0/0.0    none
cc:2d:e0:d4:2d:aa 172.20.1.1      172.20.1.1      ge-0/0/1.0    none
02:22:22:22:22:01 172.20.1.101    172.20.1.101    ge-0/0/1.0    none
Total entries: 7
lab@vmx> show bfd session
                                       Detect   Transmit
Address         State  Interface       Time     Interval  Multiplier
172.20.0.1      Up     ge-0/0/0.0      1.000    0.200     3
172.20.1.1      Up     ge-0/0/1.0      1.000    0.200     3
2001:db8::1     Up     ge-0/0/0.0      1.000    0.200     3
2001:db8:1::1   Up     ge-0/0/1.0      1.000    0.200     3
4 sessions, 4 clients
Cumulative transmit rate 20.0 pps, cumulative receive rate 20.0 pps
lab@vmx> show bgp summary
Groups: 1 Peers: 4 Down peers: 0
Table Tot Paths Act Paths Suppressed History Damp State Pending
inet.0
6 1 0 0 0 0
inet6.0
4 0 0 0 0 0
Peer AS InPkt OutPkt OutQ Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
172.20.0.1 65530 316 297 0 6 2:15:33 Establ
inet.0: 1/3/3/0
172.20.1.1 65530 314 298 0 5 2:15:37 Establ
inet.0: 0/3/3/0
2001:db8::1 65530 318 298 0 1 2:15:41 Establ
inet6.0: 0/2/2/0
2001:db8:1::1 65530 316 298 0 1 2:15:45 Establ
inet6.0: 0/2/
lab@vmx> show ipv6 neighbors
IPv6 Address               Linklayer Address  State      Exp   Rtr  Secure  Interface
2001:db8::1                cc:2d:e0:d4:2d:aa  reachable  32    yes  no      ge-0/0/0.0
2001:db8::100              02:22:22:22:22:00  reachable  27    yes  no      ge-0/0/0.0
2001:db8:1::1              cc:2d:e0:d4:2d:aa  reachable  25    yes  no      ge-0/0/1.0
2001:db8:1::101            02:22:22:22:22:01  stale      302   yes  no      ge-0/0/1.0
fe80::8cf:3c3d:c32a:6684   88:e9:fe:4f:5c:01  stale      322   no   no      ge-0/0/0.0
fe80::8cf:3c3d:c32a:6684   88:e9:fe:4f:5c:01  stale      359   no   no      ge-0/0/1.0
fe80::2cbe:a6ff:fe1e:b67f  2e:be:a6:1e:b6:7f  stale      458   no   no      ge-0/0/1.0
fe80::3a10:d5ff:fe69:b323  38:10:d5:69:b3:23  stale      1002  yes  no      ge-0/0/0.0
fe80::3a10:d5ff:fe69:b323  38:10:d5:69:b3:23  stale      191   yes  no      ge-0/0/1.0
fe80::4ce6:34ff:feb3:46cf  4e:e6:34:b3:46:cf  stale      67    no   no      ge-0/0/0.0
fe80::ce2d:e0ff:fed4:2daa  cc:2d:e0:d4:2d:aa  stale      298   yes  no      ge-0/0/1.0
fe80::ce2d:e0ff:fed4:2daa  cc:2d:e0:d4:2d:aa  stale      547   yes  no      ge-0/0/0.0
Total entries: 12
The relevant interface config for lwaftr:
instance {
  device "01:00.0";
  queue {
    id 0;
    external-interface {
      ip 172.20.0.100;
      mac 02:22:22:22:22:00;
      next-hop {
        ip 172.20.0.1;
      }
    }
    internal-interface {
      ip 2001:db8::100;
      mac 02:22:22:22:22:00;
      next-hop {
        ip 2001:db8::1;
      }
    }
    passthru-interface {
      device int0;
      mac 6a:e3:09:2c:06:10;
      mtu 9000;
    }
  }
}
instance {
  device "01:00.1";
  queue {
    id 0;
    external-interface {
      ip 172.20.1.101;
      mac 02:22:22:22:22:01;
      next-hop {
        ip 172.20.1.1;
      }
    }
    internal-interface {
      ip 2001:db8:1::101;
      mac 02:22:22:22:22:01;
      next-hop {
        ip 2001:db8:1::1;
      }
    }
    passthru-interface {
      device int1;
      mac 12:36:28:fd:01:1f;
      mtu 9000;
    }
  }
}
You can see the vMX learning the provisioned IPv4 and IPv6 LwAftr interfaces. ICMP pings travel from vMX thru Snabb and via the L2L3 switch back into Snabb.
I'm sharing this brain dump to collect feedback while working on a prototype.
Snabbvmx offered a unique way to pass control traffic to an attached VM (e.g. Juniper vMX), which then handles all network control traffic: ARP, IPv6 NDP, BFD, LLDP and routing protocols like IS-IS, OSPF and BGP. Snabbvmx also offered next-hop resolution for processed packets by periodically sending a copy of a packet to the VM via the VhostUser interface. vmx-docker-lwaftr uses snabbvmx.
lwAFTR has evolved dramatically since snabbvmx was introduced. A YANG configuration file can now configure more than one instance serving network interfaces, and more and more basic network functions like ARP and ICMP have been implemented. This allows for alternate deployment options, where multiple interfaces can be served as slaves "on a stick" from a master snabb instance. Availability and routing of traffic towards each snabb instance can be handled via an "out-of-band" BGP session between the connecting L2/L3 switch and a route server, e.g. ExaBGP or Junos RR. (See Video of crr-snabb-lwaftr prototype).
But this design requires an additional interface for the BGP session between the switch and the BGP process, and something must monitor the health of each lwAFTR instance and update the routing table. Snabbvmx solved this by "fate sharing" of data and control plane thru each interface.
This brings me to the main idea: Pass control traffic (like BGP) thru one or more interfaces of lwAFTR to an auxiliary TAP interface. LwAFTR runs in hairpinning mode.
These auxiliary interfaces pass traffic thru between the L2/L3 switch and an attached route reflector. The L2/L3 switch, lwAFTR and the route reflector all have unique IPv4 and IPv6 addresses and MAC addresses per link. IMHO lwAFTR doesn't need to know the route reflector's IP address.
Discussion: It is probably sufficient for lwAFTR to know the route reflector's MAC address. It is tempting to ignore the MAC address and simply pass BUM (Broadcast, Unknown unicast and Multicast) traffic thru, together with packets on the same subnet, but that would make it impossible to use loopback IP addresses. I'll most likely stick to telling lwAFTR the MAC address of the connected route reflector via configuration.
Configuration
The YANG software-config already has containers for internal-interface and external-interface. This shall be extended with a passthru-interface container. The following example illustrates the idea:
There is no need to specify an IP address for the passthru-interface; forwarding decisions towards the TAP interface are made at L2, based on the MAC address and an optional VLAN. The optional leaves promiscuous and sample-rate are an idea for using the same tap interface for monitoring purposes and merit further investigation.
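The L2 decision described above can be sketched in a few lines of shell. This is only an illustration of the matching rule, not how Snabb implements it: the function names, the PASSTHRU_MAC and PASSTHRU_VLAN variables and the two result labels are assumptions made for this example, and it deliberately ignores broadcast and multicast handling, which the discussion leaves open.

```shell
#!/bin/bash
# Illustrative L2 match: does a frame belong to the passthru interface?
PASSTHRU_MAC=12:fe:11:f1:b4:1d   # example MAC from a passthru-interface config
PASSTHRU_VLAN=""                 # empty means: no VLAN match required

# Normalise a MAC address to lower case so comparisons are case-insensitive.
norm_mac() { printf '%s' "$1" | tr 'A-F' 'a-f'; }

# classify_frame DST_MAC [VLAN]
# Prints "passthru" when the frame is addressed to the configured MAC
# (and VLAN, if one is configured); otherwise prints "lwaftr".
classify_frame() {
  dst=$(norm_mac "$1")
  vlan=${2:-}
  if [ "$dst" = "$(norm_mac "$PASSTHRU_MAC")" ] &&
     { [ -z "$PASSTHRU_VLAN" ] || [ "$vlan" = "$PASSTHRU_VLAN" ]; }; then
    echo passthru
  else
    echo lwaftr
  fi
}

classify_frame 12:FE:11:F1:B4:1D   # prints "passthru"
classify_frame 02:22:22:22:22:22   # prints "lwaftr"
```

Because only the destination MAC (and optionally the VLAN) is inspected, no IP address is needed on the passthru side, which matches the intent stated above.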
Next-hop resolution: snabbvmx uses the nh_fwd app to periodically pass a copy of an egress packet to the connected vMX VM. IMHO it is sufficient to specify the IPv4 and IPv6 next-hop addresses in the lwAFTR config and have the route reflector interact with Snabb config to learn about the active binding table entries and combine them with the internal- and external-interface IP addresses to create routing entries. Having multiple instances, each connected to a separate interface of the L2/L3 switch, offers redundancy.