emmericp / MoonGen

MoonGen is a fully scriptable high-speed packet generator built on DPDK and LuaJIT. It can saturate a 10 Gbit/s connection with 64 byte packets on a single CPU core while executing user-provided Lua scripts for each packet. Multi-core support allows for even higher rates. It also features precise and accurate timestamping and rate control.
MIT License

Consistent Packet Loss Running MoonGen in VM #161

Open james-jra opened 7 years ago

james-jra commented 7 years ago

System

Ubuntu 14.04 VM running on OpenStack Mitaka, using SR-IOV and one Intel X520 NIC (2 physical 10 GbE ports in total). Latest version of MoonGen.

Topology

For the purposes of this test, I used a single VM with 2 vNICs, one for each physical port; the two ports are connected by a physical cable.

Issue

In simple tests that transmit from one interface and receive on the other over a physical cable, MoonGen's counters report packet loss when the transmission rate exceeds roughly 0.5 million packets per second. The loss is less significant when not using software rate control, but it is still present. I imagine this difference is due to the fact that software rate control really transmits in short line-rate bursts. (Hardware rate control is not supported by this device driver; requesting it just sends at some uncontrolled rate.)

  1. Have you experienced similar packet loss?
  2. Can you suggest how we might go about debugging this?
  3. Are there any performance tweaks we could make to prevent this?

This packet_loss.lua.txt test script was used.
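The script is essentially a TX task and an RX task with a device counter on each side. Roughly along these lines (a simplified sketch only, not the attached file verbatim; the duration argument and the -r rate-control option are omitted):

```lua
-- Simplified sketch; not the attached packet_loss.lua verbatim.
local mg     = require "moongen"
local device = require "device"
local memory = require "memory"
local stats  = require "stats"

function master(txPort, rxPort)
	local txDev = device.config{port = tonumber(txPort), txQueues = 1}
	local rxDev = device.config{port = tonumber(rxPort), rxQueues = 1}
	device.waitForLinks()
	mg.startTask("txTask", txDev)
	mg.startTask("rxTask", rxDev)
	mg.waitForTasks()
end

function txTask(dev)
	local queue = dev:getTxQueue(0)
	local mempool = memory.createMemPool(function(buf)
		buf:getEthernetPacket():fill{ethSrc = queue}
	end)
	local bufs = mempool:bufArray()
	local ctr = stats:newDevTxCounter(dev, "plain")
	while mg.running() do
		bufs:alloc(60)   -- 64-byte frames on the wire once the CRC is added
		queue:send(bufs)
		ctr:update()
	end
	ctr:finalize()       -- prints totals like the TX line in the trace below
end

function rxTask(dev)
	local queue = dev:getRxQueue(0)
	local bufs = memory.bufArray()
	local ctr = stats:newDevRxCounter(dev, "plain")
	while mg.running() do
		local rx = queue:recv(bufs)
		bufs:free(rx)
		ctr:update()
	end
	ctr:finalize()
end
```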

Example MoonGen trace for back-to-back transmission. This test used software rate control, sending at 0.6 Mpps for 10 seconds; the receiver keeps receiving until no packets have arrived for a 5 second period. It showed around 2000 packets lost over the 10 second run:

$ sudo ./build/MoonGen examples/packet_loss.lua 4 0 10 -r 0.6
[WARN]  malloc() allocates objects >= 1 MiB from LuaJIT memory space.
[WARN]  Install libjemalloc if you encounter out of memory errors.
[INFO]  Initializing DPDK. This will take a few seconds...
EAL: Detected 12 lcore(s)
EAL: Probing VFIO support...
EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using unreliable clock cycles !
EAL: PCI device 0000:00:03.0 on NUMA socket -1
EAL:   probe driver: 1af4:1000 rte_virtio_pmd
EAL: PCI device 0000:00:05.0 on NUMA socket -1
EAL:   probe driver: 8086:10ed rte_ixgbevf_pmd
EAL: PCI device 0000:00:06.0 on NUMA socket -1
EAL:   probe driver: 8086:10ed rte_ixgbevf_pmd
...
[INFO]  Found 8 usable devices:
   Device 0: FA:16:3E:B4:1A:79 (Intel Corporation 82599 Ethernet Controller Virtual Function)
   ...
   Device 4: FA:16:3E:DF:54:F3 (Intel Corporation 82599 Ethernet Controller Virtual Function)
   ...
[INFO]  Waiting for devices to come up...
[INFO]  Device 4 (FA:16:3E:DF:54:F3) is up: 10000 MBit/s
[INFO]  Device 0 (FA:16:3E:B4:1A:79) is up: 10000 MBit/s
[INFO]  2 devices are up.
[INFO]  Tx Dev  MAC: FA:16:3E:DF:54:F3  Driver: rte_ixgbevf_pmd
[INFO]  Rx Dev  MAC: FA:16:3E:B4:1A:79  Driver: rte_ixgbevf_pmd
[INFO]  Sending using software rate-control at 0.6 Mpps
[INFO]  Finalizing TX Counter
[TX_CTR] TX:  total 6017152 packets with 746126848 bytes (incl. CRC)
[INFO]  Finalizing RX Counter
[RX_CTR] RX:  total 6015209 packets with 745885916 bytes (incl. CRC)
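(For reference: the gap between the two counters above is 6,017,152 - 6,015,209 = 1,943 packets, i.e. roughly 0.03% of the ~6 million packets sent in the 10 second run.)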
emmericp commented 7 years ago

That's an interesting problem. I suspect that the counters are wrong: we don't have an explicit implementation of statistics for the ixgbevf driver, so it falls back to the DPDK implementation.

The DPDK stats are unfortunately vastly inconsistent between different drivers; IIRC, ixgbe didn't count "missed" packets properly in DPDK, which is why we re-implemented this part.

Unfortunately I don't have an SR-IOV setup here at the moment, so I can't reproduce this. But I've been wanting to take a deeper look at SR-IOV in the near future anyway. I might be able to test it without a VM but with SR-IOV next week.

A few things to try:

  1. Can you try to manually count the packets sent and received? queue:send() also returns the number of packets transmitted. Do the counters add up to what the device is reporting? How does this vary with the rate?
  2. MoonGen's software rate control does not send small bursts; that's the whole point of it ;) This also means that it is the slowest way to send packets (and it stresses the NIC/PCI bus the most; in fact, the 82599 can only handle about 10 Mpps with this). Can you try a simple burst-based rate control instead? Just send a burst and then sleep (mg.sleepMicros(micros), which directly calls rte_delay_us()) in the loop; see the sketch after this list.
  3. Does this also happen when using SR-IOV without a VM?
  4. Does this also happen when using a PF instead of a VF in a VM?
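A rough sketch of the burst-and-sleep idea from point 2, assuming the usual MoonGen memory/queue APIs; the burst size and the way the sleep interval is derived from the target rate are illustrative, not a tuned implementation:

```lua
-- Hedged sketch of point 2: burst-based rate control via send + sleep.
local mg     = require "moongen"
local memory = require "memory"

function txBurstTask(queue, targetMpps)
	local mempool = memory.createMemPool(function(buf)
		buf:getEthernetPacket():fill{ethSrc = queue}
	end)
	local bufs = mempool:bufArray(64) -- one burst of 64 packets
	-- time budget per burst at the requested rate; this ignores the time spent
	-- building and sending the burst, so the achieved rate will be a bit lower
	local microsPerBurst = math.floor(64 / targetMpps)
	while mg.running() do
		bufs:alloc(60) -- 64-byte frames on the wire once the CRC is added
		queue:send(bufs)
		-- mg.sleepMicros() directly calls rte_delay_us() (busy-wait)
		mg.sleepMicros(microsPerBurst)
	end
end
```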
james-jra commented 7 years ago

Thanks for getting back to me, and providing some suggestions for looking into this.

I've since run some more tests corresponding to your suggestions. The topology for tests 1 & 2 used one VM with 2 vNICs, one from each physical 10 GbE port, connected by a physical cable. Test 5 used 2 separate VMs with one vNIC each. All tests ran for 30 seconds and measured total TX packets and total RX packets.

  1. Manually counting packets gives the same results as reported by the device. (For this, I used the values returned by queue:send() on the TX side and queue:tryRecv() on the RX side; a sketch follows after this list.) This confirms your assertion that the RX device is not counting missed packets: rte_eth_stats_get() always returns imissed = 0.

  2. Right you are; I'd stopped reading after seeing rte_eth_tx_burst in src/software-rate-limiter.cpp and didn't realise you were only passing a single packet. Running simple burst-based rate control (queue:send() followed by mg.sleepMicros()) gave less packet loss than the software rate limiter, and the loss scaled linearly with the requested rate. All counter types were consistent in this test. I've attached a graph showing how the packet loss percentage depends on the offered rate.

[Graph: packet loss (%) vs. offered rate]

3&4. It is not currently convenient to make the necessary changes to our OpenStack rig to investigate these.

  5. I wrote a simple tight receive loop in C using DPDK to poll a receive queue in a second VM, and used this instead of MoonGen as the receiving endpoint to repeat the first two tests. main.c.txt

    This showed zero packet loss across the whole test range in both cases. This confirms that packets were indeed being lost on the RX side of transmission.
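For reference, a minimal sketch of the RX side of the manual counting in point 1, relying only on the return value of queue:tryRecv() rather than the device/DPDK statistics; the second argument to tryRecv() and its value are an assumption here, and the TX side sums the return values of queue:send() in the same way:

```lua
-- Sketch: count received packets from tryRecv()'s return value instead of
-- trusting the VF statistics (imissed is always 0 with this driver).
local mg     = require "moongen"
local memory = require "memory"
local log    = require "log"

function rxCountTask(queue)
	local bufs = memory.bufArray()
	local received = 0
	while mg.running() do
		-- returns the number of packets pulled from the RX ring; the second
		-- argument is a poll/wait bound (assumed value, tune as needed)
		local rx = queue:tryRecv(bufs, 1000)
		received = received + rx
		bufs:free(rx)
	end
	log:info("RX (manual count): %d packets", received)
end
```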

My interpretation of these results is that the RX queue's ring buffer fills up and packets are then dropped (but not reported, due to the previously mentioned stats unreliability of the ixgbevf driver). The solution is therefore to poll the ring buffer more frequently (as in the C receive tight loop), or to increase the size of the ring buffer (by passing a larger rxDescs value when configuring the device); see the sketch below.
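For the second option, a minimal sketch of what that configuration change might look like, assuming device.config{} accepts the rxDescs override mentioned above (4096 is an illustrative value; the NIC caps the maximum ring size):

```lua
-- Sketch: enlarge the RX descriptor ring at configuration time.
local device = require "device"

local rxDev = device.config{
	port     = 0,      -- illustrative port number
	rxQueues = 1,
	rxDescs  = 4096,   -- default ring size is much smaller
}
device.waitForLinks()
```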

As I understand it, I shouldn't expect to be dropping packets at the rates tested above, so I think that using SR-IOV VFs and/or running in a VM must have some performance impact as well.

emmericp commented 7 years ago

I read through the relevant parts of the 82599 and X550 datasheets and there is simply no counter for missed packets in a VF :(

So you will have to receive every single packet from the VF to count them; there is unfortunately no way around that.

A few observations:

You still shouldn't see packet loss, though. I'll run some more tests later.