emmericp / MoonGen

MoonGen is a fully scriptable high-speed packet generator built on DPDK and LuaJIT. It can saturate a 10 Gbit/s connection with 64 byte packets on a single CPU core while executing user-provided Lua scripts for each packet. Multi-core support allows for even higher rates. It also features precise and accurate timestamping and rate control.

Intel XL710 40Gbps saturation #289

Open marcofaltelli opened 3 years ago

marcofaltelli commented 3 years ago

Hi, I'm trying to saturate an Intel XL710 NIC with 64B packets. On a single core I manage to obtain 21 Mpps (which is about 11 Gbit/s). From your paper I understood that these NICs can get up to 22 Gbit/s with 64B packets, so I tried to create multiple sender slaves on different cores. The results are kind of strange: I get around 13 Mpps received in total, but that is also the number reported by the statistics of every single Tx queue, even though in my code I've created three different Tx counters, one for every Tx queue (see below).


[Device: id=1] RX: 12.96 Mpps, 6636 Mbit/s (8710 Mbit/s with framing)
[Device: id=0] TX: 12.96 Mpps, 6637 Mbit/s (8710 Mbit/s with framing)
[Device: id=0] TX: 12.96 Mpps, 6636 Mbit/s (8710 Mbit/s with framing)
[Device: id=0] TX: 12.96 Mpps, 6636 Mbit/s (8710 Mbit/s with framing)
[Device: id=1] RX: 12.99 Mpps, 6652 Mbit/s (8730 Mbit/s with framing)
[Device: id=0] TX: 12.99 Mpps, 6653 Mbit/s (8732 Mbit/s with framing)
[Device: id=0] TX: 12.99 Mpps, 6653 Mbit/s (8732 Mbit/s with framing)
[Device: id=0] TX: 12.99 Mpps, 6653 Mbit/s (8732 Mbit/s with framing)
[Device: id=1] RX: 12.95 Mpps, 6632 Mbit/s (8704 Mbit/s with framing)
[Device: id=0] TX: 12.95 Mpps, 6631 Mbit/s (8703 Mbit/s with framing)
[Device: id=0] TX: 12.95 Mpps, 6631 Mbit/s (8703 Mbit/s with framing)
[Device: id=0] TX: 12.95 Mpps, 6631 Mbit/s (8703 Mbit/s with framing)

My master and slave functions are as follows. They are taken from this test of the software-switches suite.

function master(args)
    txDev = device.config{port = args.txDev, rxQueues = 4, txQueues = 4}
    rxDev = device.config{port = args.rxDev, rxQueues = 4, txQueues = 4}
    device.waitForLinks()
    -- leave headroom for at most 1 kpps of timestamping traffic:
    -- (size + 4 byte CRC) * 8 / 1000 is the Mbit/s consumed by 1000 such packets per second
    -- rate will be somewhat off for high-latency links at low rates
    if args.rate > 0 then
        txDev:getTxQueue(0):setRate(args.rate - (args.size + 4) * 8 / 1000)
        txDev:getTxQueue(1):setRate(args.rate - (args.size + 4) * 8 / 1000)
        txDev:getTxQueue(3):setRate(args.rate - (args.size + 4) * 8 / 1000)
    end
    rxDev:getTxQueue(0).dev:UdpGenericFilter(rxDev:getRxQueue(3))

    mg.startTask("loadSlave", txDev:getTxQueue(0), rxDev, args.size)
    mg.startTask("loadSlave", txDev:getTxQueue(1), rxDev, args.size)
    mg.startTask("loadSlave", txDev:getTxQueue(3), rxDev, args.size)
    mg.startTask("receiveSlave", rxDev:getRxQueue(3), rxDev, args.size)
    mg.waitForTasks()
end

function loadSlave(queue, rxDev, size)

    log:info(green("Starting up: LoadSlave"))

    -- retrieve the number of xstats on the receiving NIC
    -- xstats-related C definitions are in device.lua
    local numxstats = 0
    local xstats = ffi.new("struct rte_eth_xstat[?]", numxstats)

    -- because there is no easy function that returns the number of xstats, we try to retrieve
    -- the xstats with a zero-sized array;
    -- if result > numxstats (0 in our case), then result equals the real number of xstats
    local result = C.rte_eth_xstats_get(rxDev.id, xstats, numxstats)
    numxstats = tonumber(result)

    local mempool = memory.createMemPool(function(buf)
        fillUdpPacket(buf, size)
    end)
    local bufs = mempool:bufArray()
    local txCtr = stats:newDevTxCounter(queue, "plain")
    local baseIP = parseIPAddress(SRC_IP_BASE)
    local dstIP = parseIPAddress(DST_IP)

    -- send out UDP packets until the user stops the script
    while mg.running() do
        bufs:alloc(size)
        for i, buf in ipairs(bufs) do
            local pkt = buf:getUdpPacket()
            pkt.ip4.src:set(baseIP)
            pkt.ip4.dst:set(dstIP)
        end
        -- UDP checksums are optional, so using just IPv4 checksums would be sufficient here
        --bufs:offloadUdpChecksums()
        queue:send(bufs)
        txCtr:update()
    end
    txCtr:finalize()
end
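
A manually updated counter per task would give strictly per-queue numbers independent of the device-level statistics. A minimal sketch of the send loop, assuming the stats module's newManualTxCounter/updateWithSize helpers behave as in the bundled examples and that the queue object exposes its index as queue.qid:

-- sketch only: per-queue accounting with a manual counter instead of a device counter
local perQueueCtr = stats:newManualTxCounter("txq " .. queue.qid, "plain")
while mg.running() do
    bufs:alloc(size)
    -- queue:send() returns the number of packets sent on this queue only
    perQueueCtr:updateWithSize(queue:send(bufs), size)
end
perQueueCtr:finalize()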

Do you have any best practices for scaling to multiple queues and cores on the same NIC? I also tried the tx-multi-core.lua script you used for your paper, but those scripts are not compatible anymore. Cheers

emmericp commented 3 years ago

Can you post your code that you use for receiveSlave?

marcofaltelli commented 3 years ago

Oops, sorry, I forgot to paste it. Here it is:

function receiveSlave(rxQueue, rxDev, size)
    log:info(green("Starting up: ReceiveSlave"))

    local mempool = memory.createMemPool()
    local rxBufs = mempool:bufArray()
    local rxCtr = stats:newDevRxCounter(rxDev, "plain")

    -- this will catch a few packets but also cause out_of_buffer errors to show some stats
    while mg.running() do
        local rx = rxQueue:tryRecvIdle(rxBufs, 10)
        rxBufs:freeAll()
        rxCtr:update()
    end
    rxCtr:finalize()
end
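
The xstats code at the top of loadSlave above only queries how many counters the receiving NIC exposes. A sketch of actually dumping them (raw id/value pairs), using only the rte_eth_xstats_get binding that loadSlave already references, would look like this:

-- sketch: fetch and print the raw xstats of the receiving NIC
local numxstats = tonumber(C.rte_eth_xstats_get(rxDev.id, ffi.new("struct rte_eth_xstat[?]", 0), 0))
local xstats = ffi.new("struct rte_eth_xstat[?]", numxstats)
C.rte_eth_xstats_get(rxDev.id, xstats, numxstats)
for i = 0, numxstats - 1 do
    print(tonumber(xstats[i].id), tonumber(xstats[i].value))
end

Depending on the driver, drop counters (like the out_of_buffer errors mentioned in the comment above) show up in this list.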

emmericp commented 3 years ago

That should work; I'm not sure what is going on here. I'll need to test this on real hardware and will get back to this.

FedeParola commented 3 years ago

Hi @emmericp, I think I'm having a similar problem. I use this simple, stripped-down example to test multi-core performance:

local mg     = require "moongen"
local memory = require "memory"
local device = require "device"
local stats  = require "stats"

local PKT_SIZE  = 60

function configure(parser)
    parser:description("Generates traffic.")
    parser:argument("dev", "Device to transmit from."):convert(tonumber)
    parser:option("-c --core", "Number of cores."):default(1):convert(tonumber)
end

function master(args)
    dev = device.config({port = args.dev, txQueues = args.core})
    device.waitForLinks()

    for i=0,args.core-1 do
        mg.startTask("loadSlave", dev:getTxQueue(i))
    end

    local ctr = stats:newDevTxCounter(dev)

    while mg.running() do
        ctr:update()
        mg.sleepMillisIdle(10)
    end

    ctr:finalize()
end

function loadSlave(queue)
    local mem = memory.createMemPool(function(buf)
        buf:getUdpPacket():fill({
            pktLength=PKT_SIZE
        })
    end)
    local bufs = mem:bufArray()

    while mg.running() do
        bufs:alloc(PKT_SIZE)
        queue:send(bufs)
    end
end
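
(The script is launched with the standard MoonGen binary, e.g. ./build/MoonGen gen.lua 0 --core 2; gen.lua here is just whatever name the file above is saved under and 0 is the DPDK port id.)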

On an Intel Xeon Gold 5120 with 14 physical cores (HyperThreading disabled) I get the following numbers:

Cores   Mpps
1       21.42
2       15.44
3       13.75
4       13.87
5       13.69
6       13.81

On another machine with an Intel Xeon E3-1245 (4 cores + HyperThreading, 8 logical cores) I get the following:

Cores   Mpps
1       21.40
2       34.64
3       33.96
4       34.65
5       42.62
6       42.65

In this last case I'm able to saturate the link, but I'm wasting a lot of cores. On both machines I can saturate the link with just two cores using pktgen-dpdk (v20.11.3 on DPDK 20.08).