amzn / amzn-drivers

Official AWS drivers repository for Elastic Network Adapter (ENA) and Elastic Fabric Adapter (EFA)

[Support]: Problem with NAPI busy poll ENA and/or creating a working setup with 1 RX/TX queue #312

Closed: lano1106 closed 1 month ago

lano1106 commented 1 month ago

Preliminary Actions

Driver Type

Linux kernel driver for Elastic Network Adapter (ENA)

Driver Tag/Commit

kernel 6.9.10-1-ec2 (recompiled by myself from kernel.org git)

$ modinfo ena
filename:       /lib/modules/6.9.10-1-ec2/kernel/drivers/net/ethernet/amazon/ena/ena.ko
license:        GPL
description:    Elastic Network Adapter (ENA)
author:         Amazon.com, Inc. or its affiliates
alias:          pci:v00001D0Fd0000EC21sv*sd*bc*sc*i*
alias:          pci:v00001D0Fd0000EC20sv*sd*bc*sc*i*
alias:          pci:v00001D0Fd00001EC2sv*sd*bc*sc*i*
alias:          pci:v00001D0Fd00000EC2sv*sd*bc*sc*i*
alias:          pci:v00001D0Fd00000051sv*sd*bc*sc*i*
depends:        
intree:         Y
name:           ena
vermagic:       6.9.10-1-ec2 SMP mod_unload 

$ ethtool -i enp39s0
driver: ena
version: 6.9.10-1-ec2
firmware-version: 
expansion-rom-version: 
bus-info: 0000:27:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

Custom Code

No

OS Platform and Distribution

Linux 6.9.10-1-ec2 ArchLinux

Support request

My instance type is a c7i.2xlarge with hyperthreading disabled, so 4 CPUs.

kernel cmdline: ipv6.disable=1 hugepages=72 isolcpus=1,2,3 nohz_full=1,2,3 rcu_nocbs=1,2,3 rcu_nocb_poll irqaffinity=0 idle=nomwait processor.max_cstate=1 intel_idle.max_cstate=1 nmi_watchdog=0

My application uses io_uring with NAPI busy polling, and I pin the threads with the strictest latency requirements to dedicated CPUs.

CPU1 is isolated and a single thread is assigned to it. This thread has an io_uring configured with struct io_uring_napi napiSetting{200, 1}; io_uring_register_napi(ring, &napiSetting);. The CPU1 thread manages about 20 TCP connections.
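
For reference, a minimal sketch of that registration against the liburing API (the helper name and error handling are mine, not the author's code):

    #include <liburing.h>

    /* Hypothetical helper: init a ring and enable NAPI busy polling on it. */
    void setup_napi_ring(struct io_uring *ring)
    {
        io_uring_queue_init(8, ring, 0); /* error handling omitted */

        struct io_uring_napi napi = {};
        napi.busy_poll_to     = 200; /* busy-poll timeout, in microseconds */
        napi.prefer_busy_poll = 1;   /* prefer busy polling over interrupts */
        io_uring_register_napi(ring, &napi);
    }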

The created sockets also have the SO_BUSY_POLL and SO_PREFER_BUSY_POLL options set, but I think io_uring does not use them.

An SQPOLL kernel thread is created and its CPU affinity is set to CPU2.
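
A minimal sketch of that kind of setup with liburing; the queue depth of 256 and the 1000 ms idle timeout are illustrative values, not taken from the author's code:

    #include <liburing.h>

    /* Hypothetical helper: create a ring whose SQPOLL thread is pinned to CPU2. */
    int setup_sqpoll_ring(struct io_uring *ring)
    {
        struct io_uring_params p = {};
        p.flags          = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
        p.sq_thread_cpu  = 2;    /* pin the SQPOLL kernel thread to CPU2 */
        p.sq_thread_idle = 1000; /* ms of inactivity before the thread sleeps */
        return io_uring_queue_init_params(256, ring, &p);
    }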

The CPU3 task manages 2 TCP sockets.

CPU3 runs the second low-latency thread. It also has an io_uring with NAPI busy poll enabled, and its ring is attached to the CPU1 ring's SQPOLL worker:


/*
 * getLoopBE()
 */
unsigned WS_Private::getLoopBE(struct io_uring_params *params,
                                      io_uring_probe *probe) const noexcept
{
    // Uncomment to have a SQPOLL thread for the Private WS connection.

    if (global_t::getInstance().usePubSqPoll()) {
/*
        params->flags         |= IORING_SETUP_SQPOLL;

        // Constant defined in ev_util.h
        params->sq_thread_idle = Kraken::SQ_THREAD_IDLE_VAL;

        params->flags |= IORING_SETUP_SQ_AFF;
        params->sq_thread_cpu = get_private_sq_thread_cpu();
*/
        // Uncomment to attach WQ to the master
        params->flags |= IORING_SETUP_ATTACH_WQ;
    }

    params->flags |= IORING_SETUP_COOP_TASKRUN;
    return Parent::getLoopBE(params, probe);
}
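
One note on IORING_SETUP_ATTACH_WQ: the kernel expects params->wq_fd to carry the file descriptor of the ring whose async worker backend is being shared, presumably filled in further down the call chain (e.g., inside Parent::getLoopBE()). A minimal sketch, assuming a master ring created earlier:

    #include <liburing.h>

    /* Hypothetical helper: create a ring sharing the master ring's workers. */
    int attach_to_master(struct io_uring *ring, const struct io_uring *master)
    {
        struct io_uring_params p = {};
        p.flags = IORING_SETUP_ATTACH_WQ;
        p.wq_fd = master->ring_fd; /* share the master ring's async workers */
        return io_uring_queue_init_params(256, ring, &p);
    }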

Its io_uring is set up for NAPI busy polling in the same way as the CPU1 thread's. Its TCP socket options get the same treatment as well:

    if (global_t::getInstance().useNapiBusyPoll(getTsi())) {
        optval = TrillionTrader::BUSY_POLL_VAL;
        if (unlikely(setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                                &optval, sizeof(int)) < 0)) {
            ERROR2("%d Failed to set SO_BUSY_POLL: %s",
                   WebSocketContext::get_service_tid(), strerror(errno));
        }
        optval = true;
        if (unlikely(setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
                                &optval, sizeof(int)) < 0)) {
            ERROR2("%d Failed to set SO_PREFER_BUSY_POLL: %s",
                   WebSocketContext::get_service_tid(), strerror(errno));
        }
    }

CPU0 is reserved for all the other processes of the system; its load stays below 1%.

Other settings:

#!/bin/bash

#
# Setup the ENA interface for low-latency
#
# To read the settings:
# ethtool -c enp39s0
#

ethtool -G enp39s0 rx 4096
ethtool -L enp39s0 combined 1
# must be placed *after* 'ethtool -L enp39s0 combined 1' as that command resets adaptive-rx to on
ethtool -C enp39s0 adaptive-rx off rx-usecs 0 tx-usecs 0
echo 1000 > /sys/class/net/enp39s0/napi_defer_hard_irqs
echo 500 > /sys/class/net/enp39s0/gro_flush_timeout

(the napi_defer_hard_irqs and gro_flush_timeout values follow the recommendations in Documentation/networking/napi.rst)

First problem: NAPI busy polling is done by the io_uring SQPOLL worker thread, which runs at close to 100% CPU. Despite the busy polling, the ENA driver generates plenty of interrupts:

 67:  235620161          0          0          0  PCI-MSIX-0000:27:00.0   1-edge      enp39s0-Tx-Rx-0
 68:          1  211480332          0          0  PCI-MSIX-0000:27:00.0   2-edge      enp39s0-Tx-Rx-1
 69:          1          0  207685214          0  PCI-MSIX-0000:27:00.0   3-edge      enp39s0-Tx-Rx-2
 70:          1          0          0  243540978  PCI-MSIX-0000:27:00.0   4-edge      enp39s0-Tx-Rx-3

I am doing NAPI busy polling precisely to keep my low-latency threads from being interrupted in what they are doing... Each spurious interrupt induces a 20-50 usec delay in my threads...

It looks like a driver bug that interrupts are still issued with these settings and this usage pattern.

Problem 2: I have tried to work around the issue with ethtool -L enp39s0 combined 1.

The only remaining queue is the one bound to CPU0... but I get this error message in the kernel log:

Jul 26 04:11:44 ip-172-31-39-89 kernel: ena 0000:27:00.0 enp39s0: Command parameter 46 is not supported

When I restart my application, all is good except that the sockets created on CPU3 end up missing incoming data.

I see the server HTTP reply:

[2024-07-26 03:48:45] INFO WSCTX/log_emit_function 7194: Server reply:
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: QpYBhzQ2+wnF5mbBu+NkBHnB0Xk=
Date: Fri, 26 Jul 2024 03:48:44 GMT

but the connection eventually times out.

I have tried

ethtool -L enp39s0 combined 2
ethtool -L enp39s0 combined 3

Same result...

The only way the CPU3 thread can have its TCP sockets serviced adequately is with ethtool -L enp39s0 combined 4.

If I had been able to build a 1-queue setup that sends all its interrupts to CPU0, that would have been fine for me...

In conclusion:

  1. Despite NAPI busy polling, the driver still issues interrupts very generously. This is unexpected.
  2. A 1-queue setup does not work at all for me...

Contact Details

olivier@trillion01.com

lano1106 commented 1 month ago

I made some progress... It appears that, as far as busy polling is concerned, the driver works as expected. I just had to choose better values:

busy_poll_period = 50 together with

echo 1000 > /sys/class/net/enp39s0/napi_defer_hard_irqs

brings the NIC interrupt count to a halt... but I have the feeling of playing whack-a-mole: the busy polling now appears to have awakened a new kernel process, which forces the kernel to generate local timer interrupts again (see the sketch after the interrupt counts below).

 67:  242886291          0          0          0  PCI-MSIX-0000:27:00.0   1-edge      enp39s0-Tx-Rx-0
 68:          1  217801693          0          0  PCI-MSIX-0000:27:00.0   2-edge      enp39s0-Tx-Rx-1
 69:          1          0  216506015          0  PCI-MSIX-0000:27:00.0   3-edge      enp39s0-Tx-Rx-2
 70:          1          0          0  250752987  PCI-MSIX-0000:27:00.0   4-edge      enp39s0-Tx-Rx-3
NMI:          0          0          0          0   Non-maskable interrupts
LOC:   35041049   17865096   18656563   25358830   Local timer interrupts
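
For completeness, a minimal sketch of the adjusted registration, assuming the busy_poll_period value above maps to the busy_poll_to field of struct io_uring_napi (microseconds); the helper name is mine:

    #include <liburing.h>

    /* Hypothetical helper: re-register NAPI settings with a shorter timeout. */
    void retune_napi(struct io_uring *ring)
    {
        struct io_uring_napi napi = {};
        napi.busy_poll_to     = 50; /* shortened busy-poll timeout (usecs) */
        napi.prefer_busy_poll = 1;
        io_uring_register_napi(ring, &napi);
    }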

lano1106 commented 1 month ago

Thanks if you can apply your knowledge to what I am seeing and perhaps help me find a fix. On the other hand, it seems the problem might come from io_uring... I have opened an issue on their GitHub page: https://github.com/axboe/liburing/issues/1190

lano1106 commented 1 month ago

In its most simplified form...

With ethtool -L enp39s0 combined 4:

the 2 sockets of the second ring, managed by the thread running on CPU3, work perfectly.

With ethtool -L enp39s0 combined 1:

the 2 sockets of the second ring, managed by the thread running on CPU3, are not serviced correctly.

What remains to be determined is whether the driver has something to do with this situation or whether the problem comes exclusively from io_uring.

lano1106 commented 1 month ago

FYI, I have found my issue: your driver is working perfectly.

If you perform network operations on an isolated nohz_full processor, the networking softirqs for that processor will never be invoked.