amzn / amzn-drivers

Official AWS drivers repository for Elastic Network Adapter (ENA) and Elastic Fabric Adapter (EFA)
[Bug]: XDP loading fails on amazon/RHEL-9.3.0_HVM-20240229-x86_64 #311

Open borkmann opened 2 months ago

borkmann commented 2 months ago

Preliminary Actions

Driver Type

Linux kernel driver for Elastic Network Adapter (ENA)

Driver Tag/Commit

5.14.0-427.24.1.el9_4.x86_64

Custom Code

No

OS Platform and Distribution

cat /etc/redhat-release 
Red Hat Enterprise Linux release 9.4 (Plow)

amazon/RHEL-9.3.0_HVM-20240229-x86_64-27-Hourly2-GP3

# modinfo ena
filename:       /lib/modules/5.14.0-427.24.1.el9_4.x86_64/kernel/drivers/net/ethernet/amazon/ena/ena.ko.xz
license:        GPL
description:    Elastic Network Adapter (ENA)
author:         Amazon.com, Inc. or its affiliates
rhelversion:    9.4
srcversion:     A37C4DC49D386CE977A6090
alias:          pci:v00001D0Fd0000EC21sv*sd*bc*sc*i*
alias:          pci:v00001D0Fd0000EC20sv*sd*bc*sc*i*
alias:          pci:v00001D0Fd00001EC2sv*sd*bc*sc*i*
alias:          pci:v00001D0Fd00000EC2sv*sd*bc*sc*i*
alias:          pci:v00001D0Fd00000051sv*sd*bc*sc*i*
depends:        
retpoline:      Y
intree:         Y
name:           ena
vermagic:       5.14.0-427.24.1.el9_4.x86_64 SMP preempt mod_unload modversions 
sig_id:         PKCS#7
signer:         Red Hat Enterprise Linux kernel signing key
sig_key:        [...]
sig_hashalgo:   sha256
# ethtool -i eth0
driver: ena
version: 5.14.0-427.24.1.el9_4.x86_64
firmware-version: 
expansion-rom-version: 
bus-info: 0000:00:05.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

And from dmesg:

[    1.944167] ena 0000:00:05.0: ENA device version: 0.10
[    1.944171] ena 0000:00:05.0: ENA controller version: 0.0.1 implementation version 1
[    1.956127] ena 0000:00:05.0: Elastic Network Adapter (ENA) found at mem c0510000, mac addr 02:f6:cf:1b:b2:59

Bug description

When trying to load our XDP program in Cilium on RHEL9.4, we're running into the following error in dmesg which seems correlated timing-wise:

[...]
[  721.887442] IPv6: ADDRCONF(NETDEV_CHANGE): cilium_host: link becomes ready
[  721.904701] ipip: IPv4 and MPLS over IPv4 tunneling driver
[  721.910132] cilium_tunl: renamed from tunl0
[  723.393109] IPv6: ADDRCONF(NETDEV_CHANGE): lxc_health: link becomes ready
[  723.611319] eth0: renamed from tmpe8f4f
[  723.616308] IPv6: ADDRCONF(NETDEV_CHANGE): lxc70db2459b4ec: link becomes ready
[  723.653331] eth0: renamed from tmp90743
[  723.657759] IPv6: ADDRCONF(NETDEV_CHANGE): lxca714e5ba4073: link becomes ready
[ 3037.066035] ena 0000:00:05.0 eth0: Command parameter 46 is not supported     <----------------

And no XDP program got loaded on the device itself:

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 3498 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 02:f6:cf:1b:b2:59 brd ff:ff:ff:ff:ff:ff
    altname enp0s5
    altname ens5

Is the ena driver regularly updated via HW enablement on RHEL9.x? We have users where self-building a driver before use would unfortunately not be an option for production.

We've seen somewhat related issues (#78, amzn/amzn-drivers#241) where this error might hint at XDP.

I hope this is still the right place to ask whether this was fixed upstream. If it was, perhaps you have a chance to poke Red Hat folks to backport the relevant commits into RHEL9.

Reproduction steps

ip link set dev eth0 mtu 3498
ethtool -L eth0 combined 2
(we also tried with combined 1)
load XDP
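
The steps above can be sketched as a small script. This is only an illustration: `xdp_prog.o` is a hypothetical object file and `sec xdp` assumes its ELF section is named `xdp`. Cilium attaches its program through its own loader rather than through `ip`, so this approximates the report rather than reproducing it exactly. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of the reproduction steps. IFACE and XDP_OBJ are placeholders;
# xdp_prog.o is a hypothetical object file, not Cilium's actual program.
IFACE="${IFACE:-eth0}"
XDP_OBJ="${XDP_OBJ:-xdp_prog.o}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print commands; set to 0 to execute (needs root)

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

repro() {
    run ip link set dev "$IFACE" mtu 3498
    run ethtool -L "$IFACE" combined 2
    # Attach in native (driver) mode so a failure comes from the ena driver itself.
    run ip link set dev "$IFACE" xdpdrv obj "$XDP_OBJ" sec xdp
    # An attached program shows up in the link details.
    run ip link show dev "$IFACE"
}

repro
```

Running it with `DRY_RUN=0` as root executes the commands; checking `dmesg` afterwards shows any driver messages.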

Expected Behavior

XDP program loads onto ena driver

Actual Behavior

[ 3037.066035] ena 0000:00:05.0 eth0: Command parameter 46 is not supported

Additional Data

No response

Relevant log output

No response

Contact Details

daniel@isovalent.com

borkmann commented 2 months ago

Small update: with the self-compiled driver we now end up with the following rejection (which also implicitly confirms that ena on RHEL lacks some backports):

[  905.417013] ena: loading out-of-tree module taints kernel.
[  905.417109] ena: module verification failed: signature and/or required key missing - tainting kernel
[  905.421008] ena 0000:00:05.0: Elastic Network Adapter (ENA) v2.12.3g
[  905.430550] ena 0000:00:05.0: ENA device version: 0.10
[  905.430552] ena 0000:00:05.0: ENA controller version: 0.0.1 implementation version 1
[  905.530529] ena 0000:00:05.0: ENA Large LLQ is disabled
[  905.542476] ena 0000:00:05.0: Elastic Network Adapter (ENA) found at mem c0510000, mac addr 02:f6:cf:1b:b2:59
[  905.558391] ena 0000:00:05.0 eth0: Local page cache is disabled for less than 16 channels
[  906.385518] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 1160.334921] ena 0000:00:05.0 eth0: Command parameter 46 is not supported
[ 1260.180617] ena 0000:00:05.0 eth0: XDP program is set, changing the max_mtu from 9216 to 3498
[ 1260.875419] ena 0000:00:05.0 eth0: xdp: dropped unsupported multi-buffer packets

So it looks like the "Command parameter 46 is not supported" message may not have been directly related (?).

Are there plans to support XDP mbuf / multi-buffer for ena? Fwiw, that would lift the requirement to lower the MTU under XDP.
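
For context on where 3498 likely comes from: without multi-buffer support the whole frame, plus XDP headroom and the trailing `skb_shared_info`, has to fit into a single page. The arithmetic below is a sketch under assumed x86_64 sizes. In particular, the 320-byte `skb_shared_info` size varies with kernel version and config, and the function name is made up for illustration. The result matches the value in the log, but this is not taken from the driver source.

```shell
# Rough estimate of the largest MTU that fits one page per packet under XDP.
# All sizes are assumptions for x86_64; shinfo in particular depends on the kernel.
xdp_max_mtu() {
    page_size="${1:-4096}"   # one RX page
    shinfo="${2:-320}"       # assumed aligned sizeof(struct skb_shared_info)
    eth_hlen=14              # Ethernet header
    eth_fcs=4                # frame check sequence
    vlan_hlen=4              # one VLAN tag
    xdp_headroom=256         # XDP_PACKET_HEADROOM
    echo $(( page_size - eth_hlen - eth_fcs - vlan_hlen - xdp_headroom - shinfo ))
}

xdp_max_mtu   # prints 3498 with the defaults above
```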

akiyano commented 2 months ago

Hi @borkmann

Thank you for your inquiry.

The message “Command parameter 46 is not supported” comes from the function ena_get_rxnfc() and is printed because currently the driver does not support flow steering. It is unrelated to xdp being able to load (as you saw when you compiled the github driver). If you don’t rely on flow steering you can probably ignore it.
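
For reference, the number in that message appears to be the ethtool command code passed to the driver. In `include/uapi/linux/ethtool.h`, 46 (0x2e) is `ETHTOOL_GRXCLSRLCNT`, the request for the number of RX classification rules. As far as I know, tools that list flow-steering rules (for example `ethtool -n eth0`) issue it. A small lookup sketch follows; the function name is hypothetical, and only the RX flow classification commands are listed.

```shell
# Map ethtool RX flow classification command numbers to their names
# (values as defined in include/uapi/linux/ethtool.h).
ethtool_rxnfc_cmd_name() {
    case "$1" in
        41) echo ETHTOOL_GRXFH ;;
        42) echo ETHTOOL_SRXFH ;;
        45) echo ETHTOOL_GRXRINGS ;;
        46) echo ETHTOOL_GRXCLSRLCNT ;;
        47) echo ETHTOOL_GRXCLSRULE ;;
        48) echo ETHTOOL_GRXCLSRLALL ;;
        49) echo ETHTOOL_SRXCLSRLDEL ;;
        50) echo ETHTOOL_SRXCLSRLINS ;;
        *)  echo unknown ;;
    esac
}

ethtool_rxnfc_cmd_name 46   # prints ETHTOOL_GRXCLSRLCNT
```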

I tried loading xdp programs on RHEL 9 with your kernel and got an error (which may be different than what you see in dmesg), so indeed it seems there is an issue with xdp support in the driver that comes with this kernel. We will look into it, thank you for the heads up.

It seems that your issue with this kernel may be different than what I see. It may be helpful if you could share with me:

  1. The XDP program you are trying to load and/or a description of what it does. You can send it to me at akiyano@amazon.com
  2. The dmesg output after the “Command parameter 46 is not supported” print.
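
One way to collect the second item is to print the kernel log starting at that message. A minimal sketch: the function name is hypothetical, and the marker is used as a `sed` regular expression, so it must not contain `/`.

```shell
# Print everything from the first line matching the marker to the end of input.
log_from_marker() {
    sed -n "/$1/,\$p"
}

# Usage (reading the kernel log may require root):
#   dmesg | log_from_marker 'Command parameter 46 is not supported'
```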

I can’t answer your question whether it was fixed in upstream linux until I root cause your issue. But the issue I see myself was indeed fixed in upstream linux, and not backported yet to RHEL 9.

Do I understand correctly that you are able to run your xdp program when using the latest github driver that you built yourself on RHEL 9?

As for xdp multi-buffer support in ena: it is currently under development and will indeed lift the requirement to lower the MTU under XDP, but I can’t share the release timeline here.

Arthur

akiyano commented 2 months ago

Hi @borkmann,

Another thing. We are aware of an issue with XDP_REDIRECT not currently working on RHEL 9 with the latest github driver. We have a fix and it will be included in an upcoming release. Meanwhile, if you need XDP_REDIRECT for your testing on RHEL 9, please use the attached workaround patch.

0001-Temporary-fix-for-XDP_REDIRECT-not-working-on-RHEL-9.patch

Arthur

akiyano commented 1 month ago

Hi @borkmann,

You originally attached your dmesg from the failed xdp program load up to:

[ 3037.066035] ena 0000:00:05.0 eth0: Command parameter 46 is not supported <----------------

Can you please share with me (here or via mail akiyano@amazon.com) what happens in dmesg after that? When I encounter an xdp loading issue with this kernel I have more prints, and I'd like to make sure you are seeing the same issue as I am, so that when I try fixing it I know the fix will also help your case.

Thanks!

akiyano commented 1 month ago

Hi @borkmann, We expect this issue to be addressed in the upcoming RHEL 9.5 release, which should be released near the end of 2024. See https://www.redhat.com/en/blog/upcoming-improvements-red-hat-enterprise-linux-minor-release-betas?sc_cid=701f2000000tyBjAAI regarding release schedule.

borkmann commented 1 month ago

> We expect this issue to be addressed in the upcoming RHEL 9.5 release, which should be released near the end of 2024. See https://www.redhat.com/en/blog/upcoming-improvements-red-hat-enterprise-linux-minor-release-betas?sc_cid=701f2000000tyBjAAI regarding release schedule.

Awesome, thanks so much!

borkmann commented 1 month ago

> Hi @borkmann,
>
> You originally attached your dmesg from the failed xdp program load up to:
>
> [ 3037.066035] ena 0000:00:05.0 eth0: Command parameter 46 is not supported <----------------
>
> Can you please share with me (here or via mail akiyano@amazon.com) what happens in dmesg after that? When I encounter an xdp loading issue with this kernel I have more prints, and I'd like to make sure you are seeing the same issue as I am, so that when I try fixing it I know the fix will also help your case.

Cc'ing @strongjz . We've seen this in dmesg:

[  905.558391] ena 0000:00:05.0 eth0: Local page cache is disabled for less than 16 channels
[  906.385518] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 1160.334921] ena 0000:00:05.0 eth0: Command parameter 46 is not supported
[ 1260.180617] ena 0000:00:05.0 eth0: XDP program is set, changing the max_mtu from 9216 to 3498
[ 1260.875419] ena 0000:00:05.0 eth0: xdp: dropped unsupported multi-buffer packets

We basically weren't sure whether the "Command parameter 46 is not supported" message was related or not, which is why we started asking here. The XDP-related error message was about the lack of multi-buffer support.

akiyano commented 1 month ago

@borkmann,

To make sure we are on the same page: I'm still not 100% sure what issue you are seeing with the RHEL 9.3 preinstalled ENA driver that you don't see with the github driver.

There are some known issues that are present up to RHEL 9.4, for which the bug fixes will be backported in RHEL 9.5, but from what you are saying I'm not sure you are experiencing them.

Are you experiencing issues with the driver preinstalled in RHEL 9.3, that are not present in the github driver? What are they?

Regarding your last message:

  1. The message “Command parameter 46 is not supported” comes from the function ena_get_rxnfc() and is printed because currently the driver does not support flow steering. It is unrelated to xdp being able to load (as you saw when you compiled the github driver). If you don’t rely on flow steering you can probably ignore it.
  2. XDP multi-buffer support in ENA is still under development internally and is not yet released.

strongjz commented 1 month ago

We've had two other folks run the 9.3 default ENA driver fine. For me, 9.4 caused issues. I'm going to test with 9.3 next week.

akiyano commented 1 month ago

@strongjz Can you please specify what you are running and what the issues are?