amzn / amzn-drivers

Official AWS drivers repository for Elastic Network Adapter (ENA) and Elastic Fabric Adapter (EFA)

ena xdp native mode much slower than xdp skb/generic for XDP_TX #264

Closed agentzh closed 1 year ago

agentzh commented 1 year ago

Hi, dear ena developers

We've been fighting an ena driver performance issue with XDP_TX operations on EC2 instances, and hopefully you can help us out.

We've noticed that the xdp native mode with the ena driver suffers performance issues for ebpf programs that use XDP_TX to send packets back directly out of the same eth0 interface.

For example, on a c5n.2xlarge EC2 instance, the ebpf program is loaded into the eth0 interface with the XDP native mode (or driver mode):

$ ip link
...
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 3000 xdp qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 02:9e:ab:77:93:41 brd ff:ff:ff:ff:ff:ff
    prog/xdp id 138
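
For reference, the only difference between the two modes compared in this report is the attach flag. Below is a minimal libbpf-based loader sketch (illustrative only, not our actual tooling; the attach_xdp helper is made up for the example):

/* Illustrative libbpf sketch (not the actual loader used here): attach an
 * XDP object in native (driver) mode; passing XDP_FLAGS_SKB_MODE instead
 * requests generic/skb mode. Assumes libbpf 1.x error conventions. */
#include <bpf/libbpf.h>
#include <linux/if_link.h>   /* XDP_FLAGS_DRV_MODE, XDP_FLAGS_SKB_MODE */
#include <net/if.h>          /* if_nametoindex() */
#include <stdio.h>

static int attach_xdp(const char *obj_path, const char *ifname, int native)
{
    struct bpf_object *obj = bpf_object__open_file(obj_path, NULL);

    if (!obj || bpf_object__load(obj)) {
        fprintf(stderr, "failed to open/load %s\n", obj_path);
        return -1;
    }

    /* Use the first (and only) program in the object. */
    struct bpf_program *prog = bpf_object__next_program(obj, NULL);
    if (!prog)
        return -1;

    int prog_fd = bpf_program__fd(prog);
    unsigned int ifindex = if_nametoindex(ifname);
    __u32 flags = native ? XDP_FLAGS_DRV_MODE : XDP_FLAGS_SKB_MODE;

    if (bpf_xdp_attach(ifindex, prog_fd, flags, NULL)) {
        fprintf(stderr, "failed to attach XDP program to %s\n", ifname);
        return -1;
    }
    return 0;
}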

And the tx/rx queue counts are configured like this:

$ ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX:     n/a
TX:     n/a
Other:      n/a
Combined:   8
Current hardware settings:
RX:     n/a
TX:     n/a
Other:      n/a
Combined:   4

And I run the dnsgen command from another c5n.4xlarge instance in the same VPC subnet.

The maximum QPS for the reply packets from the XDP bpf program is like this:

Peak RX rate = 1035320

Just a little over 1M RX PPS from the client instance's perspective (or TX PPS from the xdp bpf server's perspective).

The CPU usage on the server side looks normal, far from being maxed out:

 %Cpu(s):  0.0 us,  0.0 sy,  0.0 ni, 86.7 id,  0.0 wa,  0.3 hi, 12.9 si,  0.0 st

And the client instance also has plenty of CPU resources.

But this is less than half of the peak PPS when the xdp generic/skb mode is used with the same software and hardware. The XDP generic/skb mode achieves more than 2.3M PPS:

Peak RX rate = 2344890

Back to the XDP native/driver mode, we checked the stat counters of the ena driver (using the latest version in this git repo, 2.8.3g). The only abnormal counters are as follows.

         queue_4_xdp_tx_prepare_ctx_err: 2300032
         queue_5_xdp_tx_prepare_ctx_err: 2517038
         queue_6_xdp_tx_prepare_ctx_err: 2367258
         queue_7_xdp_tx_prepare_ctx_err: 1860980
         pps_allowance_exceeded: 72030

The last one, pps_allowance_exceeded, is small considering how many packets we send in the test, and thus can be ignored. The most interesting ones are the previous 4, the *_prepare_ctx_err counters. Checking the driver log messages (yeah, we had to hack the ena driver to stop it from flooding dmesg with these errors, otherwise we couldn't see them) shows:

[ 1033.943058] ena 0000:00:05.0 eth0: Not enough space in the tx queue
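
The hack was essentially rate-limiting that message. A hypothetical kernel-side fragment showing the idea (not the actual ena code, and the exact function and call site may differ):

    /* Hypothetical fragment: wrap the existing error print in net_ratelimit()
     * so it does not flood dmesg; the real ena message site may differ. */
    if (net_ratelimit())
        netif_err(adapter, tx_queued, adapter->netdev,
                  "Not enough space in the tx queue\n");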

There are a huge number of such errors generated by the driver. And the TX ring size is always 1024, which is very small for such XDP ebpf programs using XDP_TX:

$  ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:     16384
RX Mini:    n/a
RX Jumbo:   n/a
TX:     1024
Current hardware settings:
RX:     1024
RX Mini:    n/a
RX Jumbo:   n/a
TX:     1024

We've noted that this TX ring size is always 1024, even on much larger EC2 instance types like c5n.9xlarge. The RX ring size does get big (like the 16384 here), but that won't help with the XDP_TX problem we have.

Even on large instances like c5n.9xlarge, the PPS of the XDP_TX packets is lower than in the XDP generic/skb mode, though with a much smaller gap (2.28M PPS vs 2.86M PPS). We think the gap is smaller because of the much larger number of TX queues, which reduces the impact of the small TX ring size. Other cloud vendors like Aliyun provide deeper TX rings/queues:

 $ sudo ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:     4096
RX Mini:    n/a
RX Jumbo:   n/a
TX:     4096
Current hardware settings:
RX:     4096
RX Mini:    n/a
RX Jumbo:   n/a
TX:     4096

The TX queue overflow issue is much worse on smaller instance types like c5n.xlarge and c5n.large. The TX PPS is much more miserable there.

Another disturbing thing is that, for the same TX PPS throughput, the ena driver's XDP native support uses a similar amount of CPU time (the si part) as the XDP generic/skb mode. This is quite unexpected, given that the primary motivation of the XDP native mode is to save (CPU) resources. perf top -g shows the following hot spots across the whole system:

+   90.32%    13.17%  [kernel]                  [k] __softirqentry_text_start
+   88.62%     0.11%  [kernel]                  [k] do_idle
+   88.28%     0.04%  [kernel]                  [k] cpuidle_enter
+   88.24%     0.38%  [kernel]                  [k] cpuidle_enter_state
+   87.85%     0.09%  [kernel]                  [k] acpi_safe_halt
+   77.01%     0.17%  [kernel]                  [k] net_rx_action
+   76.82%     0.03%  [kernel]                  [k] __napi_poll
+   58.16%    17.48%  [kernel]                  [k] ena_io_poll
+   18.62%     1.93%  [kernel]                  [k] ena_xdp_io_poll

I don't know how to make it faster here. It would be great if you could shed some light on this. On the Aliyun cloud, for example, using the XDP native/driver mode usually saves more than 50% of the CPU (si) time for the same eBPF program and load compared to the XDP generic/skb mode. Unfortunately, we see the reverse on AWS EC2 :(

Thanks in advance for your help!

agentzh commented 1 year ago

I forgot to mention that the packet size in the tests above is around 500 bytes.

agentzh commented 1 year ago

And we also tested an older version of the ena driver, the one shipped with the 5.13 stock kernel. Its performance is significantly worse than the latest version of ena for the XDP native/driver mode. And we always use the 5.13 stock kernel for all our testing above.

davidarinzon commented 1 year ago

Hi @agentzh

Thanks for taking the time to report this issue, we will look into this and update you with our findings.

agentzh commented 1 year ago

@davidarinzon Thanks for your attention, and we're looking forward to your findings.

agentzh commented 1 year ago

@davidarinzon The virtio_net kernel driver has a similar issue of flooding dmesg with errors when the TX queue is full and the XDP_TX operation is used. The following patch, just submitted to LKML, might address it:

https://lore.kernel.org/all/20230306041535.73319-1-xuanzhuo@linux.alibaba.com/

Thanks to Xuan Zhuo from Aliyun for looking into this issue in virtio_net. Maybe the approach is also applicable to the ena driver and could address at least some of the problems reported here? Thanks!

davidarinzon commented 1 year ago

Thank you @agentzh for looking into this, we will look into these patchsets to identify whether the issues found in virtio_net may be applicable to this case.

agentzh commented 1 year ago

@davidarinzon The virtio_net patch just avoids the dmesg error flooding. It cannot help with the underlying TX queue/ring overflow issue of XDP_TX. The underlying issue is much more severe in AWS EC2 since the TX ring size is only 1024.

agentzh commented 1 year ago

@davidarinzon Any updates or progress on this matter, please?

amitbern-aws commented 1 year ago

@agentzh The ENA device does not support TX queues deeper than the maximum value of 1024. Currently, we are engaged in efforts to reproduce the issue. To further elaborate, could you kindly provide the following details:

  1. Could you please specify the exact testing scenario, including the client, middleman, and server involved, as well as the protocol and dnsgen command used?
  2. What is the type and version of the kernel being used?
  3. You mentioned reproducing the issue with a combination of instance types. Could you please specify the exact instance types being used, how they are connected, and the PPS value obtained for each combination?
  4. What is the purpose of the eBPF program you are running, and would it be possible to share it?
  5. Would it be possible for you to verify whether the issue reproduces on the c6i instance family?
  6. Could you please share the instance IDs with us?

Please feel free to contact me directly at amitbern@amazon.com to facilitate further debugging.

Thank you.

agentzh commented 1 year ago

Hi @amitbern-aws

Thanks for your reply! And sorry for my late reply. I didn't see your replies until today.

To answer your questions:

Could you please specify the exact testing scenario, including the client, middleman, and server involved, as well as the protocol and dnsgen command used?

It is a c5n.4xlarge instance (as a client) stressing another c5n.2xlarge instance (as the server) running inside the same VPC subnet. They communicate via private IP addresses only. I'm using a Fedora 33 x86_64 operating system.

What is the type and version of the kernel being used?

It is a stock 5.13.18 kernel compiled from source.

You mentioned reproducing the issue with a combination of instance types.

We can just look at the c5n.4xlarge -> c5n.2xlarge combination. Other combinations do not really differ.

Could you please specify the exact instance types being used, how they are connected, and the PPS value obtained for each combination?

c5n.4xlarge (client) -> c5n.2xlarge (server), connected via private IP addresses and the UDP protocol (they are in the same VPC subnet). PPS value obtained: 1035320.

What is the purpose of the eBPF program you are running, and would it be possible to share it?

The purpose of the eBPF program is a DNS authoritative server that uses XDP_TX to send back DNS replies directly for any incoming DNS queries.

Sorry, we cannot share the source code of the eBPF program due to licensing and IP restrictions. But I believe this problem can easily be reproduced with the simplest eBPF XDP program which just sends back something via XDP_TX.
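
For example, a minimal reflector of that kind (illustrative only, not our program; it assumes IPv4/UDP without IP options and simply bounces every such packet back) could look like this:

/* Minimal XDP_TX reflector sketch: swap Ethernet MACs, IPv4 addresses and
 * UDP ports, then transmit the packet back out of the same interface.
 * Pure swaps leave the IP and UDP checksums valid, so no fix-up is needed. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int xdp_reflect(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    struct iphdr *iph;
    struct udphdr *udph;

    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end)
        return XDP_PASS;
    /* Assume no IP options and UDP only, for brevity. */
    if (iph->ihl != 5 || iph->protocol != IPPROTO_UDP)
        return XDP_PASS;

    udph = (void *)(iph + 1);
    if ((void *)(udph + 1) > data_end)
        return XDP_PASS;

    /* Swap L2 addresses. */
    unsigned char tmp_mac[ETH_ALEN];
    __builtin_memcpy(tmp_mac, eth->h_dest, ETH_ALEN);
    __builtin_memcpy(eth->h_dest, eth->h_source, ETH_ALEN);
    __builtin_memcpy(eth->h_source, tmp_mac, ETH_ALEN);

    /* Swap L3 addresses. */
    __be32 tmp_ip = iph->saddr;
    iph->saddr = iph->daddr;
    iph->daddr = tmp_ip;

    /* Swap L4 ports and send the packet straight back. */
    __be16 tmp_port = udph->source;
    udph->source = udph->dest;
    udph->dest = tmp_port;

    return XDP_TX;
}

char _license[] SEC("license") = "GPL";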

Would it be possible for you to verify whether the issue reproduces on the c6i instance family?

Sure! I'll give it a shot and get back to you soon. It seems like c6in has better network performance than c6i? Can I just try c6in.large here?

Could you please share the instance IDs with us?

I've just sent it to your email address.

Many thanks!

agentzh commented 1 year ago

@amitbern-aws I just tried; I cannot change the instance type to c6i.large or c6in.large. The AWS console says "The instance configuration for this AWS Marketplace product is not supported. Please see the AWS Marketplace site for more information about supported instance types, regions, and operating systems." It is using a Fedora 32 x86_64 image from the AWS Marketplace. Is there any way to work around this without switching the whole image?

agentzh commented 1 year ago

@amitbern-aws Oh, sorry, I missed one of your questions above:

as well as the protocol and dnsgen command used?

The dnsgen command I used is like this:

sudo ./dnsgen -b 512 -s 172.31.30.5 -d domains.txt -p 53 -l 100 -T $(nproc) -i eth0 -a 172.31.23.82 -r 2200000 -R 100000 -m 02:9e:ab:77:93:41

amitbern-aws commented 1 year ago

@agentzh

  1. You are using ami-0a4e3b085497e3086 (Fedora 32 Cloud Base Images (x86_64) HVM), which doesn't support c6 instances. Can you try to reproduce using Amazon Linux 2 or Amazon Linux 2023 AMIs?
  2. Just to clarify, in the following test setup: c5n.4xlarge (client) -> c5n.2xlarge (server). Packets are sent from the client (c5n.4x), received by the server (c5n.2x), marked as XDP_TX in the eBPF program, and retransmitted from the server using the same interface to where? (Back to the c5n.4x client?)

agentzh commented 1 year ago

@amitbern-aws

can you try to reproduce using Amazon Linux 2 or Amazon Linux 2023 AMIs?

Unfortunately, none of the Fedora Cloud Base Images support c6. Alas. Switching the Linux distro to Amazon Linux would involve a lot of work on our side, and we currently don't have that much time to do it. Maybe in the future...

Packets are sent from client (c5n.4x), received in the server (c5n.2x), marked as XDP_TX in the eBPF program and retransmitted from the server using the same interface to where? (back to the c5n.4x client?)

Yes. The client runs dnsgen. The server runs the xdp bpf program. The client sends a DNS query to the server and then the server returns a DNS reply to the client.

amitbern-aws commented 1 year ago

@agentzh How many UDP streams are you running in parallel from the client (c5n.4xlarge) to the XDP server (c5n.2xlarge)?

agentzh commented 1 year ago

@amitbern-aws The dnsgen tool tries to send the DNS query packets as fast as possible (as long as it can receive a similar amount of DNS reply packets). And it uses the same number of OS threads as the number of CPU cores on that c5n.4x instance. Please refer to the dnsgen command I provided above for more details. Thanks!

I was using the open source version of the dnsgen without any patches.

agentzh commented 1 year ago

@amitbern-aws Because the XDP generic mode can achieve way more PPS than the XDP native mode with the ena driver, I don't think the client or the link between these two instances is at fault here. The only difference is the XDP mode.

amitbern-aws commented 1 year ago

@agentzh

The performance issue you encounter is associated with the design of our XDP support combined with the ENA device architecture for this particular instance family. The disparity between the TX and RX rates, as measured in PPS, can be attributed to prioritization between RX and TX within the ENA device.

Regarding the PPS difference observed between generic and native XDP modes, there are various factors that impact it. In generic XDP mode, you can use all of the available TX/RX queues (8 queues on c5n.2xlarge), while in native XDP mode, you can only use half of the queues (4 RX queues and 8 TX queues - 4 regular and 4 XDP queues).

We have several recommendations to mitigate this issue. Consider using next-gen instances, such as C6i or M6i, which have improved balance between TX and RX for your use case.

I hope this clarifies things for you. Thanks

agentzh commented 1 year ago

@amitbern-aws Thanks for your detailed reply!

I have some further questions:

In generic XDP mode, you can use all of the available TX/RX queues (8 queues on c5n.2xlarge), while in native XDP mode, you can only use half of the queues (4 RX queues and 8 TX queues - 4 regular and 4 XDP queues).

Is there any way to work around the "half of the queues" limitation in the ena driver?

So, if I understand it correctly, there's nothing we can do with the c5n instances or even smaller instances to make the xdp native mode more performant than the xdp generic mode?

We also have to balance the cost ($$) and the PPS performance. It's not like we can always choose the best instance types ;)

Consider using next-gen instances, such as C6i or M6i, which have improved balance between TX and RX for your use case.

So on those instances, will the PPS of XDP native + XDP_TX outperform the XDP generic mode? Could you share some numbers if you have any?

Many thanks!

amitbern-aws commented 1 year ago

@agentzh

  1. No, sorry, there is currently no workaround for the 50% queues limit.
  2. Yes, c5n instances with your environment will probably not suit your needs.

Thanks

amitbern-aws commented 1 year ago

@agentzh One more update: we are preparing a Marketplace Fedora 37 AMI which will support the C6i instance family; I will update once it's ready. You can also try the Fedora 37 community AMIs, which already support C6i: https://alt.fedoraproject.org/en/cloud/

Thanks