flatcar / Flatcar

Flatcar project repository for issue tracking, project documentation, etc.
https://www.flatcar.org/

Azure Flatcar qdisc issue #1458

Open naioja opened 1 month ago

naioja commented 1 month ago

In Azure, high-performance Linux networking uses SR-IOV with Mellanox drivers (mlx4 or mlx5). Something specific to Azure is that this creates two interfaces, a synthetic one and a virtual function (VF); documentation about it can be found here: https://learn.microsoft.com/en-us/azure/virtual-network/accelerated-networking-how-it-works.
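
On such a VM the two interfaces can be told apart by their drivers. A quick check, using the interface names from the output below (the exact Mellanox driver name depends on the mlx4/mlx5 generation), could look like:

# The synthetic interface is bound to hv_netvsc, the VF to the Mellanox driver
ethtool -i eth0 | grep ^driver          # expected: driver: hv_netvsc
ethtool -i enP50947s1 | grep ^driver    # expected: driver: mlx5_core (or mlx4_en)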

I believe the actual bond is created by the hv_netvsc kernel module, and as we can see in the output below the enP* interface is picked up by the OS as a stand-alone interface and gets a qdisc attached to it (Azure Flatcar LTS VM image output):

ip address show 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
3: enP50947s1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 **qdisc mq master eth0 state UP** group default qlen 1000
tc qdisc show 
qdisc noqueue 0: dev lo root refcnt 2
qdisc mq 0: dev eth0 root
qdisc fq_codel 0: dev eth0 parent :1 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc mq 0: dev enP50947s1 root
qdisc fq_codel 0: dev enP50947s1 parent :1 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64

I believe this to be a faulty configuration in Azure Flatcar VMs using SR-IOV (accelerated networking), as we usually do not apply queuing disciplines to bridged or bonded interfaces like docker0 or virbr0:

ip a s | egrep '(eth0|docker|br-88dd68ef9e6a)'
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 10000
5: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
57: br-88dd68ef9e6a: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
61: vethe268329@if60: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-88dd68ef9e6a state UP group default

This also has implications for how systemd applies the default setting for net.core.default_qdisc through the file /lib/sysctl.d/50-default.conf.
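
To confirm which default qdisc is in effect and which drop-in sets it, something along these lines works (standard sysctl drop-in paths, nothing Flatcar-specific assumed):

# Kernel-wide default qdisc applied to newly created interfaces
sysctl net.core.default_qdisc
# Locate the drop-in that sets it
grep -r default_qdisc /etc/sysctl.d /usr/lib/sysctl.d /lib/sysctl.d 2>/dev/null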

One of the (simple) fixes I found was to apply the following “tuned” udev configuration for the interface queuing disciplines:

/etc/udev/rules.d/99-azure-qdisc.rules
ACTION=="add|change", SUBSYSTEM=="net", KERNEL=="enP*", PROGRAM="/sbin/tc qdisc replace dev $env{INTERFACE} root noqueue"
ACTION=="add|change", SUBSYSTEM=="net", KERNEL=="eth*", PROGRAM="/sbin/tc qdisc replace dev $env{INTERFACE} root fq maxrate 12.5gbit limit 100000"

Specifically for my tests I’ve set the maxrate for the fq queuing discipline to match the VM SKU interface line speed of 12.5 Gbit, and I’ve raised the limit of queued packets to 100K, since the limits set upstream are somewhat arbitrary and may not be the best choice for cloud VMs where higher networking performance is expected.

jepio commented 1 month ago

Hi @naioja, Have you found that the VM is unable to achieve the line speed with the default configuration?

naioja commented 1 month ago

> Hi @naioja, Have you found that the VM is unable to achieve the line speed with the default configuration?

Hi Jeremi,

At the moment a network packet will flow through both queuing disciplines, which is not optimal and indeed impacts performance.

Interface enP39504s1

tc -s qdisc show dev enP39504s1
qdisc fq 8002: root refcnt 33 limit 100000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 3028b initial_quantum 15140b maxrate 12500Mbit low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 9906423866 bytes 6824193 pkt (dropped 0, overlimits 0 requeues 14677)
 backlog 0b 0p requeues 14677
  flows 110 (inactive 107 throttled 0)
  gc 0 highprio 13 throttled 43812 latency 17.8us

Interface eth0

tc -s qdisc show dev eth0
qdisc fq 8001: root refcnt 65 limit 100000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 3028b initial_quantum 15140b maxrate 12500Mbit low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 9906434943 bytes 6824221 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 108 (inactive 106 throttled 0)
  gc 0 highprio 6 throttled 146756 latency 68.9us
jepio commented 1 month ago

Here are some remarks from looking at this:

So the consequences are the following:

a) if we were to follow standard bond semantics, then we would configure netvsc with a noqueue qdisc and the VF with an mq qdisc, as the VF is used for all the traffic. This has some awkward semantics: a netvsc with AN (accelerated networking) disabled would have a qdisc, and when a VF is added we would suddenly have to move the qdisc over to the VF. The same awkwardness applies to servicing events, when the VF is temporarily removed and re-added.

b) the opposite approach, netvsc mq + VF noqueue, seems more logical, BUT due to the different number of queues I wonder if this comes at a perf disadvantage compared to a). (Either layout can also be tried by hand with tc; see the sketch below.)
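
A minimal sketch of doing that, using the interface names from the outputs above (assumed, not verified on every SKU); whichever device is left untouched keeps its existing root qdisc:

# Option a): noqueue on the synthetic netvsc device, VF keeps its queue
tc qdisc replace dev eth0 root noqueue
# Option b): noqueue on the VF, netvsc device keeps its queue
tc qdisc replace dev enP50947s1 root noqueue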

In general, doing this via udev seems like the sane approach, as it can be overridden by users. Otherwise, there is the option of upstreaming a patch that forces one of the two behaviours through the kernel:

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 11831a1c9762..1d9f9a6e0a9a 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -2349,6 +2349,8 @@ static int netvsc_prepare_bonding(struct net_device *vf_netdev)
        if (!ndev)
                return NOTIFY_DONE;

+       vf_netdev->priv_flags |= IFF_NO_QUEUE;
+
        /* set slave flag before open to prevent IPv6 addrconf */
        vf_netdev->flags |= IFF_SLAVE;
        return NOTIFY_DONE;

(This is option b; option a could be achieved by setting ndev->priv_flags instead.)

I'll discuss with our kernel devs to see if they have any thoughts on this issue.

jepio commented 1 month ago

Here is a better rule that matches the VF by checking the "slave" bit: IFF_SLAVE is 0x0800, which is why the flag patterns 0x?8?? through 0x?F?? below match any interface with that bit set.

SUBSYSTEM=="net", DRIVERS!="hv_pci", ACTION=="ADD", GOTO="AZNET_END"

SUBSYSTEM=="net", ACTION!="remove", ATTR{flags}=="0x?8??", ENV{ID_NET_MANAGED_BY}="none"
SUBSYSTEM=="net", ACTION!="remove", ATTR{flags}=="0x?9??", ENV{ID_NET_MANAGED_BY}="none"
SUBSYSTEM=="net", ACTION!="remove", ATTR{flags}=="0x?A??", ENV{ID_NET_MANAGED_BY}="none"
SUBSYSTEM=="net", ACTION!="remove", ATTR{flags}=="0x?B??", ENV{ID_NET_MANAGED_BY}="none"
SUBSYSTEM=="net", ACTION!="remove", ATTR{flags}=="0x?C??", ENV{ID_NET_MANAGED_BY}="none"
SUBSYSTEM=="net", ACTION!="remove", ATTR{flags}=="0x?D??", ENV{ID_NET_MANAGED_BY}="none"
SUBSYSTEM=="net", ACTION!="remove", ATTR{flags}=="0x?E??", ENV{ID_NET_MANAGED_BY}="none"
SUBSYSTEM=="net", ACTION!="remove", ATTR{flags}=="0x?F??", ENV{ID_NET_MANAGED_BY}="none"

ACTION=="add|change", SUBSYSTEM=="net", ENV{ID_NET_MANAGED_BY}=="none", RUN+="/sbin/tc qdisc replace dev $env{INTERFACE} root noqueue"

LABEL="AZNET_END"