Open naioja opened 1 month ago
Hi @naioja, Have you found that the VM is unable to achieve the line speed with the default configuration?
Hi Jeremi,
At the moment a network packet flows through both queuing disciplines; that's not optimal and it does impact performance.
Interface enP39504s1
tc -s qdisc show dev enP39504s1
qdisc fq 8002: root refcnt 33 limit 100000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 3028b initial_quantum 15140b maxrate 12500Mbit low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
Sent 9906423866 bytes 6824193 pkt (dropped 0, overlimits 0 requeues 14677)
backlog 0b 0p requeues 14677
flows 110 (inactive 107 throttled 0)
gc 0 highprio 13 throttled 43812 latency 17.8us
Interface eth0
tc -s qdisc show dev eth0
qdisc fq 8001: root refcnt 65 limit 100000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 3028b initial_quantum 15140b maxrate 12500Mbit low_rate_threshold 550Kbit refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
Sent 9906434943 bytes 6824221 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
flows 108 (inactive 106 throttled 0)
gc 0 highprio 6 throttled 146756 latency 68.9us
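One way to quantify the double queueing is to compare the `Sent` counters of the two qdiscs: near-identical byte counts mean (almost) every packet traversed both. A small sketch that parses the sample lines quoted above (the `awk` extraction assumes the usual `tc -s qdisc show` output format):

```shell
# Extract the "Sent" byte counter from a line of `tc -s qdisc show` output.
sent_bytes() {
  printf '%s\n' "$1" | awk '/Sent/ {print $2; exit}'
}

vf='Sent 9906423866 bytes 6824193 pkt (dropped 0, overlimits 0 requeues 14677)'
eth='Sent 9906434943 bytes 6824221 pkt (dropped 0, overlimits 0 requeues 0)'

# The two counters differ by only ~11 KB, i.e. essentially all traffic was
# queued twice, once on each interface.
echo "VF:   $(sent_bytes "$vf")"
echo "eth0: $(sent_bytes "$eth")"
```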
Here are some remarks from looking at this:
So the consequences are the following:
a) If we were to follow standard bond semantics, we would configure netvsc with a noqueue qdisc and the VF with an mq qdisc, since the VF carries all the traffic. This has some awkward semantics: a netvsc with AN (accelerated networking) disabled would have a qdisc, and when a VF is added we would suddenly have to move the qdisc over to the VF. The same awkwardness applies to servicing events.
b) The opposite approach, netvsc mq + VF noqueue, seems more logical, BUT given the different number of queues I wonder whether it comes at a performance disadvantage compared to a).
In general, doing this via udev seems the sane approach, since users can override it. Otherwise, there is the option of upstreaming a patch that forces one of the two behaviors in the kernel:
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 11831a1c9762..1d9f9a6e0a9a 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -2349,6 +2349,8 @@ static int netvsc_prepare_bonding(struct net_device *vf_netdev)
if (!ndev)
return NOTIFY_DONE;
+ vf_netdev->priv_flags |= IFF_NO_QUEUE;
+
/* set slave flag before open to prevent IPv6 addrconf */
vf_netdev->flags |= IFF_SLAVE;
return NOTIFY_DONE;
(This is option b; option a) can be achieved by setting ndev->priv_flags instead.)
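For experimentation, the same effect can be approximated from userspace with tc instead of a kernel patch. Here is a sketch that only prints the command for review rather than running it (the device names are taken from the outputs in this thread and will differ on other VMs):

```shell
# Print (don't run) the tc command for each of the two options discussed above.
noqueue_cmd() {
  # "a" = noqueue on the synthetic device, "b" = noqueue on the VF
  if [ "$1" = "a" ]; then dev=eth0; else dev=enP39504s1; fi
  echo "tc qdisc replace dev $dev root noqueue"
}

noqueue_cmd a   # option a)
noqueue_cmd b   # option b)
```

Piping the output to `sh` (as root) would apply the chosen option until the next reboot or udev event, which is enough for a quick A/B performance test.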
I'll discuss with our kernel devs to see if they have any thoughts on this issue.
Here is a better rule set that matches the VF (by checking the "slave" bit in the interface flags):
SUBSYSTEM=="net", DRIVERS!="hv_pci", ACTION=="add", GOTO="AZNET_END"
# Match any flags value with the IFF_SLAVE bit (0x800) set, i.e. the third hex digit is 8-9 or a-f.
SUBSYSTEM=="net", ACTION!="remove", ATTR{flags}=="0x?[89a-fA-F]??", ENV{ID_NET_MANAGED_BY}="none"
ACTION=="add|change", SUBSYSTEM=="net", ENV{ID_NET_MANAGED_BY}=="none", RUN+="/sbin/tc qdisc replace dev $env{INTERFACE} root noqueue"
LABEL="AZNET_END"
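The flags matching above boils down to one condition: the IFF_SLAVE bit (0x800) is set in the value read from /sys/class/net/&lt;dev&gt;/flags, i.e. the third hex digit is 8-9 or a-f. A small shell sketch of the same test (the sample flag values are illustrative, not taken from the VM above):

```shell
# Return success iff the 0x800 (IFF_SLAVE) bit is set in a 4-digit flags
# string as read from /sys/class/net/<dev>/flags.
is_slave_flags() {
  case "$1" in
    0x?[89a-fA-F]??) return 0 ;;
    *) return 1 ;;
  esac
}

is_slave_flags 0x1843 && echo "slave"      # e.g. UP|BROADCAST|RUNNING|SLAVE|MULTICAST
is_slave_flags 0x1043 || echo "not slave"  # same flags without IFF_SLAVE
```

Note that, like the udev rules, this assumes a 4-digit flags value; in practice IFF_MULTICAST (0x1000) is set on these interfaces, so that holds.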
In Azure, when it comes to Linux networking performance, we use SR-IOV with Mellanox drivers (mlx4 or mlx5). Something specific to Azure is that this creates two interfaces, a synthetic one and a virtual function (VF) one; documentation about it can be found here: https://learn.microsoft.com/en-us/azure/virtual-network/accelerated-networking-how-it-works.
I believe the actual bond is created by the hv_netvsc kernel module, and as we can see in the tc -s qdisc output from an Azure Flatcar LTS VM image, the enP* interface is picked up by the OS as a stand-alone interface and gets a qdisc attached to it.
I believe this is a faulty configuration in Azure Flatcar VMs using SR-IOV (accelerated networking), since we normally do not apply queuing disciplines to bridged or bonded interfaces such as docker0 or virbr0.
This also has implications for how systemd applies the default net.core.default_qdisc setting through /lib/sysctl.d/50-default.conf.
One simple fix I found was to apply the following "tuned" udev configuration for interface queuing disciplines:
Specifically, for my tests I set the maxrate of the fq queuing discipline to match the VM SKU line speed of 12.5 Gbit/s, and I raised the packet limit to 100K. The limits set upstream are somewhat arbitrary and may not be the best choice for cloud VMs that are expected to deliver higher networking performance.
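For reference, the arithmetic behind that tuning: tc prints the rate in Mbit, and the fq default packet limit is 10000 (per tc-fq(8)), so the configuration above is a 10x increase:

```shell
# 12.5 Gbit/s line rate expressed the way tc prints it (Mbit),
# matching "maxrate 12500Mbit" in the qdisc output above.
echo "$((12500000000 / 1000000)) Mbit"

# Packet limit raised from the fq default of 10000 to 100000.
echo "$((100000 / 10000))x the default limit"
```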