Closed lbckmnn closed 2 years ago
I have noticed this change in behaviour during the Linux kernel progression. The timing problem is created by how Linux internally handles the socket receive buffer. If you dig deeper into your timing you will see the latency increase is mostly in the recv() function. Its internal handling has changed to optimize interrupt handling, but at the cost of latency. You can get better timing by setting the socket to non-blocking. Original in nicdrv.c:
/* we use RAW packet socket, with packet type ETH_P_ECAT */
*psock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ECAT));
Modified:
/* we use RAW packet socket, with packet type ETH_P_ECAT */
*psock = socket(PF_PACKET, SOCK_RAW | SOCK_NONBLOCK, htons(ETH_P_ECAT));
However, this will lead to busy polling on the SOEM side and increase CPU usage a lot. You decide which behaviour suits your application best.
I am experimenting with the use of ppoll(). It has a timeout capability that uses the hrtimer instead of jiffies. It looks good on my platform, but it needs to be tested on everything that is out there. Often what I thought to be improvements turn out to be worse on some platforms.
See also #605 and #451
Another remark: do not set RT priorities over 49. The kernel IRQ thread priority is 50. Setting your task priority higher will lead to starvation of internal kernel processes that you depend on (e.g. socket handling). It is possible to do it on isolated CPUs, but then again, any priority level will suffice there.
Thank you very much for the answer. I tried both, just adding SOCK_NONBLOCK, and the solution from: https://github.com/OpenEtherCATsociety/SOEM/issues/451#issuecomment-1049649671 (ppoll with a local DONT_WAIT in receive). The latter solution seems to be the better one. I can provide a patch file or PR if needed or welcome.
However, there is still some strange behavior I can't really explain: I lose significantly more frames with a real network interface than with a USB adapter. I ran the same test as above (on Debian 11), but now with real-time priority 40 and for 30 minutes.
USB Ethernet Adapter | RTL8111g over mini PCIe
---|---
dropped frames: 10 | dropped frames: 161
I will look into this with Wireshark, but I don't think these frames are actually lost.
Regarding the IRQ priorities: what is your opinion on setting the priority of the Ethernet IRQ handler to something higher than 50?
I think you are suffering from interrupt coalescing. You can use the ethtool tool to disable it. For a nice write-up on packet latency see: https://blog.cloudflare.com/how-to-achieve-low-latency/ It would be nice to see how far you can drive latency down. The graphs you present are very informative.
@ArthurKetels I already executed the
ethtool -C eth0 rx-usecs 0 rx-frames 1 tx-usecs 0 tx-frames 1
command from drvcomment.txt. Shouldn't that disable interrupt coalescing?
Some newer NIC drivers use other parameters. Use ethtool -c to list the options for your NIC. Play around a bit to figure out what works and what doesn't.
The output of ethtool -c eth0 is:
Coalesce parameters for eth1:
Adaptive RX: n/a TX: n/a
stats-block-usecs: n/a
sample-interval: n/a
pkt-rate-low: n/a
pkt-rate-high: n/a
rx-usecs: 0
rx-frames: 1
rx-usecs-irq: n/a
rx-frames-irq: n/a
tx-usecs: 0
tx-frames: 1
tx-usecs-irq: n/a
tx-frames-irq: n/a
rx-usecs-low: n/a
rx-frame-low: n/a
tx-usecs-low: n/a
tx-frame-low: n/a
rx-usecs-high: n/a
rx-frame-high: n/a
tx-usecs-high: n/a
tx-frame-high: n/a
All options with n/a seem to be unsupported and cannot be changed with -C. I will try some options from your second link. Edit: I tried setting the CPU affinity of the RT thread to one CPU only, but that does not seem to make a big difference.
Hmm, I know the kernel has undergone some significant changes around IRQ handling around 5.10 (and this is still ongoing). My suggestion is to build your own preempt-rt patched kernel. It is not that difficult. Only build those features that you really need and turn off everything else. Take the latest kernel from kernel.org (not the Debian-patched one) that is supported by the preempt-rt patch. I got very good results from those home-built kernels.
There are many blogs posted on the internet about optimizing latency with tweaked kernels. Low-hanging fruit is, for example, the video driver. Do not use the nvidia drivers (nouveau is kinda OK). Kick out all task and socket governors.
Setting CPU affinity only helps for your task, not for kernel-related latency.
I found this: https://www.spinics.net/lists/linux-rt-users/msg23900.html This seems to be the same problem. For now I ended up just installing the buster kernel (4.19.0-17-rt-amd64). This produces basically the same plots as in my very first post and also drops no frames. The kernel seems to work just fine with the Debian 11 userland.
I would be very interested in a .config if someone manages to build a 5.x kernel which doesn't have these problems. Also, thank you very much for your help.
Hi, I recently upgraded from Debian 10 (with the Debian RT kernel 4.19.0-17-rt-amd64 installed) to Debian 11 (5.10.0-14-rt-amd64), resulting in some strange behavior in two different applications: one application has problem 1) and the other application has problem 2).
I created a simple benchmark program to debug this. It is basically the same as simple_test. The differences to simple_test are:
I ran the application on a system with Debian 10 and on a system with Debian 11 for 10 minutes against a few Beckhoff terminals, while also running:
stress --cpu 4 --io 2 --vm 2 --vm-bytes 128M --hdd 2
The command from drvcomment.txt was also applied. The network interface in use seems to be a Realtek 8111g. Unfortunately, problem 2) could not be reproduced, as there were only two WKC mismatches, but problem 1) is visible in the histogram:
I tried the same program with a cheap USB Ethernet adapter. This works just fine (with higher latency, of course), so I guess the problem is the network driver?
Also, cyclictest on both systems does not indicate any problems.
Is there anything I can do about this?