LinuxCNC / linuxcnc

LinuxCNC controls CNC machines. It can drive milling machines, lathes, 3d printers, laser cutters, plasma cutters, robot arms, hexapods, and more.
http://linuxcnc.org/
GNU General Public License v2.0

Network latency breaks Mesa ethernet connection #2281

Open rodw-au opened 1 year ago

rodw-au commented 1 year ago

I'm raising this as an issue here, but it's really an upstream Debian issue.

Refer to this 22-page thread on the forum: https://forum.linuxcnc.org/27-driver-boards/46911-mesa-hm2-hm2-7i96s-0-error-finishing-read?start=0 This seems to affect Realtek cards more than other NICs, but not exclusively. I suspect the Realtek cards are slower than Intel NICs. One user also reported timeout errors on EtherCAT with larger numbers of slaves.

Excessive latency on the patched Debian kernel leaves insufficient time for the realtime thread to communicate with the Mesa card during a servo thread cycle. The Mesa card then shuts down and operation is disabled until a restart. Testing by various users, myself included, shows the issue was introduced in the linux-image-rt (PREEMPT_RT) kernel 5.10 (Bullseye) and still exists in kernel 6.1 (the current kernel).
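To make the timing constraint concrete, here is a minimal sketch of the budget check described above. It is not how hm2_eth decides on a timeout; the 1000 µs servo period, the `budget_fraction` parameter, and the sample values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class BudgetReport:
    samples: int      # number of round trips examined
    worst_us: float   # slowest round trip seen, in microseconds
    overruns: int     # round trips that exceeded the budget


def check_read_budget(round_trip_us, servo_period_us=1000.0, budget_fraction=0.8):
    """Count round trips that would not fit inside one servo cycle.

    servo_period_us and budget_fraction are hypothetical values: the
    fraction stands for the share of the servo period that can be spent
    waiting on the network before the cycle's other work runs late.
    """
    budget_us = servo_period_us * budget_fraction
    overruns = sum(1 for rtt in round_trip_us if rtt > budget_us)
    worst_us = max(round_trip_us, default=0.0)
    return BudgetReport(len(round_trip_us), worst_us, overruns)


# Made-up round-trip times in microseconds; one latency spike at 950 us.
example_rtts = [180.0, 210.0, 195.0, 950.0, 205.0]
report = check_read_budget(example_rtts)
print(report)
```

The point of the sketch is that a single spike is enough: average latency can look fine while one slow reply still overruns the cycle.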

Compiling a PREEMPT_RT kernel from upstream (kernel.org) source solves the issue, so it is clearly due to something in the Debian patches at https://salsa.debian.org/kernel-team/linux/-/tree/master/debian/patches-rt
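For anyone comparing machines, a rough sketch like the following can report whether a given kernel looks like a Debian-built RT kernel in the range reported in this issue. It relies on assumptions: that Debian builds include the word "Debian" in the `uname -v` string, that RT kernels advertise `PREEMPT_RT` (or `PREEMPT RT` on older kernels) there, and that 5.10 is the first affected version as reported above. The function name and the example strings are hypothetical.

```python
import platform
import re


def describe_kernel(release, version):
    """Classify a kernel from its `uname -r` and `uname -v` strings."""
    match = re.match(r"(\d+)\.(\d+)", release)
    major_minor = (int(match.group(1)), int(match.group(2))) if match else None
    # Assumption: RT kernels advertise themselves in the version string.
    is_rt = "PREEMPT_RT" in version or "PREEMPT RT" in version
    # Assumption: Debian-built kernels carry "Debian" in the version string.
    is_debian_build = "Debian" in version
    # 5.10 is the first affected kernel according to the reports in this issue.
    in_reported_range = major_minor is not None and major_minor >= (5, 10)
    return {
        "major_minor": major_minor,
        "is_rt": is_rt,
        "is_debian_build": is_debian_build,
        "suspect": is_rt and is_debian_build and in_reported_range,
    }


# Illustrative strings only; on a live system pass platform.release()
# and platform.version() instead.
debian_rt = describe_kernel("6.1.0-9-rt-amd64", "#1 SMP PREEMPT_RT Debian 6.1.27-1 (2023-05-08)")
old_rt = describe_kernel("4.9.0-13-rt-amd64", "#1 SMP PREEMPT RT Debian 4.9.228-1 (2020-07-05)")
self_built = describe_kernel("6.1.0-rt5", "#1 SMP PREEMPT_RT Tue Jan 3 10:00:00 UTC 2023")
print(debian_rt["suspect"], old_rt["suspect"], self_built["suspect"])
print(describe_kernel(platform.release(), platform.version()))
```

A "suspect" result only means the kernel matches the pattern reported here; it does not prove the latency problem is present on that machine.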

To assist with reviewing this, I have compiled .deb files for linux-headers and the PREEMPT_RT kernel for kernel version 6.1 from pristine sources here: https://drive.google.com/drive/folders/10uwGg5RvZDDlLtQ8BZhM3At_gODk16na Some of the ancillary files outline the steps I took to compile this kernel. I posted a sticky on the forum here: https://forum.linuxcnc.org/9-installing-linuxcnc/47696-installing-linuxcnc-and-debian-bookworm-on-problematic-hardware-eg-realtek-nic A number of users have reported that this solved their issue.

I did report this issue to Debian here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1022170 but at that time I did not fully understand the issue. It would be great if someone with more authority could raise this as an issue against linux-image-rt (5.10 to 6.1).

I am raising this because once Bookworm becomes the Debian stable branch in a few short months, many users will be affected by this issue and that will reflect poorly on this project.

petterreinholdtsen commented 1 year ago

Sounds correct. Make sure you have a low-latency network and network card (preferably a dedicated one) to avoid unpredictable latency.

-- Happy hacking Petter Reinholdtsen

rodw-au commented 1 year ago

> Sounds correct. Make sure you have a low-latency network and network card (preferably a dedicated one) to avoid unpredictable latency. -- Happy hacking Petter Reinholdtsen

It's not as simple as that. The hardware is low latency and works on kernels up to 4.9. The network with the Mesa card is point to point on a dedicated network segment, so it is low latency by definition. It fails from Bullseye onward with any Debian-provided kernel.

petterreinholdtsen commented 1 year ago

[Rod Webster]

> It's not as simple as that. The hardware is low latency and works on kernels up to 4.9. The network with the Mesa card is point to point on a dedicated network segment, so it is low latency by definition. It fails from Bullseye onward with any Debian-provided kernel.

OK. The original message was sparse on information, and I guess we still need more information to understand how this is an issue with LinuxCNC. It seems like you are saying that Linux kernels above 4.9 introduce unwanted latency. If so, I would guess that has to be addressed in the Linux kernel, not LinuxCNC. Can you tell why you believe this is an issue with LinuxCNC?

-- Happy hacking Petter Reinholdtsen

rodw-au commented 1 year ago

> It seems like you are saying that Linux kernels above 4.9 introduce unwanted latency. If so, I would guess that has to be addressed in the Linux kernel, not LinuxCNC. Can you tell why you believe this is an issue with LinuxCNC? -- Happy hacking Petter Reinholdtsen

Yes, the Debian RT kernel patches, commencing with the RT 5.10 kernel, introduce network latency that is not present in the upstream kernel.org sources. Where LinuxCNC is communicating in realtime with devices on an Ethernet connection and latency exceeds acceptable limits, an "error finishing read" is generated and the device is disabled. This "bricks" LinuxCNC, rendering it useless. A program restart is required to clear the error, but this does not solve the problem and it will recur. I have experienced this on 4-5 PCs personally, and many other experienced early adopters have been affected by this issue. It has held me up on one project for many months while seeking a resolution. In many instances later kernels are required to provide driver support for the affected PCs, so staying on 4.x kernels is not a viable option.

Once LinuxCNC rolls out version 2.9, it will be on the Bookworm or Bullseye platform. A large number of users will be impacted, as existing working PC hardware will fail on the later Debian kernels. This will reflect poorly on this project.

Clearly, the LinuxCNC project will be significantly impacted if no action is taken when Bookworm is released.

It would be prudent for the LinuxCNC project to take up this upstream issue with the Debian linux-image-rt developers in an attempt to have it resolved before the impending hard freeze.

I don't have the skills to review the kernel patches. If I were to hazard a guess at the problem, I would investigate the "lazy preemption" patches, which I think were introduced in RT kernel 5.10.

ihabmmali commented 1 year ago

I concur with the above analysis. I was struggling with the exact same issue, thinking it was a hardware problem with the mini PC I was using. Reverting to RT kernel 4.9 resolved all the problems immediately. The problem is the compiled versions of 5.10 and 6.0 that are available in the Debian repos. These are the kernel versions one uses when following the LinuxCNC documentation and compiling for a RIP or package-based installation.
I have not attempted to compile either from source, but the information provided above makes sense. It's not necessarily a LinuxCNC issue, but highlighting it here is useful to a) ensure we are tracking the problem and b) prompt communication with the Debian team to correct the issues with the compiled kernels available in their official repos.

rodw-au commented 1 year ago

> I concur with the above analysis. I was struggling with the exact same issue, thinking it was a hardware problem with the mini PC I was using. Reverting to RT kernel 4.9 resolved all the problems immediately.

You were fortunate that the 4.9 kernel remained viable for your hardware. If it had been equipped with, say, the Realtek R8125 NIC used in many recent mini PCs, you would have found that driver support was not introduced until the 5.9 kernel, so Bullseye or later is required short of using the manufacturer's NIC driver. But even that has become problematic, as Realtek's driver supports kernels only up to 5.19 and Bookworm has moved past that now.