LinuxCNC / linuxcnc

LinuxCNC controls CNC machines. It can drive milling machines, lathes, 3d printers, laser cutters, plasma cutters, robot arms, hexapods, and more.
http://linuxcnc.org/
GNU General Public License v2.0

"hm2/hm2.7i96.0: error finishing read" on 5.6.x rt-kernels only (UPDATE: and 5.4.x kernels) #927

Open tinic opened 4 years ago

tinic commented 4 years ago
  1. Install a 5.6.x rt-kernel on a recent Debian based distro
  2. Install and run linuxcnc (git 2.9.0~pre0 HEAD) + Mesa 7i96 card + basic stepper config

This is what I expected to happen:

No issues

This is what happened instead:

At random intervals I get the error "hm2/hm2.7i96.0: error finishing read", which points towards a read timeout. With Realtek Ethernet controllers it can take up to 10 minutes to get the timeout. LinuxCNC cannot recover from this error, as expected.

Workaround: Set # of CPU cores in BIOS to 1 (single CPU). Disable hyperthreading.

It worked properly before this:

Install and boot a 4.19.x rt-kernel and there are no issues, even with all CPU cores and hyperthreading enabled. I have not tried to track down where the regression was introduced; it could be anywhere between the 4.19.x and 5.6.x kernels.

Information about my hardware and software:

01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)

Intel Ethernet controllers seem to be behaving WORSE, i.e. you get the timeout almost instantly.

http://ftp.debian.org/debian/pool/main/l/linux-signed-amd64/linux-image-5.6.0-0.bpo.2-rt-amd64_5.6.14-2~bpo10+1_amd64.deb

http://ftp.debian.org/debian/pool/main/l/linux-signed-amd64/linux-image-4.19.0-10-rt-amd64_4.19.132-1_amd64.deb

Intel(R) Celeron(R) CPU J1900 @ 1.99GHz

Also repro'd on an i7.

00:00.0 Host bridge: Intel Corporation Atom Processor Z36xxx/Z37xxx Series SoC Transaction Register (rev 0e)
00:02.0 VGA compatible controller: Intel Corporation Atom Processor Z36xxx/Z37xxx Series Graphics & Display (rev 0e)
00:13.0 SATA controller: Intel Corporation Atom Processor E3800 Series SATA AHCI Controller (rev 0e)
00:14.0 USB controller: Intel Corporation Atom Processor Z36xxx/Z37xxx, Celeron N2000 Series USB xHCI (rev 0e)
00:1a.0 Encryption controller: Intel Corporation Atom Processor Z36xxx/Z37xxx Series Trusted Execution Engine (rev 0e)
00:1b.0 Audio device: Intel Corporation Atom Processor Z36xxx/Z37xxx Series High Definition Audio Controller (rev 0e)
00:1c.0 PCI bridge: Intel Corporation Atom Processor E3800 Series PCI Express Root Port 1 (rev 0e)
00:1c.1 PCI bridge: Intel Corporation Atom Processor E3800 Series PCI Express Root Port 2 (rev 0e)
00:1c.2 PCI bridge: Intel Corporation Atom Processor E3800 Series PCI Express Root Port 3 (rev 0e)
00:1c.3 PCI bridge: Intel Corporation Atom Processor E3800 Series PCI Express Root Port 4 (rev 0e)
00:1f.0 ISA bridge: Intel Corporation Atom Processor Z36xxx/Z37xxx Series Power Control Unit (rev 0e)
00:1f.3 SMBus: Intel Corporation Atom Processor E3800 Series SMBus Controller (rev 0e)
01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
03:00.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 41)

andypugh commented 4 years ago

This is probably an interesting problem for the future, but are any current distributions using the 5.6 kernel? We currently expect LinuxCNC to work on kernels from 2.6.32 (yes, really) to 4.19 http://wiki.linuxcnc.org/cgi-bin/wiki.pl?MinimumSoftwareVersions

tinic commented 4 years ago

From what I remember, the same issue occurred with 5.4.x kernels when I fought with this a while back, before finally tracking it down to the kernel last week. 5.4.x kernels are the default in Ubuntu 18.04.5 LTS and Ubuntu 20.04 LTS. As Ubuntu is rather popular, I think a lot of people will run into this going forward. It would be good to at least keep the bug open so people can find it.

thomam04 commented 4 years ago

I am running Linux Mint 19.3 with Kernel 5.6.x-rt and I had similar issues. The problem for me was "IRQ coalescing".

1.) Install "ethtool"
cnc@LinuxCNC:~$ sudo apt-get install ethtool

2.) Verify coalescing settings (should all read "0" / rx-usecs is usually not "0")
cnc@LinuxCNC:~$ ip a

cnc@LinuxCNC:~$ ethtool -c enp3s0f0
Coalesce parameters for enp3s0f0:
Adaptive RX: off TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 0
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 0
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 0

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

3.) Change settings to "0"
cnc@LinuxCNC:~$ sudo ethtool -C enp3s0f0 rx-usecs 0

4.) Test again
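The check in step 2 can be automated. The following is a minimal sketch, not part of LinuxCNC or ethtool: it assumes the `ethtool -c` output uses the `name: number` layout shown above, and the function names `parse_coalesce` and `nonzero_settings` are made up for illustration. Values that a driver reports as non-numeric (for example `n/a`) are simply skipped.

```python
import re


def parse_coalesce(text):
    """Return {parameter: int} for every 'name: number' pair in `ethtool -c` output.

    Only lowercase, hyphenated parameter names followed by an integer are
    picked up, so the "Coalesce parameters for <iface>:" header and the
    "Adaptive RX: off TX: off" line are ignored.
    """
    pairs = re.findall(r"([a-z][a-z-]*):\s*(-?\d+)", text)
    return {name: int(value) for name, value in pairs}


def nonzero_settings(text):
    """Return only the coalescing parameters whose value is not 0."""
    return {name: value for name, value in parse_coalesce(text).items() if value != 0}
```

The text could come from something like `subprocess.run(["ethtool", "-c", iface], capture_output=True, text=True).stdout`. An empty result from `nonzero_settings` means every numeric coalescing parameter already reads 0.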

tinic commented 4 years ago

Thanks thomam04! Unfortunately your suggestion makes no difference for me, as rx-usecs-irq is already 0. It seems the r8169 driver does not support this in the first place. Also, any parameter I set reverts to its default value; for instance, I can't change rx-frames to 0, it won't take. My output:

turo@lathev2:~$ ethtool -c enp1s0
Coalesce parameters for enp1s0:
Adaptive RX: off TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 0
rx-frames: 1
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 0
tx-frames: 1
tx-usecs-irq: 0
tx-frames-irq: 0

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

I've also tried to disable irqbalance and played around setting the irq affinities to have that Ethernet port in question exclusively on one CPU, to no avail.
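For anyone repeating the IRQ-affinity experiment, here is a small sketch of the two values involved. It relies on the Linux convention that `/proc/irq/<n>/smp_affinity` takes a hexadecimal CPU bitmask. The helper names `affinity_mask` and `irqs_for` are hypothetical, and the sketch is only meant for machines with 32 or fewer CPUs. It does not write anything to `/proc`; that step needs root and is left to the reader.

```python
def affinity_mask(cpus):
    """Hex bitmask for /proc/irq/<n>/smp_affinity.

    Bit i set means CPU i is allowed to service the interrupt,
    so CPU 3 alone gives "8" and CPUs 0 and 1 give "3".
    """
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def irqs_for(interrupts_text, device):
    """IRQ numbers from /proc/interrupts-style text whose line mentions `device`."""
    irqs = []
    for line in interrupts_text.splitlines():
        head, sep, rest = line.partition(":")
        # Numbered IRQ lines look like " 27:  <counts...>  <chip>  <device>";
        # named rows such as "NMI:" and the CPU header row are skipped.
        if sep and head.strip().isdigit() and device in rest:
            irqs.append(int(head.strip()))
    return irqs
```

With the text of `/proc/interrupts` in hand, `irqs_for(text, "enp1s0")` would list the interrupts to pin, and `affinity_mask([3])` gives the value to write for "CPU 3 only". As reported above, pinning did not cure the timeouts in this case.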

tinic commented 4 years ago

I can confirm that the issue also occurs with 5.4.x kernels such as:

http://ftp.debian.org/debian/pool/main/l/linux-signed-amd64/linux-image-5.4.0-0.bpo.2-rt-amd64_5.4.8-1~bpo10+1_amd64.deb

tinic commented 4 years ago

Result of latency test if that matters:

latency-test

pcw-mesa commented 4 years ago

My general impression is that newer kernels will suffer from latency issues for a fair amount of time. I would not expect 5.6 to be usable for at least a year. I had similar issues with 5.4 (bursts of Ethernet latency exceeding the timeout threshold, causing read timeouts), but the very latest 5.4 (5.4.54-rt33) seems OK; at least I have had it running a 7I96 and using it as my normal desktop for about a week with no issues.

Note that with the default settings, 5 timeout errors in a row will cause the "error finishing read" fault, and that the default timeout is 80% of the servo period. It is possible to recover from this error by setting the io_error parameter false (you would likely also have to reset the watchdog and restart any sserial ports).
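The rule described here (timeout at 80% of the servo period, fault after 5 timeouts in a row) can be illustrated with a small model. This is not the hm2_eth driver code, only a sketch of the behaviour as stated in the comment. The function names are hypothetical, and the 1 ms servo period in the usage note is an assumed example value, not one taken from this thread.

```python
def read_timeout_ns(servo_period_ns, percent=80):
    """Read timeout as a percentage of the servo period (80% per the comment above)."""
    return servo_period_ns * percent // 100


def first_fault_cycle(latencies_ns, servo_period_ns, max_consecutive=5, percent=80):
    """Return the index of the servo cycle at which the fault would trip, or None.

    A cycle counts as a timeout when its read latency exceeds the timeout.
    Any successful read resets the consecutive-timeout counter.
    """
    timeout = read_timeout_ns(servo_period_ns, percent)
    consecutive = 0
    for cycle, latency in enumerate(latencies_ns):
        if latency > timeout:
            consecutive += 1
            if consecutive >= max_consecutive:
                return cycle
        else:
            consecutive = 0
    return None
```

Assuming a 1 ms servo period (1,000,000 ns), the timeout works out to 800,000 ns. Under this model a single good read between bursts resets the count, so a latency burst trips the fault only if it spans five consecutive servo cycles.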

jethornton commented 2 years ago

I've had this same issue on several PCs running 2.9. The current one has the 5.18-3-rt-amd64 kernel.

john@cave:~$ uname -a
Linux cave 5.18.0-3-rt-amd64 #1 SMP PREEMPT_RT Debian 5.18.14-1 (2022-07-23) x86_64 GNU/Linux

Executing process lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 5 5600X 6-Core Processor
CPU family: 25
Model: 33
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
Stepping: 2
Frequency boost: enabled
CPU(s) scaling MHz: 95%
CPU max MHz: 4650.2920
CPU min MHz: 2200.0000
BogoMIPS: 7400.43

(image: Screenshot at 2022-08-27 09-37-02)

pcw-mesa commented 2 years ago

If the PC has an Intel MAC, have you disabled IRQ coalescing?

In general, 5.x kernels have much worse network latency than 4.x or earlier kernels (at least on some systems).
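To see which side of that line a machine falls on, the kernel release string (the third field of `uname -a`) can be split into its series. The sketch below is an illustration only: the names `parse_release` and `newer_than_known_good` are invented, the 4.19 cut-off is simply the last series reported as trouble-free in this thread, and looking for "-rt" in the release string is an assumed heuristic for spotting an rt kernel build. A newer series is not proof of a problem, since 5.4.54-rt33 was reported as OK above.

```python
import re


def parse_release(release):
    """Split a kernel release string into (major, minor, is_rt).

    is_rt is a heuristic: it only checks for "-rt" in the release string.
    """
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2)), "-rt" in release


def newer_than_known_good(release, known_good=(4, 19)):
    """True if the kernel series is newer than the last one reported as trouble-free."""
    major, minor, _ = parse_release(release)
    return (major, minor) > known_good
```

On a running system the release string is available from `platform.release()` in the Python standard library.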