Open LaurenceGough opened 10 months ago
Once the system boots and basic packets flow, I doubt any repair step would solve it.
The one thing we recently changed is some network performance tuning.
Can you try these two things (together) and see if that makes it go away?
echo 0 > /proc/sys/net/core/busy_poll
echo 0 > /proc/sys/net/core/busy_read
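A minimal sketch of the same two writes, wrapped so the previous values are printed before they are zeroed (useful when reporting back what the tuning had set). The `NET_CORE_DIR` variable and the function name are hypothetical, added only so the function can be tried against a scratch directory; on a real system it needs root.

```shell
#!/bin/sh
# Sketch: print and then zero the two busy-polling knobs named above.
# NET_CORE_DIR is a hypothetical override for trying this on a scratch directory.
NET_CORE_DIR="${NET_CORE_DIR:-/proc/sys/net/core}"

disable_busy_polling() {
    for knob in busy_poll busy_read; do
        file="$NET_CORE_DIR/$knob"
        old=$(cat "$file") || return 1      # remember the value being replaced
        echo 0 > "$file" || return 1        # same write as the echo commands above
        echo "$knob: $old -> $(cat "$file")"
    done
}
```

Run `disable_busy_polling` as root. Like the plain echo commands, the change does not survive a reboot.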
On Wed, Jan 3, 2024 at 2:20 PM LaurenceGough @.***> wrote:
Hello,
I have spent many hours today investigating this issue. I have a ClearLinux server on a mini PC. This PC has a Realtek rtl8168h Ethernet Interface.
Ever since what I assume was the last automatic system update I cannot run anything which puts a medium to high load on the network interface. CPU and local processing is all fine. No other changes have been made apart from automatic ones.
Testing with Ubuntu live USB does not have the issue. Using a USB C Ethernet adapter (same cable and port) does not have the issue. Using the live ClearLinux server USB bootable has the exact same issue. There are no related logs that I could find in the journal.
version: 6.6.9-1394.native
firmware-version: rtl8168h-2_0.0.2 02/26/15
expansion-rom-version:
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no
lsmod | grep r8169
r8169 135168 0
mdio_devres 12288 1 r8169
libphy 225280 3 r8169,mdio_devres,realtek
01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
Is the info on it.
Running a speedtest, or any container that generates moderate to high network traffic, causes the vast majority of pings to and from the device to drop (sometimes it's 4-5 seconds per ping) for up to 5 minutes, depending on how long the attempt lasts. It is 100% repeatable. I have confirmed everything else on the network is fine. Pings to the gateway are just fine at the same time. SSH goes down of course, so I am having to use the console.
Perhaps an issue with the latest Stable 6.6.9-1394 Linux Kernel? Doing research on this issue finds nothing at all.
I have followed the instructions here and various other repairs but no luck. https://github.com/clearlinux/clear-linux-documentation/blob/master/source/guides/maintenance/fix-broken-install.rst
Any help would be much appreciated.
Thanks,
Laurence
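The symptom described in this report (pings dropping while the NIC is under load) can be put into numbers with a small helper. This is only a sketch: the function name is made up, and it assumes the summary line format printed by the iputils `ping` ("N% packet loss").

```shell
#!/bin/sh
# Sketch: extract the packet-loss percentage from ping's summary output.
# Assumes an iputils-style summary line containing "N% packet loss".
packet_loss_percent() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}
```

From a second machine, something like `ping -c 20 <address of the affected host> | packet_loss_percent` run once while the host is idle and once during a speed test gives two figures to compare.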
Where were you 10 hours ago? ;) ;) Hehe....
Problem solved now, I've noticed these changes are not permanent and go after a reboot, so I will look at editing the sysctl.conf to add them.
Ping times are now a rock solid <1ms as it should be. I must admit... I am not too keen on this network tuning... It has caused me to reset everything, unfortunately I've lost all of my BIOS settings and system tweaks but life goes on! I believe I am on the stable build??? if there is a more stable one such as a LTS please let me know!
Many thanks,
Laurence
(I haven't closed in case you would like me to test something else etc, please feel free to close).
the tuning is turning on a feature called "NAPI" .... which is supposed to help network performance under high load (well not just supposed, it does in our measurements)
however this needs device driver code and it appears the 8169 driver is buggy here
I'll patch our 8169 driver to not turn on NAPI in our next release so that this is permanent...
Many thanks for that. I am having real trouble getting these changes to stick, they keep reverting after a reboot. Would you have any tips? As it becomes unusable without these changes (when any containers are running) I need them to stick.
Thanks again
it's done by the clr-power-tweaks systemd service (could be with _ .. on my cell phone so can't easily check)
if you systemctl disable that service it'll stick
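For the sysctl.conf route mentioned earlier in the thread, a hedged sketch of a drop-in file follows. The file name `99-busy-poll-off.conf`, the `SYSCTL_D` override and the function name are illustrative assumptions, not something Clear Linux ships. If a boot-time service rewrites the values after the sysctl files are applied, the drop-in alone will not hold, which matches the advice above to disable that service.

```shell
#!/bin/sh
# Sketch only: persist the two settings from this thread in a sysctl drop-in.
# The file name and the SYSCTL_D override are hypothetical choices.
SYSCTL_D="${SYSCTL_D:-/etc/sysctl.d}"

write_busy_poll_dropin() {
    mkdir -p "$SYSCTL_D" || return 1
    cat > "$SYSCTL_D/99-busy-poll-off.conf" <<'EOF'
# Mirrors: echo 0 > /proc/sys/net/core/busy_poll (and busy_read)
net.core.busy_poll = 0
net.core.busy_read = 0
EOF
}
```

As root, `write_busy_poll_dropin && sysctl --system` writes the file and applies it immediately. Since the exact unit name is uncertain in the comment above, `systemctl list-unit-files | grep -i power` is one way to look it up before running `systemctl disable` on it.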
@fenrus75 can you confirm this is fixed in 40610 (https://cdn.download.clearlinux.org/releases/40610/clear/RELEASENOTES)?
Changes in package linux (from 6.6.9-1394 to 6.6.9-1395):
Arjan van de Ven - version bump from 6.6.9-1394 to 6.6.9-1395
Arjan van de Ven - disable napi on realtek based on issue 3018
It'll be fixed in the next release; 40610 has only the first one disabled (which might be enough, but might not be).
Decided to check out 40610. Turns out it's completely borked for me: I can't run any sudo commands. I then decided to do a fresh install, and I'm ending up with the same result.
Hello,
I have a mini-PC with a Realtek RTL8168 NIC and have also been struggling for a few days, and changing "busy_poll" and "busy_read" to 0 resolves the issue until the next reboot.
Yesterday evening, I tried installing 40610, and the installation environment failed to load, stopping at this point:
[image: J5Qqu6X.png]
For comparison's sake, here is 40600's installation environment, which loads successfully, albeit still with the network issue:
[image: J5QqTGt.png]
I also want to mention that I proceeded with the 40600 installation using the fix, but when I tried to implement the fix again on the first boot, the system became unusable after running "sudo systemctl restart NetworkManager". By unusable, I mean it seemed like the command made the system hang without any error messages, and using CTRL+C or CTRL+ALT+DEL did nothing.
I hope this helps.
hmm this is a bit sad in that it means we can't enable NAPI by default because it breaks on the realtek nics ... even though it gives a nice perf boost for other nics :(
I'm undoing all the tuning at this point -- we are not really able to only apply this for !realtek in how we do our tuning
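For anyone who wants a local, per-machine version of that "!realtek" condition, here is a rough sketch that lists the interfaces bound to a given driver by reading sysfs. It is not how the distribution applies its tuning (the comment above says that mechanism cannot do this); the function name and the `SYS_NET_DIR` override are hypothetical.

```shell
#!/bin/sh
# Sketch: print the network interfaces whose bound driver matches a name.
# SYS_NET_DIR is a hypothetical override so this can be tried on a scratch tree.
SYS_NET_DIR="${SYS_NET_DIR:-/sys/class/net}"

ifaces_using_driver() {
    want="$1"
    for dev in "$SYS_NET_DIR"/*; do
        link="$dev/device/driver"
        [ -L "$link" ] || continue          # virtual devices such as lo have no driver link
        drv=$(basename "$(readlink "$link")")
        if [ "$drv" = "$want" ]; then
            basename "$dev"
        fi
    done
    return 0
}
```

A local boot script could then zero busy_poll and busy_read only when `ifaces_using_driver r8169` prints something, leaving the tuning in place on machines with other NICs.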
Firstly, I want to express my gratitude for your prompt response. Although I only use Clear Linux for my home server, I truly appreciate all the time and effort invested in its development.
Now, on to the reason for my update. I noticed 40620 was released earlier, which I immediately downloaded and tested. The good news is that the installation loaded fine and completed without any hiccups. However, after the first boot, the system unfortunately stops responding within a minute and, if left long enough, shows messages similar to what K1ngfish3r experienced:
[image: J72t0Ij.png]
Note that in the photo, I did try to check the network information, but the same messages are eventually shown even if I do not sign in.
the next release is pending; you can go to it with
swupd update --format staging
Hi,
The system hangs before swupd has time to do anything, so I attempted the update procedure you suggested from the installation environment by mounting the volume and running "swupd update --format staging --path=/mnt --statedir=/mnt/var/lib/swupd". This seemed to have finished successfully, after which I rebooted, but unfortunately, I still had the same issue.
I also noticed that I'm still on 40620, so I'm not sure if this is correct or if the update procedure failed. My assumption was that the update procedure would install 40620's successor.
Just to confirm, I downloaded and used "clear-40620-live-server.iso" to perform a clean install yesterday evening.
Thanks
Should still be pending, https://cdn.download.clearlinux.org/releases/ doesn't seem to have any updates yet
clrlinux@clr-live~ $ sudo cryptsetup open /dev/nvme0n1p2 root
Enter passphrase for /dev/nvme0n1p2:
clrlinux@clr-live~ $ sudo mount /dev/mapper/root /mnt
clrlinux@clr-live~ $ swupd update --format=staging --path=/mnt
Error: This program must be run as root..aborting
clrlinux@clr-live~ $ sudo !!
sudo swupd update --format=staging --path=/mnt
Update started
Version on server (40620) is not newer than system version (40620)
Update complete - System already up-to-date at version 40620
@fenrus75 40620 is working for me--thank you! I tested it for a day and didn't run into the NIC hang issue under heavy load.
To work around this issue I initially downgraded to 40480 (with kernel 6.6.7) using sudo swupd repair -m 40480 --force
and that worked great for several days while a fix was pending. Then I tested your fix by upgrading from 40480 to 40620 as follows:
$ sudo swupd update
Update started
Preparing to update from 40480 to 40620
Downloading packs for:
- webkitgtk
- audio-pipewire
- dav1d-lib
- not-ffmpeg-lib
- desktop-gnomelibs
- gnome-base-libs
- LibRaw-lib
- NetworkManager
- aspell
- binutils
- bison
- btrfs-progs
- c-basic
- cloud-api
- cloud-control
- containers-basic
- curl
- dev-utils
- dnf
- docker-compose
- dpdk
- editors
- emacs
- fontconfig
- gnupg
- gstreamer
- harfbuzz-lib
- inotify-tools
- iptables
- kernel-native
- kvm-host
- lib-opengl
- lib-poppler
- libX11client
- libglib
- libssh-lib
- libstdcpp
- linux-firmware
- linux-firmware-extras
- linux-firmware-wifi
- linux-tools
- llvm
- lsof
- mail-utils
- minicom
- network-basic
- nfs-utils
- notmuch
- openblas
- openssh-client
- openssh-server
- openssl
- os-core
- os-core-plus
- os-core-update
- package-utils
- parallel
- perl-basic
- polkit
- pypi-cython
- pypi-numpy
- pypi-pynacl
- python3-basic
- qt-basic
- shells
- storage-utils
- stress-ng
- sysadmin-basic
- tzdata
- vim
- vte-lib
[100%]
Finishing packs extraction...
Statistics for going from version 40480 to version 40620:
changed bundles : 65
new bundles : 4
deleted bundles : 0
changed files : 4264
new files : 16891
deleted files : 2657
Validate downloaded files
[100%]
Starting download of remaining update content. This may take a while (9909 files)...
[100%]
Installing files...
[100%]
Update was applied
Calling post-update helper scripts
External command: none
External command: pacdiscovery.service: restarted (the binary was updated)
External command: tallow.service: restarted (the binary was updated)
External command: pacrunner.service: restarted (the binary was updated)
External command: systemd-journald.service: restarted (the binary was updated)
External command: systemd-resolved.service: restarted (the binary was updated)
External command: (Took 6 seconds)
External command: systemd-timesyncd.service: restarted (the binary was updated)
Update took 162.8 seconds, 994 MB transferred
9782 files were not in a pack
Update successful - System updated from version 40480 to version 40620
Replying to https://github.com/clearlinux/distribution/issues/3018#issuecomment-1879461144
Following @lhilden's report, I tested a fresh install of 40620, and it works. You should currently be on kernel 6.6.10 rather than the 6.6.9 your photo indicates.
i@clr~ $ uname -r
6.6.10-1398.native
i@clr~ $ swupd info
Distribution: Clear Linux OS
Installed version: 40620
Version URL: https://cdn.download.clearlinux.org/update
Content URL: https://cdn.download.clearlinux.org/update
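The same check can be scripted. A small sketch, assuming GNU `sort -V` is available; the function name is made up, and the 6.6.10 threshold is simply the kernel version reported in this comment, not an official statement of where the fix landed.

```shell
#!/bin/sh
# Sketch: succeed if the kernel version (default: the running kernel) is at
# least the wanted version. The "-1398.native" style suffix is ignored.
kernel_at_least() {
    want="$1"
    have="${2:-$(uname -r)}"
    have="${have%%-*}"                      # 6.6.10-1398.native -> 6.6.10
    lowest=$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}
```

For example, `kernel_at_least 6.6.10 && echo "kernel is 6.6.10 or newer"` checks the running kernel, and a second argument such as `6.6.9-1394.native` checks a version string copied from someone else's report.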
Hello,
Thanks for pointing that out; I'm not sure why that was the case.
Following the comment by lhilden, I installed an older version (40580) and then proceeded to upgrade to 40620, which resolved all my issues.
To confirm, for some reason, a clean installation of 40620 didn't work, and neither did a clean installation of 40600 with an upgrade to 40620.
Hello,
I have spent many hours today investigating this issue. I have a ClearLinux server on a mini PC. This PC has a Realtek rtl8168h Ethernet Interface.
Ever since what I assume was the last automatic system update I cannot run anything which puts a medium to high load on the network interface. CPU and local processing is all fine. No other changes have been made apart from automatic ones. I can do very minor network tasks, but the second you put load on it such as downloading a bundle or running a speed test it goes.
There are no related logs that I could find in the journal.
Is the info on it.
Running a speedtest, or any container that generates moderate to high network traffic, causes the vast majority of pings to and from the device to drop (sometimes it's 4-5 seconds per ping) for up to 5 minutes, depending on how long the attempt lasts. It is 100% repeatable. I have confirmed everything else on the network is fine. Pings to the gateway are just fine at the same time. SSH goes down of course, so I am having to use the console.
Perhaps an issue with the latest Stable 6.6.9-1394 Linux Kernel? Doing research on this issue finds nothing at all.
For reference the Ubuntu live USB is running: 6.2.0-26-generic
Driver r8169
Version 6.2.0-26-generic
Firmware version rtl8168h-2_0.0.2 02/26/15 (same)
lsmod | grep r8169
r8169 114688 0
I have followed the instructions here and various other repairs but no luck. https://github.com/clearlinux/clear-linux-documentation/blob/master/source/guides/maintenance/fix-broken-install.rst
Any help would be much appreciated.
Thanks,
Laurence