Joshua-Riek / ubuntu-rockchip

Ubuntu for Rockchip RK35XX Devices
https://joshua-riek.github.io/ubuntu-rockchip-download/
GNU General Public License v3.0

Orange Pi 5 & 5 Plus Ethernet speed cuts down to ~10Mbit/s after few days of uptime #402

Closed artem-zinnatullin closed 1 year ago

artem-zinnatullin commented 1 year ago

I run Ubuntu-Rockchip v1.27 (and have run all previous versions since March). On both my Orange Pi 5 and Orange Pi 5 Plus, Ethernet speed drops from the full 1 Gbit/s to ~10 Mbit/s after a few days of uptime.

A reboot fixes it for a few days, and then it happens again.


Before reboot:

$ speedtest-cli
Testing download speed................................................................................
Download: 11.83 Mbit/s
Testing upload speed......................................................................................................
Upload: 13.46 Mbit/s
$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
inet xx.xx.xx.xx  netmask 255.255.255.0  broadcast 
inet6 yy::yy:yy:yy:yy  prefixlen 64  scopeid 0x20<link>
ether zz:zz:zz:zz:zz:zz  txqueuelen 1000  (Ethernet)
RX packets 529419918  bytes 594250458074 (594.2 GB)
RX errors 0  dropped 799  overruns 0  frame 0
TX packets 211201142  bytes 17726545183 (17.7 GB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
device interrupt 99
$ uptime
17:31:22 up 2 days, 23:18,  2 users,  load average: 7.13, 6.58, 6.39

A few minutes after reboot:

$ speedtest-cli
Testing download speed................................................................................
Download: 317.80 Mbit/s
Testing upload speed......................................................................................................
Upload: 95.35 Mbit/s
$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
inet xx.xx.xx.xx  netmask 255.255.255.0  broadcast
inet6 yy::yy:yy:yy:yy  prefixlen 64  scopeid 0x20<link>
ether zz:zz:zz:zz:zz:zz  txqueuelen 1000  (Ethernet)
RX packets 1177254  bytes 1418920724 (1.4 GB)
RX errors 0  dropped 220  overruns 0  frame 0
TX packets 439466  bytes 293043005 (293.0 MB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
device interrupt 99


Happy to provide more data and run more tests. (I previously checked scp between local machines and it was equally bad, so it's not an internet connection issue, but I forgot to repeat that before this reboot; I can include it in a few days.)
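A minimal sketch of one way to gather such data without relying on an internet speed test: read the negotiated link speed and the byte counters that Linux exposes under /sys/class/net. The function names, the `eth0` default, and the 5-second interval are illustrative assumptions, not something used in this report.

```python
import time
from pathlib import Path

# Standard Linux sysfs location for network interfaces.
SYSFS_NET = Path("/sys/class/net")


def read_link_speed(iface, root=SYSFS_NET):
    """Return the negotiated link speed in Mbit/s, or None if unavailable."""
    try:
        return int((Path(root) / iface / "speed").read_text().strip())
    except (OSError, ValueError):
        # The file is missing or unreadable (e.g. link down, no such iface).
        return None


def read_counter(iface, name, root=SYSFS_NET):
    """Read one cumulative counter such as rx_bytes or tx_bytes."""
    return int((Path(root) / iface / "statistics" / name).read_text().strip())


def throughput_mbit(bytes_before, bytes_after, seconds):
    """Convert a byte-counter delta over `seconds` into Mbit/s."""
    return (bytes_after - bytes_before) * 8 / seconds / 1_000_000


def sample(iface="eth0", interval=5.0, root=SYSFS_NET):
    """Measure RX/TX throughput over `interval` seconds plus the link speed."""
    rx0 = read_counter(iface, "rx_bytes", root)
    tx0 = read_counter(iface, "tx_bytes", root)
    time.sleep(interval)
    rx1 = read_counter(iface, "rx_bytes", root)
    tx1 = read_counter(iface, "tx_bytes", root)
    return {
        "link_speed_mbit": read_link_speed(iface, root),
        "rx_mbit": throughput_mbit(rx0, rx1, interval),
        "tx_mbit": throughput_mbit(tx0, tx1, interval),
    }
```

Calling `sample("eth0")` while a large local transfer is running gives a throughput figure to compare with the reported link speed. If the link speed itself reads 10, the link has renegotiated; if it still reads 1000 while throughput is low, the cause is more likely elsewhere.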

Joshua-Riek commented 1 year ago

Interesting, thanks for the detailed info. Would you happen to have the dmesg logs from the system when this issue happens?

Joshua-Riek commented 1 year ago

Also, I do have some mainline Linux images for the Orange Pi 5 series that might work better for server use; the only things not working are the GPU, HDMI, and NPU.

https://github.com/Joshua-Riek/ubuntu-rockchip/actions/workflows/build-mainline.yml

ewaldc commented 1 year ago

I have an Orange Pi 5 Plus, but I am not seeing the issue after 5 days of uptime (had to cut the power to install a heat pump and get off gas). I am using the Orange Pi 5+ as a NAS device, and did notice some slowdown after a few weeks, but more like a 50% slowdown (Samba read 100MB/s -> 50MB/s). I have done an "apt update/upgrade" today and will test/report how the speed evolves on my system.
PS. There are some differences between the RTL8125 driver published by Realtek and the one published in the Orange Pi kernel repo, but it has been 2 months since I looked at this.

ewaldc commented 1 year ago

After 3 weeks of uptime: no slowdown measured using speedtest-cli --secure. Also no deterioration in Samba read or write performance (1GB file write: 119MB/s, 1GB file read: 120MB/s, switch: TP-Link 8x1Gbit). I did notice, though, that dmesg is flooded with countless lines (hundreds of MB worth) of:

[414169.170382] r8125 0004:41:00.0 enP4p65s0: rss get rxnfc
[414292.005136] r8125 0004:41:00.0 enP4p65s0: rss get rxnfc
[414414.765718] r8125 0004:41:00.0 enP4p65s0: rss get rxnfc
[414540.182620] r8125 0004:41:00.0 enP4p65s0: rss get rxnfc
[414660.271381] r8125 0004:41:00.0 enP4p65s0: rss get rxnfc
[414780.875479] r8125 0004:41:00.0 enP4p65s0: rss get rxnfc
[414903.999129] r8125 0004:41:00.0 enP4p65s0: rss get rxnfc

Looking at the source code, it seems related to RSS. cat /proc/interrupts shows there are 32 irq handlers registered per LAN port. Possibly the RSS function is trying to balance or relocate the irq handlers. The message itself is created by:

netif_info(tp, drv, tp->dev, "rss get rxnfc\n");

in r8125_rss.c. This line should be deleted or changed to netif_dbg...

Based on your configuration, I would guess your OS is on the NVMe drive(s), but if it happens to be on an SD card or another medium with slower write speeds, then the massive logging to a slower medium (or a full file system?) could possibly explain the slowdown.
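A rough sketch of how the volume of a repeated kernel message could be quantified, assuming dmesg lines in the default `[seconds.micros] message` format shown above. The helper names are illustrative; the sample data is the excerpt quoted in this comment.

```python
import re

# Matches the default dmesg prefix: "[414169.170382] message text"
LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")


def parse_dmesg(lines):
    """Yield (timestamp_seconds, message) for each line that matches."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            yield float(match.group("ts")), match.group("msg")


def summarize(lines, top=5):
    """Return (message, count, mean_gap_seconds) for the most frequent messages."""
    stamps = {}
    for ts, msg in parse_dmesg(lines):
        stamps.setdefault(msg, []).append(ts)
    ranked = sorted(stamps.items(), key=lambda kv: len(kv[1]), reverse=True)
    report = []
    for msg, ts_list in ranked[:top]:
        gaps = [b - a for a, b in zip(ts_list, ts_list[1:])]
        mean_gap = sum(gaps) / len(gaps) if gaps else None
        report.append((msg, len(ts_list), mean_gap))
    return report


SAMPLE = """\
[414169.170382] r8125 0004:41:00.0 enP4p65s0: rss get rxnfc
[414292.005136] r8125 0004:41:00.0 enP4p65s0: rss get rxnfc
[414414.765718] r8125 0004:41:00.0 enP4p65s0: rss get rxnfc
[414540.182620] r8125 0004:41:00.0 enP4p65s0: rss get rxnfc
[414660.271381] r8125 0004:41:00.0 enP4p65s0: rss get rxnfc
[414780.875479] r8125 0004:41:00.0 enP4p65s0: rss get rxnfc
[414903.999129] r8125 0004:41:00.0 enP4p65s0: rss get rxnfc
""".splitlines()

for message, count, mean_gap in summarize(SAMPLE):
    print(count, mean_gap, message)
```

In the excerpt above the message recurs roughly every two minutes; running the same summary over a full log would show whether the rate is higher elsewhere and which other messages dominate.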

Joshua-Riek commented 1 year ago

Thanks for the detailed observation, I can make a quick kernel patch so the kernel log will not be spammed with this message.

ewaldc commented 1 year ago

That would be great! As a regular kernel programmer, I would be happy to help and contribute PRs, but I am not familiar enough with where to find all the bits and pieces of your Linux build (e.g. where you pull the kernel from) to be able to do so. A few months ago, I looked into the latest published Realtek driver code (9.011.01) and found a few issues as well as things that could be improved for better performance (e.g. page reuse), but I have not had the time lately to properly test my changes.

PS. If you need help with some (code) issue or to test something out (on the Orange Pi5+), feel free to reach out.

shvetsnikita commented 1 year ago

@ewaldc

where you pull the kernel from

I guess from here: https://github.com/Joshua-Riek/linux-rockchip

Joshua-Riek commented 1 year ago

Yeah, the thing about the kernel is it's a hacked Android kernel, so there are so many bugs.

artem-zinnatullin commented 1 year ago

Hi folks, I'm dedicating my 2nd orangepi5 to debug this issue.

I've been running fine with a network stack restart in cron for about a month:

0 9 * * * /etc/init.d/networking restart

The issue is that I see reports of GPU and NPU not working in recent releases and software I use relies on both GPU and NPU (video processing & object recognition) so it'll be hard to upgrade for me to get this log patch tested: https://github.com/armbian/linux-rockchip/pull/114

But otherwise I'm quite confident I'll get the issue reproduced within a few days due to the amount of traffic my Orange Pis handle from 4k & 2k cameras 24/7.

Joshua-Riek commented 1 year ago

You don't need to worry about this; the GPU and NPU not working is related to mainline Linux (6.6.x), not the Rockchip Linux 5.10.160.

Joshua-Riek commented 1 year ago

I will be closing this as it cannot be reproduced.

artem-zinnatullin commented 8 months ago

Hi @Joshua-Riek, apologies for the late addition to the report. The issue still occurs regularly, as originally reported, on both my Orange Pi 5 and Orange Pi 5 Plus boards.

I've configured the Speedtest.net integration on HomeAssistant (runs on the Orange Pi 5) and it now collects data regularly. Here is a graph where speed drops from my upstream internet limit of ~400Mbit/s (download) and ~100Mbit/s (upload) down to 14Mbit/s download and 14Mbit/s upload within 3 days of uptime, and only a reboot brings it back.

I'm on your latest available kernel (apt-get update && apt-get upgrade regularly):

uname -a
Linux orangepi5n1 5.10.160-rockchip #31 SMP Mon Feb 12 15:49:56 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

What logs should I try to collect when the Ethernet speed drops?

ewaldc commented 8 months ago

@artem-zinnatullin, I have set up a similar monitor (a timed daily 1GB file transfer over Samba to/from Windows) and I am now also seeing the issue. In addition, on my systems a reboot no longer solves the issue! Does a reboot restore the performance for you? Last fall, I could only reproduce a 50% drop in performance over ~3 weeks, and a reboot brought things back to 100MB/s (limited by my 1Gb switch). It seems things have gotten worse since last fall... For tools, I use 'ethtool -S enP4p65s0' (Ethernet card and driver stats) and 'journalctl -b 1' (kernel, syslog etc.), but I could not see anything obvious like dropped packets, CRC errors or collisions. The kernel/sys logs contain plenty of errors/warnings/failures and, as Joshua mentioned, indicate a buggy kernel, but nothing seems to stand out. Apart from regular updates, the only thing I changed was adding Wi-Fi cards. I will take one out to see if that makes a difference.
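When no single counter stands out, comparing two `ethtool -S` snapshots taken some time apart can make slow-moving counters easier to spot. The following is a minimal sketch assuming the usual `name: value` layout of that output; the counter names in the example are hypothetical placeholders, not the actual r8125 statistics.

```python
def parse_ethtool_stats(text):
    """Parse `ethtool -S <iface>` output into a {counter_name: int} dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        if not sep:
            continue  # no colon on this line
        try:
            stats[name.strip()] = int(value.strip())
        except ValueError:
            continue  # e.g. the "NIC statistics:" header has no value
    return stats


def changed_counters(before, after):
    """Return {name: delta} for counters present in both snapshots that changed."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


# Hypothetical snapshots; real counter names depend on the driver.
BEFORE = """NIC statistics:
     rx_packets: 100
     rx_missed: 0
     tx_errors: 0
"""
AFTER = """NIC statistics:
     rx_packets: 250
     rx_missed: 3
     tx_errors: 0
"""

print(changed_counters(parse_ethtool_stats(BEFORE), parse_ethtool_stats(AFTER)))
```

Saving one snapshot shortly after boot and another once throughput has degraded, then diffing them this way, narrows attention to the counters that actually moved.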

Joshua-Riek commented 8 months ago

I'm working on a new 6.1 kernel and I recently backported the r8125 driver to it. Maybe I can also update the driver in 5.10.

Please note the 6.1 kernel is intended to be used with the upcoming Ubuntu 24.04 release in April.

ewaldc commented 8 months ago

@Joshua-Riek, wonderful. Let me know if I can do anything to help. One thing to add: while I am noticing a ~10x drop in network throughput versus the fall of last year, I am still getting 10 to 11MB/s, compared to ~100MB/s before. That is still ~10x better than what @artem-zinnatullin reports. It makes me think there is something more involved than just the r8125 driver.

artem-zinnatullin commented 8 months ago

@Joshua-Riek happy to upgrade to 6.x kernel as soon as it runs with GPU and NPU support and/or provide more logs and data from 5.x 👍 Can we please get this issue reopened since @ewaldc is now consistently observing it too?

@ewaldc a reboot every 2-3 days is the only thing that restores the performance in my case. It's also interesting that on my graph you can see performance gradually going down over time. Perhaps some buffer starts accumulating dead objects and network packets stop being buffered, hence the throughput drop?

In my case I'm running 24/7 full-resolution video feed analysis from twelve 4k & 2k cameras on two Orange Pis, with Frigate doing object recognition on the NPU and ffmpeg decoding the H.265 and H.264 streams on the GPU. Both Orange Pis have NVMe drives, no microSDs. I also use VLANs (hopefully that's not relevant).

Joshua-Riek commented 8 months ago

The 6.1 kernel is ready to be released in a beta state. But it does require a more recent version of mpp and ffmpeg which I do have working properly. My original intention was to release the 6.1 kernel with Ubuntu 24.04, so any introduced issues by the new kernel would be specific to the new Ubuntu version. But I may release a kernel package for 6.1 so users on Ubuntu 22.04 can upgrade the kernel at their own risk. However, my attention is focused on a few regression issues that are unrelated to the 6.1 kernel.

ewaldc commented 8 months ago

@Joshua-Riek, IMHO, a possible value of releasing the 6.1 kernel (without any support, of course) on 22.04 would be to test it ahead of Ubuntu 24.04. It could also help to provide a baseline for comparing issues/behaviors between 24.04 and 22.04, since both systems would be running the same kernel (making it easier to determine whether an issue is kernel or OS related).

Joshua-Riek commented 8 months ago

Yeah, I've been doing just that. Systemd was updated recently and it broke the bootstrapping process of creating new Ubuntu 24.04 images, so I'm stuck on Ubuntu 22.04 for the moment.