Looks like #6536 is related to this. The I/O errors they experienced look pretty similar to mine. Also, after rebooting several times, the board no longer boots and isn't accessible via SSH. My best guess is that the filesystem is corrupted.
I did try to use Radxa's kernel release, but dpkg threw an I/O error and terminated. Using the powersave governor doesn't appear to make a difference either.
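For reference, I switched the governor via the generic cpufreq sysfs interface (DietPi also exposes a governor setting in its config tool, if I remember right), roughly like this:
# switch all cores to the powersave governor (assumes the standard cpufreq sysfs interface)
for g in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
  echo powersave > "$g"
done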
I'll admit, I'm very far out of my league here, and talking about any of this makes me feel like I'm very high up on Mount Stupid. I've just done my best to collate what I've seen about this issue. I'd be happy to do any further troubleshooting, and I'll post part of the dmesg output from during an apt install. In the meantime I'll probably just go back to using an SD card, since that works just fine.
In this case the filesystem didn't enter a read-only state, but dpkg errored out due to I/O errors. Any attempt to repeat the installation fails the same way.
[ 209.090959] mmc1: running CQE recovery
[ 209.563136] mmc1: running CQE recovery
[ 210.812277] mmc1: running CQE recovery
[ 210.842707] mmc1: running CQE recovery
[ 210.847838] mmc1: running CQE recovery
[ 210.875998] mmc1: running CQE recovery
[ 210.883147] mmc1: running CQE recovery
[ 210.910576] mmc1: running CQE recovery
[ 210.922066] mmc1: running CQE recovery
[ 210.943918] mmc1: running CQE recovery
[ 210.948894] mmc1: running CQE recovery
[ 210.986280] mmc1: running CQE recovery
[ 212.699541] mmc1: running CQE recovery
[ 212.705554] mmc1: running CQE recovery
[ 212.712457] mmc1: running CQE recovery
[ 212.713717] blk_update_request: I/O error, dev mmcblk1, sector 1913856 op 0x1:(WRITE) flags 0x4000 phys_seg 9 prio class 0
[ 212.761081] EXT4-fs warning (device mmcblk1p1): ext4_end_bio:348: I/O error 10 writing to inode 3099 starting block 239616)
[ 212.761277] Buffer I/O error on device mmcblk1p1, logical block 233472
[ 212.761983] Buffer I/O error on device mmcblk1p1, logical block 233473
[ 212.762608] Buffer I/O error on device mmcblk1p1, logical block 233474
[ 212.763224] Buffer I/O error on device mmcblk1p1, logical block 233475
[ 212.763837] Buffer I/O error on device mmcblk1p1, logical block 233476
[ 212.764452] Buffer I/O error on device mmcblk1p1, logical block 233477
[ 212.765066] Buffer I/O error on device mmcblk1p1, logical block 233478
[ 212.765681] Buffer I/O error on device mmcblk1p1, logical block 233479
[ 212.766336] Buffer I/O error on device mmcblk1p1, logical block 233480
[ 212.766960] Buffer I/O error on device mmcblk1p1, logical block 233481
[ 213.572421] mmc1: running CQE recovery
[ 213.619948] mmc1: running CQE recovery
[ 213.730563] mmc1: running CQE recovery
[ 213.872340] mmc1: running CQE recovery
[ 213.879169] mmc1: running CQE recovery
[ 213.992361] mmc1: running CQE recovery
[ 214.187480] mmc1: running CQE recovery
[ 214.203443] mmc1: running CQE recovery
[ 214.345048] mmc1: running CQE recovery
[ 214.377660] mmc1: running CQE recovery
[ 214.385873] mmc1: running CQE recovery
[ 214.396785] mmc1: running CQE recovery
[ 214.397282] blk_update_request: I/O error, dev mmcblk1, sector 1987584 op 0x1:(WRITE) flags 0x4000 phys_seg 8 prio class 0
[ 214.406890] EXT4-fs warning (device mmcblk1p1): ext4_end_bio:348: I/O error 10 writing to inode 22492 starting block 248764)
[ 214.730775] mmc1: running CQE recovery
[ 214.803020] mmc1: running CQE recovery
[ 214.888657] mmc1: running CQE recovery
[ 215.811407] mmc1: running CQE recovery
[ 215.833115] mmc1: running CQE recovery
[ 215.839689] mmc1: running CQE recovery
[ 215.859596] mmc1: running CQE recovery
[ 215.872168] mmc1: running CQE recovery
[ 215.880326] mmc1: running CQE recovery
[ 215.886617] mmc1: running CQE recovery
[ 215.888014] blk_update_request: I/O error, dev mmcblk1, sector 580968 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
[ 215.889068] blk_update_request: I/O error, dev mmcblk1, sector 580968 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
[ 215.890136] Aborting journal on device mmcblk1p1-8.
[ 215.903457] mmc1: running CQE recovery
[ 215.915248] mmc1: running CQE recovery
[ 215.934097] mmc1: running CQE recovery
[ 215.943581] mmc1: running CQE recovery
[ 215.949267] mmc1: running CQE recovery
Many thanks for your report and investigation. Indeed, the symptoms look pretty similar to the other issue you linked, where an eMMC module was also used. I wonder why this issue appears only now, since the kernel has not received an update since February.
The problem with Radxa's kernel (fix) is that we use (Armbian's) mainline Linux builds. So we'd need a fix applied upstream. Probably this has already happened and we just need a recent build.
I've always wanted to test building kernel packages via Armbian's build tools. So far we have just used their APT repo, but for some strange reason they have not updated it since February, which is already a critical security problem due to missing kernel CVE fixes. I'll try that now and keep you updated with a new kernel package. If you cannot install it, I'll generate a new image to flash with the new kernel.
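For reference, the build framework invocation would look roughly like this; the rockpi-4b board name and the exact compile.sh arguments are from memory and vary between framework versions, so treat it as a sketch rather than the exact commands I'll use:
# rough sketch of an Armbian build-framework run for the ROCK Pi 4
# (the flags for a kernel-only build differ between framework versions)
git clone --depth=1 https://github.com/armbian/build
cd build
./compile.sh BOARD=rockpi-4b BRANCH=current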
Test these:
cd /tmp
curl -fO https://dietpi.com/downloads/binaries/armbian-firmware.deb
curl -fO https://dietpi.com/downloads/binaries/linux-dtb-current-rockchip64.deb
curl -fO https://dietpi.com/downloads/binaries/linux-image-current-rockchip64.deb
apt install ./armbian-firmware.deb ./linux-*
rm ./armbian-firmware.deb ./linux-*
reboot
Looks like we might need to flash a fresh image with the new kernel. I just tried twice and dpkg throws an error, even when using the powersave governor to slow things down. Attempting the install again afterwards leads to dpkg terminating in the same manner.
root@DietPi:/tmp# apt install ./armbian-firmware.deb ./linux-*
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Note, selecting 'armbian-firmware' instead of './armbian-firmware.deb'
Note, selecting 'linux-dtb-current-rockchip64' instead of './linux-dtb-current-rockchip64.deb'
Note, selecting 'linux-image-current-rockchip64' instead of './linux-image-current-rockchip64.deb'
The following packages will be upgraded:
linux-dtb-current-rockchip64 linux-image-current-rockchip64
The following packages will be DOWNGRADED:
armbian-firmware
2 upgraded, 0 newly installed, 1 downgraded, 0 to remove and 0 not upgraded.
Need to get 0 B/56.1 MB of archives.
After this operation, 122 MB of additional disk space will be used.
Do you want to continue? [Y/n] y
Get:1 /tmp/armbian-firmware.deb armbian-firmware all 23.08.0-trunk [14.1 MB]
Get:2 /tmp/linux-dtb-current-rockchip64.deb linux-dtb-current-rockchip64 arm64 23.08.0-trunk [316 kB]
Get:3 /tmp/linux-image-current-rockchip64.deb linux-image-current-rockchip64 arm64 23.08.0-trunk [41.7 MB]
debconf: delaying package configuration, since apt-utils is not installed
dpkg: warning: downgrading armbian-firmware from 23.08.0-trunk--1-SA1251-B226d to 23.08.0-trunk
(Reading database ... 17596 files and directories currently installed.)
Preparing to unpack /tmp/armbian-firmware.deb ...
Unpacking armbian-firmware (23.08.0-trunk) over (23.08.0-trunk--1-SA1251-B226d) ...
Preparing to unpack .../linux-dtb-current-rockchip64.deb ...
Armbian 'linux-dtb-current-rockchip64' for '6.1.46-current-rockchip64': 'preinst' starting.
Armbian 'linux-dtb-current-rockchip64' for '6.1.46-current-rockchip64': 'preinst' finishing.
Unpacking linux-dtb-current-rockchip64 (23.08.0-trunk) over (23.02.2) ...
Preparing to unpack .../linux-image-current-rockchip64.deb ...
Armbian 'linux-image-current-rockchip64' for '6.1.46-current-rockchip64': 'preinst' starting.
Armbian 'linux-image-current-rockchip64' for '6.1.46-current-rockchip64': 'preinst' finishing.
Unpacking linux-image-current-rockchip64 (23.08.0-trunk) over (23.02.2) ...
dpkg: unrecoverable fatal error, aborting:
unable to fsync updated status of 'linux-image-current-rockchip64': Input/output error
E: Sub-process /usr/bin/dpkg returned an error code (2)
A new image is available here: https://dietpi.com/downloads/images/testing/ You can also wait until tomorrow, when I'll rebuild all images based on DietPi v8.21.
Ah, no joy. Looks like I've still got issues. It booted fine and I was able to both install OpenSSH and run a 1000 MiB benchmark on the eMMC, but as soon as I tried to install more software (I think it was Avahi and the Java JRE; the program selection is probably unrelated), the filesystem entered read-only mode during the install and there were a bunch of Buffer I/O errors like before.
I'm not sure if my eMMC is defective, but I'm able to write images to it at full speed (~35-40 MB/s) when it's connected to a Windows PC, so I'm doubtful. Perhaps my ROCK Pi 4 is defective? But there's no way to really prove that without another one. Feels a bit like a dead end here, unfortunately.
EDIT: Uploaded bug report ref code: fbc8a1d5-9624-4e28-86de-16084c3e9542
This was in the forum post, but I'll note it here too. There appears to be some kind of kernel stack trace in dmesg from the cqhci-core driver (I think?). From a quick Google search, cqhci appears to be the interface used for queueing commands to the eMMC.
I think the trace happened during the initial setup after SSHing in. To my knowledge there's no way to really see at what point during setup it happened.
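For what it's worth, whether the kernel actually brought up the command queue on the eMMC should be visible in the boot log; I'm going by the usual "Command Queue Engine enabled" message, so take the filter below as a guess:
# check whether the eMMC host enabled the Command Queue Engine, and catch cqhci warnings
dmesg | grep -iE 'command queue|cqhci'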
[ 284.460966] ------------[ cut here ]------------
[ 284.460989] mmc1: cqhci: spurious TCN for tag 11
[ 284.461147] WARNING: CPU: 0 PID: 275 at drivers/mmc/host/cqhci-core.c:786 cqhci_irq+0x408/0x67c
[ 284.461202] Modules linked in: snd_soc_hdmi_codec snd_soc_spdif_tx snd_soc_audio_graph_card snd_soc_simple_card snd_soc_simple_card_utils dw_hdmi_cec snd_soc_es8316 snd_soc_rockchip_i2s panfrost dw_hdmi_i2s_audio gpu_sched btsdio drm_shmem_helper bluetooth rockchip_rng rng_core snd_soc_core brcmfmac hantro_vpu rockchip_vdec(C) brcmutil v4l2_vp9 snd_compress videobuf2_dma_contig snd_pcm_dmaengine rockchip_rga v4l2_h264 v4l2_mem2mem cfg80211 videobuf2_dma_sg snd_pcm videobuf2_memops videobuf2_v4l2 snd_timer rfkill videobuf2_common snd videodev soundcore mc cpufreq_dt dm_mod ip_tables x_tables autofs4 realtek dwmac_rk stmmac_platform stmmac pcs_xpcs
[ 284.461717] CPU: 0 PID: 275 Comm: kworker/0:1H Tainted: G C 6.1.46-current-rockchip64 #3
[ 284.461743] Hardware name: Radxa ROCK Pi 4B (DT)
[ 284.461760] Workqueue: kblockd blk_mq_run_work_fn
[ 284.461805] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 284.461831] pc : cqhci_irq+0x408/0x67c
[ 284.461861] lr : cqhci_irq+0x408/0x67c
[ 284.461887] sp : ffff800008003da0
[ 284.461900] x29: ffff800008003da0 x28: 0000000000004000 x27: ffff000000ce2480
[ 284.461946] x26: ffff0000015f0580 x25: ffff000000ce2498 x24: ffff800009571c28
[ 284.461991] x23: ffff800009c3d8f0 x22: 0000000000000002 x21: ffff000000ce2480
[ 284.462035] x20: ffff0000015f0000 x19: 000000000000000b x18: ffffffffffffffff
[ 284.462078] x17: ffff80007636e000 x16: ffff800008000000 x15: ffff800009c84066
[ 284.462122] x14: 0000000000000000 x13: 0000000000000239 x12: 00000000ffffffea
[ 284.462165] x11: 00000000ffffefff x10: 00000000ffffefff x9 : ffff800009ab8908
[ 284.462208] x8 : 0000000000017fe8 x7 : c0000000ffffefff x6 : 0000000000057fa8
[ 284.462251] x5 : 0000000000000fff x4 : 0000000000000000 x3 : 0000000000000027
[ 284.462293] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000040f2d00
[ 284.462337] Call trace:
[ 284.462349] cqhci_irq+0x408/0x67c
[ 284.462379] sdhci_arasan_cqhci_irq+0x54/0x84
[ 284.462408] sdhci_irq+0xc4/0x1030
[ 284.462429] __handle_irq_event_percpu+0x58/0x240
[ 284.462466] handle_irq_event+0x4c/0xc0
[ 284.462497] handle_fasteoi_irq+0xc0/0x220
[ 284.462524] generic_handle_domain_irq+0x2c/0x44
[ 284.462556] gic_handle_irq+0x50/0x130
[ 284.462586] call_on_irq_stack+0x24/0x4c
[ 284.462615] do_interrupt_handler+0xd4/0xe0
[ 284.462644] el1_interrupt+0x34/0x6c
[ 284.462671] el1h_64_irq_handler+0x18/0x2c
[ 284.462698] el1h_64_irq+0x64/0x68
[ 284.462720] preempt_count_sub+0xc/0xd4
[ 284.462743] sdhci_cqe_enable+0x13c/0x2a0
[ 284.462774] sdhci_arasan_cqe_enable+0xa4/0xf0
[ 284.462803] cqhci_request+0xd4/0x690
[ 284.462830] mmc_cqe_start_req+0xb4/0x1b0
[ 284.462861] mmc_blk_mq_issue_rq+0x448/0x880
[ 284.462888] mmc_mq_queue_rq+0x118/0x2ac
[ 284.462914] blk_mq_dispatch_rq_list+0x1a4/0x8a0
[ 284.462939] __blk_mq_sched_dispatch_requests+0xb4/0x160
[ 284.462967] blk_mq_sched_dispatch_requests+0x3c/0x80
[ 284.462994] __blk_mq_run_hw_queue+0x80/0xc0
[ 284.463026] blk_mq_run_work_fn+0x20/0x30
[ 284.463056] process_one_work+0x1fc/0x480
[ 284.463079] worker_thread+0x14c/0x444
[ 284.463100] kthread+0xf8/0xfc
[ 284.463129] ret_from_fork+0x10/0x20
[ 284.463156] ---[ end trace 0000000000000000 ]---
I'm also not sure how to know/prove whether it's the ROCK Pi 4 or the eMMC module. You could run a bad block check, e.g. via Rufus from Windows, but that also does not always reveal actual issues. In theory, even though you can write fine from Windows, it could still be the eMMC, with just a certain block/area being broken that only gets written to after installing some software. Although, did the stack trace happen earlier? dmesg -T shows absolute timestamps that can be compared with the setup steps. journalctl also contains kernel errors and at least service start/finish logs.
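Something along these lines, for example (the filter terms are just a suggestion, and the previous-boot query only works if the journal is persistent):
# kernel log with wall-clock timestamps, filtered for MMC/EXT4 trouble
dmesg -T | grep -Ei 'mmc|cqhci|ext4|i/o error'
# kernel messages from the previous boot, in case the fs went read-only before a reboot
journalctl -k -b -1 --no-pager | grep -Ei 'mmc|cqhci|ext4'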
You could probably test one of Radxa's images and see whether you face the same issue when installing larger APT packages. If so, it might at least be worth contacting Radxa support.
Apologies for the delay. Got a little busy.
I ran a block check using Rufus (2 passes) and it didn't show any errors. Then I flashed the latest DietPi image and booted it. It first did a dist-upgrade (I think) and had to reboot. After SSHing in again I went through a minimal installation. There don't appear to be any issues thus far. I'm going to install OpenSSH (my test case so far) and see what happens.
dmesg -T output:
So far no kernel errors, let's hope it stays like that.
It just... works?
I installed OpenSSH, Avahi-Daemon and the Java JRE. There was not a single hiccup, nor any new lines in dmesg. I also ran a 2000 MiB test using the built-in benchmark in the drive manager and got a write speed of 52 MiB/s and a read speed of 84 MiB/s. Still no new lines in dmesg regarding mmc.
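In case anyone wants to reproduce the test without the drive manager, a rough manual equivalent (direct I/O, assuming the root filesystem is on the eMMC and roughly 1 GiB is free):
# sequential write then read of a 1000 MiB test file, bypassing the page cache
dd if=/dev/zero of=/root/emmc_test bs=1M count=1000 oflag=direct status=progress
dd if=/root/emmc_test of=/dev/null bs=1M iflag=direct status=progress
rm /root/emmc_test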
I don't know if it was something you did or an update upstream, but it works now.
There was another kernel upgrade to v6.1.50, which could theoretically have fixed a bug in the MMC driver, similar to the Radxa commit. Or it was just some quirk in the filesystem, solved by flashing the image cleanly. Please keep an eye on it for a while.
My ROCK Pi 4 has had no issues for over a year now. It's been running a Tailscale client and a WordPress website with basically no downtime.
I'll be closing this issue then. Thanks for looking into this :)
Required Information
Steps to reproduce
Rebooting lifts the read-only state.
Expected behaviour
The system should run as normal, with behaviour similar to running from an SD card or other storage media.
Actual behaviour
The system enters a read-only state, essentially soft-locking it until reboot. dmesg often (but not always) shows this after some write operations:
mmc1: running CQE recovery
There is usually one reported every 3-10 seconds during an apt install. Also, if write operations are intensive enough, a stack trace from what looks like the mmc driver is reported in dmesg (see the Radxa forum post at the bottom). Due to this, it's close to impossible to install anything on the system, as apt will instantly lock up due to the read-only state.
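A quick way to confirm the read-only state once it hits (generic commands, not DietPi-specific):
# 'ro' shows up in the mount options of / once ext4 aborts the journal
findmnt -no OPTIONS /
# remounting read-write usually fails until the device is checked or rebooted
mount -o remount,rw /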
Extra details
This issue appeared on the Radxa forums in January, and in June a workaround was added to Radxa's distributed Linux images. The workaround appears to reduce the speed of the eMMC controller from HS400-ES (Enhanced Strobe) to HS200. I haven't tested the workaround myself using their images, but I assume it works, since nothing has been posted about this issue since the workaround was released.
Forum post: https://forum.radxa.com/t/rock4-se-emmc-input-output-errors-read-only-fs/14406
Pull request on the Radxa kernel for the workaround: https://github.com/radxa/kernel/pull/127
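For anyone wanting to check which mode their kernel actually negotiated (HS200 vs. HS400/HS400-ES), the mmc debugfs entry reports it; this assumes debugfs is mounted and the eMMC is mmc1, as in the logs above:
# the "timing spec" line reports e.g. "mmc HS200" or "mmc HS400 enhanced strobe"
cat /sys/kernel/debug/mmc1/ios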