MichaIng / DietPi

Lightweight justice for your single-board computer!
https://dietpi.com/
GNU General Public License v2.0

Odroid HC4 | CPU stall/kernel oops on boot #6278

Closed (etlfg closed this issue 10 months ago)

etlfg commented 1 year ago

Creating a bug report/issue

Required Information

Additional Information (if applicable)

N/A

Steps to reproduce

  1. Download archive from the official DietPi website
  2. Extract it
  3. Flash it via dd or balenaEtcher (example for dd): sudo dd if=DietPi_OdroidC4-ARMv8-Bullseye_1.img of=/dev/disk/by-id/usb-Generic-_SD_MMC_MS_PRO_20121112761000000-0:0 bs=1M conv=fsync
  4. Plug the microSD card into the Odroid HC4 SBC
  5. Power it on

Expected behaviour

Actual behaviour

Thanks to an HDMI cable and an external screen, I can read the logs and see that all goes well until the swap file creation step. The process is stuck there (I left it for a day and nothing happened).

I attached a photo of what the screen looks like: IMG_20230319_190327

Furthermore, even though the SSH server seems to be running, I can't access it, as no IP address shows up on the network (checked my router admin page and scanned with nmap).

Extra details

Maybe related to #5782 (problem with my SD card?) or #6249 (because of the latest images?)

Thanks for the help.

MichaIng commented 1 year ago

It either hangs or crashes when editing /etc/fstab or doing a systemctl daemon-reload. I added some error handling to this part of the script: https://github.com/MichaIng/DietPi/commit/ee4d999 EDIT: Enhanced the code line and fixed an accident from a few days ago: https://github.com/MichaIng/DietPi/commit/05b1cd3

If you have a chance, you could apply this change to the flashed image before booting, so we see the exact command it fails at.

The network setup step has not been reached yet, so even though Dropbear was started, the network is still down.

But it is also possible that it just crashed, e.g. due to some voltage/load issue. Have you tried a different PSU or power cable? And did you successfully test other distros already?

ctrlbreak- commented 1 year ago

FYI - I'm encountering this same issue with a brand new HC4. I'll see if I can figure out what you mean above and provide the feedback you're looking for.

EDIT: Just checked /boot/dietpi/func/dietpi-set_swapfile on the image I just downloaded ...

(FILE: DietPi_OdroidC4-ARMv8-Bullseye.img DATE: Thu Apr 13 17:24:12 UTC 2023)

... and both of those changes appear to be in place already.

ctrlbreak- commented 1 year ago

I've also pulled the current Bookworm release and it also crashed on a fresh flash...

FILE: DietPi_OdroidC4-ARMv8-Bookworm.img DATE: Thu Apr 13 17:24:02 UTC 2023

image

MichaIng commented 1 year ago

Hmm, probably not a coincidence that the crash happens while the schedutil CPU governor is applied. Modern Linux should not have any issue with that, but it looks like this kernel+hardware combination does. Can you try changing it in dietpi.txt from schedutil to ondemand?
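For anyone who wants to script that change on a mounted, not-yet-booted SD card, here is a minimal sketch. It assumes the governor is stored in dietpi.txt as a `CONFIG_CPU_GOVERNOR=<name>` line; the function name and the mount point in the usage comment are hypothetical.

```shell
# Sketch: switch the CPU governor in a dietpi.txt file before first boot.
# Assumption: the setting is a line of the form CONFIG_CPU_GOVERNOR=<name>.
set_cpu_governor() {
    # $1 = path to dietpi.txt, $2 = governor name (e.g. ondemand)
    # Refuse to continue if the key is missing, so a wrong path or key is noticed.
    grep -q '^CONFIG_CPU_GOVERNOR=' "$1" || return 1
    # Write to a temporary file first, then replace the original.
    sed "s/^CONFIG_CPU_GOVERNOR=.*/CONFIG_CPU_GOVERNOR=$2/" "$1" > "$1.tmp" &&
        mv "$1.tmp" "$1"
}

# Example (hypothetical mount point of the SD card's root filesystem):
# set_cpu_governor /mnt/sdcard/boot/dietpi.txt ondemand
```

The function returns non-zero without touching the file when the key is not found, which avoids silently booting with the old governor.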

ctrlbreak- commented 1 year ago

I've tried changing the governor on both the Bookworm and Bullseye builds. No luck. On Bullseye, with a fresh flash, and the governor set to 'ondemand' before first boot, it locks up at the following (screenshot), and the 'flashing blue LED heartbeat' stops too.

For a sanity check, I've pulled down ODROID's minimal Ubuntu build (https://east.us.odroid.in/ubuntu_22.04lts/C4_HC4/ubuntu-22.04-4.9-minimal-odroid-c4-hc4-20220705.img.xz) and have been running it for most of the day. All the same hardware... trying out some bcache and zfs stacks on this platform.

I'd obviously prefer to use DietPi though :-S

image

MichaIng commented 1 year ago

But it survives longer now, the CPU governor has been applied much earlier.

I do now recognise that in your second-to-last log, DietPi-PreBoot runs over 100 seconds after the kernel has been loaded. Does it always take that long? That point should be reached in 5 seconds or less.

Could you attach a keyboard, keep hitting random keys for some seconds, and see whether it then continues? If there is an issue with the entropy daemon, it is possible that boot hangs when something tries to read from /dev/random. Hitting the keyboard creates randomness and fills the pool.
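If a shell is reachable at that point, the kernel's entropy estimate can also be read directly from procfs. A small sketch follows; the function names and the threshold of 128 are arbitrary illustrations, and the optional file argument exists only so the helper can be tried against a test file.

```shell
# Sketch: read the kernel's entropy estimate.
# /proc/sys/kernel/random/entropy_avail is the standard procfs location;
# an alternative file can be passed as $1 for testing.
entropy_avail() {
    cat "${1:-/proc/sys/kernel/random/entropy_avail}"
}

# Succeeds (exit status 0) when the estimate is below the threshold in $2.
# The default threshold of 128 is an arbitrary example value.
entropy_is_low() {
    [ "$(entropy_avail "$1")" -lt "${2:-128}" ]
}

# Example:
# entropy_is_low && echo "entropy pool looks starved"
```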

ctrlbreak- commented 1 year ago

Have a bit more info now. Finally able to revisit.

No longer think it's the governor personally, but just guessing. I managed to locate a serial adaptor and captured the console of a complete boot attempt with the latest Bullseye image (and the governor set to 'ondemand').

[   28.719563] systemd[1]: Starting DietPi-FS_partition_resize...
[   28.728812] systemd[1]: Condition check resulted in Rebuild Hardware Database being skipped.
[   28.733314] systemd[1]: Condition check resulted in Platform Persistent Storage Archival being skipped.
[   28.745121] systemd[1]: Starting Load/Save Random Seed...
[   28.753634] systemd[1]: Starting Apply Kernel Variables...
[   28.759989] systemd[1]: Starting Create System Users...
[   28.767674] systemd[1]: Finished Set the console keyboard layout.
[   28.771770] systemd[1]: Started Journal Service.
[   29.682669] panfrost ffe40000.gpu: error -ENODEV: _opp_set_regulators: no regulator (mali) found
[  121.646179] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  121.647757] rcu: 3-...0: (0 ticks this GP) idle=dc04/1/0x4000000000000000 softirq=2034/2034 fqs=3002
[  121.656896] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-3): P834/2:b..l

https://pastebin.com/70uieBzt
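To compare captures like the one above, the timestamp of the first stall report can be pulled out of a saved serial log. This is only a sketch: the function name is made up, and the pattern is based on the `rcu: INFO: rcu_preempt detected stalls` line format seen in this log.

```shell
# Sketch: print the kernel timestamp (seconds since boot) of the first
# RCU stall report read from standard input; exit status 1 if none is found.
first_rcu_stall() {
    awk '/rcu: INFO: rcu_[a-z]+ detected stalls/ {
        # Pull the "[  121.646179]" style timestamp out of the line.
        if (match($0, /\[ *[0-9]+\.[0-9]+\]/)) {
            ts = substr($0, RSTART + 1, RLENGTH - 2)
            gsub(/ /, "", ts)
            print ts
            found = 1
            exit
        }
    }
    END { exit !found }'
}

# Example (hypothetical file name of a saved serial capture):
# first_rcu_stall < serial-console.log
```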

MichaIng commented 1 year ago

Nasty, a CPU stall, this time much earlier. So it indeed seems to happen at a random time, and here even before the CPU governor is changed, i.e. the governor is not related. That stall seems to have happened in your earlier log as well, which explains the long runtime I was wondering about.

Can you replicate the issue with one of the Armbian Bullseye images? https://www.armbian.com/odroid-hc4/ Should be mostly identical, only the bootloader is different, containing an optionally installable SPI U-Boot image, AFAIK.

ctrlbreak- commented 1 year ago

I'm still trying to learn/understand this Petitboot / U-boot stuff.

I've tried to boot "Armbian_23.02.2_Odroidhc4_bullseye_current_6.1.11.img" and the result of the serial console is at the following:

https://pastebin.com/NNMJGNMX

... it does not appear to fully boot (AFAICT). Perhaps has something to do with...

image

It's not hung, as per the heartbeat LED (watchdog?), but I can't really do anything.

MichaIng commented 1 year ago

This looks like it didn't really finish petitboot. Does the petitboot GUI show up on HDMI, or a timer (10 seconds by default) during which you can hit some keys to open the petitboot GUI?

The hint on the Armbian website is true for their images, but not necessarily for ours. Our boot scripts do generally support petitboot, but it is not 100% reliable for unknown reasons.

ctrlbreak- commented 1 year ago

Correct. I get a petitboot txt based UI menu with some options and the ability to drop to a shell, (which is what I now realize the console has dropped to as well).

Petitboot has a 10sec default timeout on the menu, but doesn't do anything after that :-/

EDIT - I now have a monitor as well as serial console connected to this brand new HC4 on my bench. I'll leave this as long as I can to try and assist in resolving things.

ctrlbreak- commented 1 year ago

I flipped sd cards and tried DietPi bullseye again with a console. It again hung in this fashion:

[   73.672135] systemd[1]: Condition check resulted in Rebuild Hardware Database being skipped.
[   73.678371] systemd[1]: Condition check resulted in Platform Persistent Storage Archival being skipped.
[   73.689085] systemd[1]: Starting Load/Save Random Seed...
[   73.696366] systemd[1]: Starting Apply Kernel Variables...
[   73.702806] systemd[1]: Starting Create System Users...
[   73.708900] systemd[1]: Started Journal Service.
[   74.683077] panfrost ffe40000.gpu: error -ENODEV: _opp_set_regulators: no regulator (mali) found
[  137.330214] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  137.331955] rcu:     0-...!: (0 ticks this GP) idle=e80c/1/0x4000000000000000 softirq=2919/2919 fqs=99
[  137.340923] rcu:     1-...!: (3 ticks this GP) idle=e10c/1/0x4000000000000000 softirq=2376/2379 fqs=99
[  137.350014] rcu: rcu_preempt kthread timer wakeup didn't happen for 14799 jiffies! g873 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  137.361105] rcu:     Possible timer handling issue on cpu=2 timer-softirq=780
[  137.367919] rcu: rcu_preempt kthread starved for 14800 jiffies! g873 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
[  137.378182] rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[  137.387238] rcu: RCU grace-period kthread stack dump:
[  137.392294] rcu: Stack dump where RCU GP kthread last ran:

I noticed that the heartbeat LED was still flashing after a bit, so I tried to send some data via a USB keyboard... no change. I then tried to send some data via my serial console and the unit immediately halted (LED stopped flashing).

MichaIng commented 1 year ago

Ah, of course the Armbian image did not work with petitboot 😅. Could you try to boot it while holding the MASK key to skip the SPI bootloader?

For the DietPi image, I think petitboot can be ruled out as the culprit, as the same issue seems to have appeared before your first post. But to be sure, you can also boot it with the MASK key and see whether HDMI and serial console show the same hang/CPU stall and/or kernel oops.

ctrlbreak- commented 1 year ago

Ahhh... found the mask key and swapped back in the Armbian SD card. Booted without issue and seems stable so far. I'll do the DietPi image here in a few moments.

ctrlbreak- commented 1 year ago

It's petitboot somehow.

Using the mask key, it booted into DietPi without issue and is currently running automated upgrades and whatnot.

EDIT : I'll let this do its thing and sit idle for a couple hours (I have other things to attend to) and keep a console logging to see if something else hangs?

MichaIng commented 1 year ago

Interesting, though not great. Seems like we need to keep adding instructions on our download page/docs about how to erase petitboot on Odroid HC4 😞.

I do not really have an idea how the bootloader can have an effect on the stability once the kernel has already been loaded, but obviously it has.

Gui-Yom commented 1 year ago

I just received my Odroid HC4 and ran into the same problems as @ctrlbreak-. I had quite the ride this weekend trying to set up a zfs pool.

Also, the current Armbian image has problems with the SPI flash drivers (no /dev/mtd0), which made erasing the flash (flash_erase /dev/mtd0 0 0) problematic, as I had no keyboard for the Petitboot shell.

I resorted to installing Armbian from the edge builds where the drivers were working (and then went back to the current armbian distro because I had issues with finding the correct linux-headers + kernel 6.2 had license issues with zfs).

I'll try booting DietPi now that I have U-boot in the SPI. (If I get my hands on another sd card)

MichaIng commented 1 year ago

@Gui-Yom The SPI flash should be called /dev/mtdblock0, and as a simple block device it can be erased with dd:

dd if=/dev/zero of=/dev/mtdblock0
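Since this overwrites whatever the target points at, a cautious variant first checks that the device node exists as a block device. This is a sketch with a made-up function name; it assumes, as described above, that /dev/mtdblock0 is the SPI flash on the board in question.

```shell
# Sketch: zero a block device only after confirming that it is one.
# The device must be passed explicitly; nothing is erased by default.
erase_block_device() {
    if [ -z "$1" ] || [ ! -b "$1" ]; then
        echo "error: '$1' is not a block device (MTD driver missing?)" >&2
        return 1
    fi
    # dd stops with "No space left on device" once the whole device has
    # been written; that message is expected here.
    dd if=/dev/zero of="$1" bs=4096
    sync
}

# Example (destructive, run only on the intended board):
# erase_block_device /dev/mtdblock0
```

On a kernel without the MTD device (as reported for the Armbian "current" image), the check fails early with a message instead of dd creating a regular file under /dev.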

The edge Linux headers package should be:

apt install linux-headers-edge-meson64

What do you mean by "drivers were working"? Do you mean that while the Armbian "current" images are failing with the same CPU stall and/or kernel oops, the "edge" images are working? I could create DietPi images with the edge kernel for testing. But the edge kernel packages are still at Linux 6.2.0-rc3 (compared to 6.2.14).

(Open)ZFS always had and has the license issue with Linux: Its CDDL license is incompatible with Linux' GPL. But you can install and keep it updated quite comfortably via DKMS (which requires the matching headers package):

apt install zfs-dkms zfsutils-linux
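Before running that, it can help to check that headers for the running kernel are actually present, since DKMS cannot build the module without them. The following is a rough sketch: it assumes the headers are reachable through the conventional /lib/modules/<release>/build link, and the function name and the optional root argument (used only for testing) are made up.

```shell
# Sketch: check whether build headers for a kernel release appear to be installed.
# Assumption: they are reachable via /lib/modules/<release>/build.
headers_present() {
    release=${1:-$(uname -r)}
    root=${2:-}    # optional alternative root directory, for testing only
    [ -e "$root/lib/modules/$release/build" ]
}

# Example:
# headers_present || echo "no headers found for $(uname -r)"
```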

I'm very interested in whether DietPi with mainline U-Boot on SPI works as well as U-Boot on MMC. It would also be interesting to know whether petitboot on SPI is able to boot from USB and/or SATA without these issues.

Gui-Yom commented 1 year ago

The SPI flash should be called /dev/mtdblock0

Indeed, but no mtd devices were present. (tested with ls /dev/mtd*)

What do you mean by "drivers were working"? Do you mean that while the Armbian "current" images are failing with the same CPU stall and/or kernel oops, the "edge" images are working?

With armbian edge images, I could get access to /dev/mtdblock0 and erase it. I held the boot switch to bypass Petitboot each time before I could erase the flash.

But the edge kernel packages are still at Linux 6.2.0-rc3 (compared to 6.2.14).

Yep, I couldn't find kernel headers for 6.2.13.

(Open)ZFS always had and has the license issue with Linux: Its CDDL license is incompatible with Linux' GPL. But you can install and keep it updated quite comfortably via DKMS (which requires the matching headers package)

With kernel 6.2 (https://github.com/torvalds/linux/commit/aaeca98456431a8d9382ecf48ac4843e252c07b3), openzfs 2.1.11 won't build by default (https://github.com/openzfs/zfs/issues/14555). Which is why I then went back to armbian current. (+ I couldn't get matching headers)

Very sorry if I'm mistaken, I'm a noob. I'll be happy to help nonetheless.

MichaIng commented 1 year ago

If I understood correctly, the same works with DietPi and Armbian "current" images, doesn't it? Both should show /dev/mtdblock0, so it can be flashed/erased via dd, which is even easier than with /dev/mtd0, which requires mtd-utils to flash/erase.

Yep, I couldn't find kernel headers for 6.2.13.

Ah yes, that is right, since those community images do not use the APT packages from the repo but have local kernel builds. If they do not include headers, one needs to get them manually from upstream sources + Armbian patches. But then one can just build the whole image + headers manually...

You could downgrade the kernel to the APT package from repo:

apt install --reinstall linux-{image,dtb,headers}-edge-meson64

But again, does edge really fix something compared to "current", when taking petitboot out of the game?

Ah, nasty thing with the GPL-enforcing symbols now breaking ZFS module builds. For your personal builds you can of course patch either the Linux headers or the ZFS sources to make them compatible. This could even be automated on the kernel sources with an early-executing /etc/kernel/postinst.d/00-* script which replaces EXPORT_SYMBOL_GPL in the conflicting files with EXPORT_SYMBOL_COPYLEFT before DKMS runs.

ctrlbreak- commented 1 year ago

FWIW, I installed ZFS on the current DietPi Bullseye build by doing the following:

apt install linux-headers-current-meson64
apt install -t bullseye-backports zfs-dkms
apt install -t bullseye-backports zfsutils-linux

Successfully builds the ZFS kernel module:

filename:       /lib/modules/6.1.11-meson64/updates/dkms/zfs.ko.xz
version:        2.1.9-3~bpo11+1
license:        CDDL
author:         OpenZFS
description:    ZFS

Cheers.

etlfg commented 1 year ago

It either hangs or crashes when editing /etc/fstab or doing a systemctl daemon-reload. I added some error handling to this part of the script: ee4d999 EDIT: Enhanced the code line and fixed an accident from a few days ago: 05b1cd3 If you have a chance, you could apply this change to the flashed image before booting, so we see the exact command it fails at.

I guess this code is in the latest DietPi release. I flashed the image I downloaded today

sha256sum DietPi_OdroidC4-ARMv8-Bookworm.img
e1302d865ca8fea56c98500091dae44d3b38f4d4f729632b7b5aa5bc5c5bccb2  DietPi_OdroidC4-ARMv8-Bookworm.img
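For others who want to confirm they are testing the same image, a digest comparison can be scripted as below. This is a sketch with a made-up function name, and it only helps if a reference digest is available from a source you trust.

```shell
# Sketch: compare a file's SHA-256 digest with an expected value.
verify_sha256() {
    # $1 = file, $2 = expected hex digest
    actual=$(sha256sum "$1" | awk '{print $1}') || return 1
    [ "$actual" = "$2" ]
}

# Example, using the digest posted above:
# verify_sha256 DietPi_OdroidC4-ARMv8-Bookworm.img \
#     e1302d865ca8fea56c98500091dae44d3b38f4d4f729632b7b5aa5bc5c5bccb2
```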

The result in /etc/fstab is as follows:

less /media/etlfg/bd1f6935-aa05-427e-9dec-eef7cd2fea69/etc/fstab 
# You can use "dietpi-drive_manager" to setup mounts.
# NB: It overwrites and re-creates physical drive mount entries on use.
#----------------------------------------------------------------
# NETWORK
#----------------------------------------------------------------

#----------------------------------------------------------------
# TMPFS
#----------------------------------------------------------------
tmpfs /tmp tmpfs size=1833M,noatime,lazytime,nodev,nosuid,mode=1777
tmpfs /var/log tmpfs size=50M,noatime,lazytime,nodev,nosuid

#----------------------------------------------------------------
# MISC: ecryptfs, vboxsf, glusterfs, mergerfs, bind, Btrfs subvolume
#----------------------------------------------------------------

#----------------------------------------------------------------
# SWAP SPACE
#----------------------------------------------------------------

#----------------------------------------------------------------
# PHYSICAL DRIVES
#----------------------------------------------------------------
UUID=bd1f6935-aa05-427e-9dec-eef7cd2fea69 / ext4 noatime,lazytime,rw 0 1
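The SWAP SPACE section in this listing is empty, so no swap entry was written before the hang. A quick way to check this on a mounted card is sketched below; the function name is made up, and it simply looks for a non-comment line whose third field (the filesystem type) is swap.

```shell
# Sketch: succeed (exit status 0) if an fstab file contains an active swap entry.
has_swap_entry() {
    # $1 = path to the fstab file
    awk '!/^[[:space:]]*#/ && $3 == "swap" { found = 1 } END { exit !found }' "$1"
}

# Example (hypothetical mount point of the SD card's root filesystem):
# has_swap_entry /mnt/sdcard/etc/fstab || echo "no swap entry written"
```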

But it is also possible that it just crashed, e.g. due to some voltage/load issue. Have you tried a different PSU or power cable? And did you successfully test other distros already?

Latest Armbian Bookworm (Armbian_23.8.1_Odroidhc4_bookworm_current_6.1.50.img) works perfectly.

MichaIng commented 11 months ago

Sorry for the late reply. @etlfg if you still face this issue, does it help in your case to bypass petitboot with the MASK key?

On Odroid N2, the "current" kernel still misses the MTD device, but the "edge" kernel has it, which you can hence use to erase the SPI flash.

There is btw a new petitboot version. So in case USB/SATA boot is needed, it could be tested whether this solves the stability issues: https://forum.odroid.com/viewtopic.php?p=379379#p379379 EDIT: Nice, this new version solves the issue that I was not able to boot the Odroid N2 from eMMC via petitboot. Generally worth giving it a try.