Hexxeh / rpi-firmware

Firmware files for the Raspberry Pi

CPU1: failed to come online with 5.4.51-v7l+ #232

Closed: wagnerch closed this issue 3 years ago

wagnerch commented 3 years ago

Appears to be something in commit 7059841, because 8382ece booted fine, and I ended up going back to da3752a (5.4.49) and that is also fine.

kern.log

Jul 17 20:54:17 kernel: [    0.000000] Booting Linux on physical CPU 0x0
Jul 17 20:54:18 kernel: [    0.000000] Linux version 5.4.51-v7l+ (dom@buildbot) (gcc version 4.9.3 (crosstool-NG crosstool-ng-1.22.0-88-g8460611)) #1326 SMP Fri Jul 17 10:51:18 BST 2020
Jul 17 20:54:18 kernel: [    0.000000] CPU: ARMv7 Processor [410fd083] revision 3 (ARMv7), cr=30c5383d
Jul 17 20:54:18 kernel: [    0.000000] CPU: div instructions available: patching division code
Jul 17 20:54:18 kernel: [    0.000000] CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache
Jul 17 20:54:18 kernel: [    0.000000] OF: fdt: Machine model: Raspberry Pi 4 Model B Rev 1.2
...
Jul 17 20:54:18 kernel: [    0.003438] CPU: Testing write buffer coherency: ok
Jul 17 20:54:18 kernel: [    0.003949] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
Jul 17 20:54:18 kernel: [    0.004818] Setting up static identity map for 0x200000 - 0x20003c
Jul 17 20:54:18 kernel: [    0.005022] rcu: Hierarchical SRCU implementation.
Jul 17 20:54:18 kernel: [    0.005678] smp: Bringing up secondary CPUs ...
Jul 17 20:54:18 kernel: [    1.041640] CPU1: failed to come online
Jul 17 20:54:18 kernel: [    1.042925] CPU2: thread -1, cpu 2, socket 0, mpidr 80000002
Jul 17 20:54:18 kernel: [    1.044188] CPU3: thread -1, cpu 3, socket 0, mpidr 80000003
Jul 17 20:54:18 kernel: [    1.044333] smp: Brought up 1 node, 3 CPUs
Jul 17 20:54:18 kernel: [    1.044390] SMP: Total of 3 processors activated (324.00 BogoMIPS).
Jul 17 20:54:18 kernel: [    1.044415] CPU: All CPU(s) started in HYP mode.
Jul 17 20:54:18 kernel: [    1.044437] CPU: Virtualization extensions available.
popcornmix commented 3 years ago

I believe I'm on that version:

pi@pi4:~ $ uname -a
Linux domnfs 5.4.51-v7l+ #1326 SMP Fri Jul 17 10:51:18 BST 2020 armv7l GNU/Linux
pi@pi4:~ $ vcgencmd version
Jun 10 2020 17:47:19 
Copyright (c) 2012 Broadcom
version e46bba1638cca2708b31b9daf4636770ef981735 (clean) (release) (start)
[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Linux version 5.4.51-v7l+ (dom@buildbot) (gcc version 4.9.3 (crosstool-NG crosstool-ng-1.22.0-88-g8460611)) #1326 SMP Fri Jul 17 10:51:18 BST 2020
[    0.000000] CPU: ARMv7 Processor [410fd083] revision 3 (ARMv7), cr=30c5383d
[    0.000000] CPU: div instructions available: patching division code
[    0.000000] CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache
[    0.000000] OF: fdt: Machine model: Raspberry Pi 4 Model B Rev 1.2
...
[    0.003385] CPU: Testing write buffer coherency: ok
[    0.003874] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
[    0.004715] Setting up static identity map for 0x200000 - 0x20003c
[    0.004915] rcu: Hierarchical SRCU implementation.
[    0.005550] smp: Bringing up secondary CPUs ...
[    0.006707] CPU1: thread -1, cpu 1, socket 0, mpidr 80000001
[    0.007980] CPU2: thread -1, cpu 2, socket 0, mpidr 80000002
[    0.009185] CPU3: thread -1, cpu 3, socket 0, mpidr 80000003
[    0.009328] smp: Brought up 1 node, 4 CPUs
[    0.009398] SMP: Total of 4 processors activated (432.00 BogoMIPS).
[    0.009424] CPU: All CPU(s) started in HYP mode.
[    0.009447] CPU: Virtualization extensions available.

I don't see anything in the list of changes that seems likely to cause this. Does it happen every time, or was it just once?

wagnerch commented 3 years ago

I don't see anything in the list of changes that seems likely to cause this. Does it happen every time, or was it just once?

I rebooted twice, and both times it came back with a CPU offline. Honestly, I didn't even notice it until this morning when I ran htop. The kernel log shows that every boot other than the July 17th kernel build comes up fine. I am wondering if it is something with the bootloader, which seems to have been updated as well. I rolled everything back to the 5.4.49 commit.

Edit: When I say bootloader, I guess I am talking about maybe the second stage? /boot/start.elf

wagnerch commented 3 years ago

The other thing I noticed is that there are only 3 berries at boot. First reboot: 4 berries; second reboot: 3 berries, so far.

popcornmix commented 3 years ago

nproc will return the number of processors detected, and the number of berries will match that. It would be useful to do a number of reboots on https://github.com/Hexxeh/rpi-firmware/commit/70598414a73afab7f8e521c358a7cfd5ffb65d3e and a number of reboots on https://github.com/Hexxeh/rpi-firmware/commit/8382ece2b30be0beb87cac7f3b36824f194d01e9, and note how many processors are detected with each.
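
For keeping a tally across reboots, one option (just a sketch; the log file name is arbitrary and it assumes vcgencmd is on root's PATH) is to append a line from /etc/rc.local on every boot and compare the logs for the two commits:

# append kernel version, firmware hash and detected core count on each boot
echo "$(uname -r) $(vcgencmd version | tail -1) cores=$(nproc)" >> /boot/cpu-count.log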

wagnerch commented 3 years ago

OK, looks like it may have been introduced with 8382ece (5.4.51). Just counting berries on the screen this is what I found:

5.4.50 bafd743 4444444444, all 10 reboots came up with 4 berries
5.4.51 8382ece 4433343433, 6/10 reboots came up with 3 berries
popcornmix commented 3 years ago

Can you confirm whether it was a firmware or kernel change? e.g. starting from 8382ece, sudo SKIP_KERNEL=1 rpi-update bafd743 should give you the firmware from bafd743 but still the kernel from 8382ece. Is that combination okay?

wagnerch commented 3 years ago

Commands executed:

sudo \
SKIP_WARNING=1 \
UPDATE_SELF=0 \
    rpi-update 8382ece

sudo \
SKIP_WARNING=1 \
UPDATE_SELF=0 \
SKIP_KERNEL=1 \
    rpi-update bafd743

$ reboot
$ vcgencmd version
Jul  2 2020 14:59:18
Copyright (c) 2012 Broadcom
version 36c8be9515deddc9d2b1f469374f00d0a2df13f9 (clean) (release) (start)

$ uname -r
5.4.51-v7l+

4/6 reboots resulted in 3 berries.

popcornmix commented 3 years ago

I believe that suggests a kernel issue starting from 8382ece. @pelwell, any thoughts? https://github.com/raspberrypi/linux/pull/3703/commits/cc5c7ce6d3218cab2b886364a824471b2acef277 ?

wagnerch commented 3 years ago

@popcornmix Reverted that commit, rebuilt 5.4.51, rebooted 10 times and no problems. All 4 cores are coming up with all 10 reboots.

$ vcgencmd version
Jul 17 2020 10:59:17
Copyright (c) 2012 Broadcom
version 21a15cb094f41c7506ad65d2cb9b29c550693057 (clean) (release) (start)

$ uname -rmvs
Linux 5.4.51-v7l+ #1 SMP Sat Jul 18 15:19:16 UTC 2020 armv7l
popcornmix commented 3 years ago

Just to be absolutely sure, with your own built kernel and that commit not reverted, is it failing?

wagnerch commented 3 years ago

I haven't tried it, but the base was commit 9d49ae69a1448f2180229b82794bfaa1c78679f7.

commit 948290923306a7302a14869beae7a560f67cef94 (HEAD -> rpi-5.4.y)
Author: Chad Wagner <wagnerch42@gmail.com>
Date:   Sat Jul 18 11:10:49 2020 -0400

    Revert "irqchip/bcm2835: Quiesce IRQs left enabled by bootloader"

    This reverts commit d178d70080f4691a4a5cb69b116d9b7fba4b5e16.

commit 9d49ae69a1448f2180229b82794bfaa1c78679f7 (raspberrypi/rpi-5.4.y)
Author: Phil Elwell <phil@raspberrypi.com>
Date:   Fri Jul 17 17:56:17 2020 +0100

    configs: Add MAXIM_THERMOCOUPLE=m

    See: https://github.com/raspberrypi/linux/issues/3732

    Signed-off-by: Phil Elwell <phil@raspberrypi.com>
wagnerch commented 3 years ago

Rebuilt using raspberrypi/linux@9d49ae69a144 and it's also totally fine (all CPUs are coming up for 10 reboots). So I used rpi-update to switch back to 8382ece and it still has the CPU problem about 60% of the time.

Either something on rpi-5.4.y fixed it after the build (seems unlikely since I only see one additional commit) or maybe the build host has a local git repository that is out of sync. Or something else?

popcornmix commented 3 years ago

Are you updating the kernel, modules and dtbs after building your own? If you start with a problematic rpi-update version and then update with your built kernel, does that cure the problem?

pelwell commented 3 years ago

I don't buy that https://github.com/raspberrypi/linux/commit/cc5c7ce6d3218cab2b886364a824471b2acef277 is the cause - armctrl_of_init isn't even called on Pi 4, and neither is bcm2836_arm_irqchip_l1_intc_of_init. However, in confirming this I did just see CPU1 fail to come online:

[    0.004958] rcu: Hierarchical SRCU implementation.
[    0.009544] smp: Bringing up secondary CPUs ...
[    1.041633] CPU1: failed to come online
[    1.043883] CPU2: thread -1, cpu 2, socket 0, mpidr 80000002
[    1.046066] CPU3: thread -1, cpu 3, socket 0, mpidr 80000003
[    1.046212] smp: Brought up 1 node, 3 CPUs

With extra debugging (not emitted either) it started OK - something seems a bit marginal.

I've also noticed some reboot failures - the firmware stopping with 7 short flashes, which means "kernel not found".

wagnerch commented 3 years ago

I generally do not build the kernel; usually what you guys provide is fine. But yes, I update the modules, dtbs, dtb overlays, and kernel. This is the script I use to cross-compile from Ubuntu 18.04 (x64); it spits out a tarball that can be unrolled from "/" on the Pi. I run it with an argument of "arm", and I am using the GCC 7.5.0 cross-compiler: arm-linux-gnueabihf-gcc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0

#!/bin/bash
# Cross-compile a Raspberry Pi kernel (arm or arm64) and package it as a
# tarball that can be unrolled from "/" on the Pi.
ARCH=$1
# Select the kernel image name and cross toolchain from the requested architecture
case $ARCH in
   arm64)
      KERNEL=kernel8
      CROSS_COMPILE=aarch64-linux-gnu-
      ;;
   arm)
      KERNEL=kernel7l
      CROSS_COMPILE=arm-linux-gnueabihf-
      ;;
   *)
      echo "No architecture specified."
      exit 1
      ;;
esac

REV=$(git rev-parse --short HEAD)
KBASE=/tmp/kernel
# Use 1.5x the host's core count for parallel make jobs
NPROC=$(/usr/bin/nproc)
NPROC=$(( NPROC + NPROC / 2 ))

# Start with a clean staging directory
test -d ${KBASE} && rm -fr ${KBASE}
mkdir -p ${KBASE}/boot/overlays

# Configure and build the kernel, modules and device trees, staging the modules under ${KBASE}
make -j${NPROC} ARCH=${ARCH} CROSS_COMPILE=${CROSS_COMPILE} bcm2711_defconfig
KVER=$(make -j${NPROC} ARCH=${ARCH} CROSS_COMPILE=${CROSS_COMPILE} -s kernelrelease)
make -j${NPROC} ARCH=${ARCH} CROSS_COMPILE=${CROSS_COMPILE} Image modules dtbs
make -j${NPROC} ARCH=${ARCH} CROSS_COMPILE=${CROSS_COMPILE} INSTALL_MOD_PATH=${KBASE} modules_install

# Drop the build/source symlinks and stage the kernel image, dtbs and overlays
rm -f ${KBASE}/lib/modules/${KVER}/build ${KBASE}/lib/modules/${KVER}/source
cp arch/${ARCH}/boot/Image ${KBASE}/boot/${KERNEL}.img
case $ARCH in
   arm64)
      cp arch/${ARCH}/boot/dts/broadcom/*.dtb ${KBASE}/boot/
      ;;
   arm)
      cp arch/${ARCH}/boot/dts/*.dtb ${KBASE}/boot/
      ;;
esac
cp arch/${ARCH}/boot/dts/overlays/*.dtb* ${KBASE}/boot/overlays/
cp arch/${ARCH}/boot/dts/overlays/README ${KBASE}/boot/overlays/
# Package everything for extraction from "/" on the Pi, then clean the source tree
tar cvzf ~/kernel-${ARCH}-${KVER}-${REV}.tar.gz -C ${KBASE} .
make -j${NPROC} ARCH=${ARCH} CROSS_COMPILE=${CROSS_COMPILE} mrproper
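
On the Pi, installing the result is just a matter of unrolling the tarball and rebooting (a sketch; the exact tarball name comes from the build output):

sudo tar xzf kernel-arm-<kver>-<rev>.tar.gz -C /
sudo reboot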

It is always CPU 1 that fails to come online:

Jul 17 20:54:18 kernel: [    1.041640] CPU1: failed to come online
Jul 18 08:27:28 kernel: [    1.041632] CPU1: failed to come online
Jul 18 09:45:23 kernel: [    1.041638] CPU1: failed to come online
Jul 18 09:47:32 kernel: [    1.041633] CPU1: failed to come online
Jul 18 09:50:32 kernel: [    1.041635] CPU1: failed to come online
Jul 18 09:52:24 kernel: [    1.041633] CPU1: failed to come online
Jul 18 10:02:18 kernel: [    1.041637] CPU1: failed to come online
Jul 18 10:02:42 kernel: [    1.041635] CPU1: failed to come online
Jul 18 10:04:22 kernel: [    1.041634] CPU1: failed to come online
Jul 18 10:05:00 kernel: [    1.041636] CPU1: failed to come online
Jul 18 10:05:41 kernel: [    1.041635] CPU1: failed to come online
Jul 18 10:06:04 kernel: [    1.041631] CPU1: failed to come online
Jul 18 10:34:06 kernel: [    1.041638] CPU1: failed to come online
Jul 18 10:34:24 kernel: [    1.041634] CPU1: failed to come online
Jul 18 10:34:46 kernel: [    1.041636] CPU1: failed to come online
Jul 18 10:35:31 kernel: [    1.041634] CPU1: failed to come online
Jul 18 14:53:17 kernel: [    1.041640] CPU1: failed to come online
Jul 18 14:55:20 kernel: [    1.041637] CPU1: failed to come online
Jul 18 14:55:47 kernel: [    1.041639] CPU1: failed to come online
Jul 18 14:56:11 kernel: [    1.041637] CPU1: failed to come online
Jul 18 14:57:23 kernel: [    1.041634] CPU1: failed to come online
Jul 18 14:57:48 kernel: [    1.041635] CPU1: failed to come online
Jul 18 14:58:35 kernel: [    1.041633] CPU1: failed to come online
popcornmix commented 3 years ago

From the description of the problem, firmware seems more likely, but the rpi-update test didn't seem to confirm that. You could try with everything at the latest, then manually copy the firmware (start4.elf/fixup4.dat) from a good commit and see if the problem is resolved.

MikeDB1 commented 3 years ago

Could anybody tell me why I have "You’re not receiving notifications from this thread." but get an email for every post?

pelwell commented 3 years ago

No - sorry. That's GitHub, not us.

wagnerch commented 3 years ago

I think we did that up in this comment: https://github.com/Hexxeh/rpi-firmware/issues/232#issuecomment-660492994

Gave it another go, just went about it a different way:

sudo \
SKIP_WARNING=1 \
UPDATE_SELF=0 \
    rpi-update

mkdir -p /tmp/firmware
curl -L -A curl https://github.com/Hexxeh/rpi-firmware/tarball/bafd743eeb3e8a2a863936594cd7201a0af136fa | tar xzf - -C "/tmp/firmware" --strip-components=1
cd /tmp/firmware
cp -p *.elf /boot/
cp -p *.dat /boot/
cp -p *.bin /boot/

bafd743e is the known working one for me; with it, 10/10 reboots come up with all 4 cores. In this scenario, 40% of reboots failed to bring up CPU 1.

$ vcgencmd version
Jul  2 2020 14:59:18
Copyright (c) 2012 Broadcom
version 36c8be9515deddc9d2b1f469374f00d0a2df13f9 (clean) (release) (start)

$ uname -rmvs
Linux 5.4.51-v7l+ #1326 SMP Fri Jul 17 10:51:18 BST 2020 armv7l
pelwell commented 3 years ago

I just got a failure with the "suspect" commit (https://github.com/raspberrypi/linux/commit/cc5c7ce6d3218cab2b886364a824471b2acef277) reverted, so it isn't that.

pelwell commented 3 years ago

In the bad state, retrying the wake of CPU1 doesn't help. Working backwards through releases:

I'll test the 634e380a kernel with the latest firmware next, then vice versa, and take it from there.

popcornmix commented 3 years ago

One core failing to start feels like an arm reset issue. We had this in the early days. Some chips were more susceptible than others. The fix involved ensuring one of the stb clocks was running prior to the (synchronous) arm reset. So my guess was firmware, and a commit related to changing when clocks were enabled.

But that doesn't seem to tie up with either of the tests.

wagnerch commented 3 years ago

Interesting, because da3752a is pretty solid for me. I didn't spend 30 minutes rebooting it, but I did 10 cycles and never had a problem. The other thing is that if I build 5.4.51 with the GCC 7.5.0 (Ubuntu/Linaro) cross compiler I also have no issues. I don't know if the later version of the toolchain happens to be slightly more or less optimized and it's just chance timing.

pelwell commented 3 years ago

New kernel with 634e380 firmware fails. New firmware with 634e380 kernel also fails eventually. So I retested 634e380 as a whole and did eventually get it to fail.

The first 5.4 release (f0236cc) rebooted all night, and I'll continue to bisect through the day.

timg236 commented 3 years ago

Does this ever fail with the 4.19 kernel? I've looked through the ARM and clock-related firmware changes and so far failed to reproduce the failure there. Although, it's possible that both the latest firmware and 5.4 are required.

pelwell commented 3 years ago

Having found what I consider to be a 5.4 LKG (last known good), I'm now working forwards, not backwards.

pelwell commented 3 years ago

Update: I think (and I can't be categorical because of the probabilistic nature of the failure) I've isolated the problem change to the kernel portion of https://github.com/Hexxeh/rpi-firmware/commit/a50c7d5eebb351d16665eabcedad992cdc167537 ("Bump to 5.4.45").

These commit hashes aren't all present in the current tree due to rebasing, but the last known good release is 3f54521ea and the 5.4.45 release is 9be502df. Comparing those two, 9be502df adds the following:

  Upstream:
    3604bc0 Linux 5.4.45
    40caf1b net: smsc911x: Fix runtime PM imbalance on error
    2528015 selftests: mlxsw: qos_mc_aware: Specify arping timeout as an integer
    aea1423 net: ethernet: stmmac: Enable interface clocks on probe for IPQ806x
    6992c89 net/ethernet/freescale: rework quiesce/activate for ucc_geth
    6a90489 null_blk: return error for invalid zone size
    b5cb7fe s390/mm: fix set_huge_pte_at() for empty ptes
    c0063f39 drm/edid: Add Oculus Rift S to non-desktop list
    c90e773 net: bmac: Fix read of MAC address from ROM
    92c09e8 x86/mmiotrace: Use cpumask_available() for cpumask_var_t variables
    ba55015 io_uring: initialize ctx->sqo_wait earlier
    f1c5821 i2c: altera: Fix race between xfer_msg and isr thread
    1857d7d scsi: pm: Balance pm_only counter of request queue during system resume
    1610cd9 evm: Fix RCU list related warnings
    31ca642 ARC: [plat-eznps]: Restrict to CONFIG_ISA_ARCOMPACT
    935ba01 ARC: Fix ICCM & DCCM runtime size checks
    8a69220 RDMA/qedr: Fix synchronization methods and memory leaks in qedr
    49e9267 RDMA/qedr: Fix qpids xarray api used
    0377fda s390/ftrace: save traced function caller
    0734b58 ASoC: intel - fix the card names
    6106585 spi: dw: use "smp_mb()" to avoid sending spi data error
    99c63ba powerpc/xmon: Restrict when kernel is locked down
    f2adfe1 powerpc/powernv: Avoid re-registration of imc debugfs directory
    a293045 scsi: hisi_sas: Check sas_port before using it
    cfd5ac76 drm/i915: fix port checks for MST support on gen >= 11
    74028c9 airo: Fix read overflows sending packets
    63ad3fb net: dsa: mt7530: set CPU port to fallback mode
    d628f7a scsi: ufs: Release clock if DMA map fails
    95ffc2a media: staging: ipu3-imgu: Move alignment attribute to field
    5b6e152 media: Revert "staging: imgu: Address a compiler warning on alignment"
    a122eef mmc: fix compilation of user API
    1c44e6e kernel/relay.c: handle alloc_percpu returning NULL in relay_open
    91e863a mt76: mt76x02u: Add support for newer versions of the XBox One wifi adapter
    8a6744e p54usb: add AirVasT USB stick device-id
    ac09eae HID: i2c-hid: add Schneider SCL142ALM to descriptor override
    3e8410c HID: multitouch: enable multi-input as a quirk for some devices
    aa0dd0e HID: sony: Fix for broken buttons on DS3 USB dongles
    df4988a mm: Fix mremap not considering huge pmd devmap
    3209e3e Revert "cgroup: Add memory barriers to plug cgroup_rstat_updated() race window"

  Downstream:
    9be502d w1_therm: remove redundant assignments to variable ret
    cd9e064 w1_therm: Free the correct variable
    525d235 w1_therm: adding bulk read support to trigger multiple conversion on bus
    6272c0b w1_therm: adding alarm sysfs entry
    56d2e43 w1_therm: optimizing temperature read timings
    0e55ffd w1_therm: adding eeprom sysfs entry
    6bc69d4 w1_therm: adding resolution sysfs entry
    fadb881 w1_therm: adding ext_power sysfs entry
    0931a4c5 w1_therm: fix reset_select_slave during discovery
    0a6dbaa w1_therm: adding code comments and code reordering
    3ee63cb overlays: Update upstream overlays after vc4-kms-v3d change
    20509f5 overlays: i2c-gpio: Avoid open-drain warnings
    7744086 Revert "overlays: gpio-keys: Avoid open-drain warnings"
    46b071e snd_bcm2835: disable HDMI audio when vc4 is used (#3640)
    0654fb6 vc4: cec: Restore cec physical address on reconnect
    4203e65 staging: vchiq_arm: Use g_dma_dev for dma_unmap_sg

None of those commits stand out as obvious candidates, but I think we can rule out many of them as being either for the wrong platform (i.e. not compiled) or affecting code not yet run at the point of failure. I just hope it isn't a code placement problem.

pelwell commented 3 years ago

Having moved to the 5.4.47 release (dec0ddc5) after failing to find a bad commit in 5.4.45, I think we have a culprit:

* d79f26f99acb is the last known good commit, and
* 2c2a2ea4d585 is the first bad commit.

It's a plausible result because it's a downstream patch that applies to our platform, and it deals with low-level stuff that might be run very early in the boot process.

If anyone has a moment and a Pi to spare, build either or both of those commits and put it in some kind of a reboot loop to see how long it takes for CPU1 not to come up. N.B. Don't do this unless you have thought carefully about how to break out of the reboot loop - you need an exit strategy.

popcornmix commented 3 years ago

Let me see if I can find a board that fails somewhat reliably. I did run a reboot loop script a couple of days ago and it did fail but took a long time. I'll try on a few other boards in case I have a quicker one.

popcornmix commented 3 years ago

I added this just before the exit 0 in /etc/rc.local:

if [ $(nproc) -eq 4 ]; then sudo reboot; fi

But as @pelwell says, you'll need to use a Linux machine to remove that line if it doesn't fail. (If it comes up with 3 cores you will get a prompt and can edit it directly.)

pelwell commented 3 years ago

My approach is to put sh /boot/onboot.sh in /etc/rc.local so that the script lives in the FAT partition; for convenience onboot.sh reads GPIO 7 with raspi-gpio, exiting if it isn't grounded. The task-specific logic goes after that check.
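
A minimal onboot.sh along those lines might look like this. It is a sketch rather than the actual script: the GPIO level parsing, the log file name and the stop condition are assumptions.

#!/bin/sh
# Hypothetical /boot/onboot.sh for an automated reboot loop.
# Safety check: only run when GPIO 7 is grounded, so pulling the jumper
# (or editing this file from another machine) breaks out of the loop.
if ! raspi-gpio get 7 | grep -q "level=0"; then
   exit 0
fi
# Task-specific logic: record the core count, reboot while all 4 come up,
# and stop on the first boot that loses a core so it can be inspected.
echo "$(date '+%F %T') cores=$(nproc)" >> /boot/cpu-count.log
if [ "$(nproc)" -eq 4 ]; then
   reboot
fi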

popcornmix commented 3 years ago

Been rebooting for an hour across 5 Pi4 boards and so far I've not had a missing CPU... (this is without any reverts, just latest rpi-update)

pelwell commented 3 years ago

Looking through the code and the boot sequence I can't see how it can have an effect - none of the affected probe or init routines are called before the other CPUs are brought up - but just by adding that one commit I can get it to fail while without it I can't (but let's see if it survives the night).

Nope, spoke too soon - the "good" commit just failed.

philrandal commented 3 years ago

Still broken for me after the latest rpi-update. About one boot in 4 comes up with the error.

Pi 4 1GB latest firmware with boot from USB-attached SATA SSD.

pelwell commented 3 years ago

Have you not been following the conversation?

philrandal commented 3 years ago

Sorry, that was in response to popcornmix's comment.

pelwell commented 3 years ago

OK - popcornmix is definitely not questioning the reality of the problem, just stating that it can be tricky to reproduce and that there might even be a board-specific element to it.

wagnerch commented 3 years ago

Having moved to the 5.4.47 release (dec0ddc5) after failing to find a bad commit in 5.4.45, I think we have a culprit:

* d79f26f99acb is the last known good commit, and
* 2c2a2ea4d585 is the first bad commit.

Are these valid commits from raspberrypi/linux, or are they from somewhere else?

$ cat .git/refs/tags/v5.4.47
44edacf70fc991648b0dcb443a5106b17bc70e7e

pelwell commented 3 years ago

Ah - they were in rpi-5.4.y before they were rebased. Somebody who has been tracking the kernel will probably have them in their git cache, but you probably can't download them from GitHub now.

popcornmix commented 3 years ago

Anyone with a reliable failure, can you try bumping the firmware back to the last stable release while leaving the kernel at the latest? Download:

https://github.com/Hexxeh/rpi-firmware/raw/2d76ecb08cbf7a4656ac102df32a5fe448c930b1/start4.elf
https://github.com/Hexxeh/rpi-firmware/raw/2d76ecb08cbf7a4656ac102df32a5fe448c930b1/fixup4.dat

and replace the ones in /boot (make a copy of the existing ones first). vcgencmd version should report "6379679d1ec6a8c746d7e77e015f5b56b939976f" and "June 1 2020". Does the failure still occur?
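
One way to do that, as a sketch (the .bak names are just a suggestion):

cd /boot
sudo cp start4.elf start4.elf.bak
sudo cp fixup4.dat fixup4.dat.bak
sudo wget -O start4.elf https://github.com/Hexxeh/rpi-firmware/raw/2d76ecb08cbf7a4656ac102df32a5fe448c930b1/start4.elf
sudo wget -O fixup4.dat https://github.com/Hexxeh/rpi-firmware/raw/2d76ecb08cbf7a4656ac102df32a5fe448c930b1/fixup4.dat
sudo reboot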

wagnerch commented 3 years ago

Does the bootloader version have any contributing effect here?

$ vcgencmd bootloader_version
Jun 15 2020 14:36:19
version c302dea096cc79f102cec12aeeb51abf392bd781 (release)
timestamp 1592228179

philrandal commented 3 years ago

Alas, on the 4th reboot with the June 1st firmware I got the error.

pelwell commented 3 years ago

@wagnerch That is the question. I haven't yet seen any evidence that it does, but I won't rule it out.

popcornmix commented 3 years ago

Does anyone with the issue have a bootloader older than Jun 15?

popcornmix commented 3 years ago

@philrandal thanks. The evidence continues to point at a kernel change being responsible. It should be bisectable, but with boards that only occasionally fail it's easy to take a wrong turn.

If anyone with a reliable failure is able to use rpi-update to narrow down when the issue started, it would be very helpful.

pelwell commented 3 years ago

I'm currently focusing on https://github.com/Hexxeh/rpi-firmware/commit/a50c7d5eebb351d16665eabcedad992cdc167537, so starting there or either side would be great.

philrandal commented 3 years ago

I couldn't get a50c7d5 to misbehave.

Will try some more tomorrow.

popcornmix commented 3 years ago

Thanks. Fewer than 8 commits between there (good) and head (bad), so it should take just 3 more tests to bisect. da3752a3 would be a useful one to test next.

philrandal commented 3 years ago

20 reboots with da3752a and no sign of problems

popcornmix commented 3 years ago

And is 20 reboots without a problem unheard of with the apt version? (I need to check, as succeeding 100 times in a row sometimes happens on mine or @pelwell's board.)