Hexxeh / rpi-firmware

Firmware files for the Raspberry Pi

Raspberry Pi 2B fail to boot occasionally CPUx: failed to come online #253

Open antoniosk opened 3 years ago

antoniosk commented 3 years ago

Kernel version: Linux P1 5.4.83-v7+ #1379 SMP Mon Dec 14 13:08:57 GMT 2020 armv7l GNU/Linux

The first Pi runs the "lite" Raspbian Buster image and the second the "desktop and recommended software". Both Raspberries boot to the console and all the installed packages come from Raspbian repositories.

Since last December, however, both Pis occasionally fail to boot. The network does not come up and, when I connect a monitor, the devices are frozen at the login: prompt. When I unplug and replug the Pis, everything is back to normal.

While investigating this issue I found that on every unsuccessful boot a CPU core does not come up. The following is logged in kern.log:

On an unsuccessful boot:

Jan 18 21:42:08 PI kernel: [ 0.007635] smp: Bringing up secondary CPUs ...
Jan 18 21:42:08 PI kernel: [ 1.040987] CPU1: failed to come online
Jan 18 21:42:08 PI kernel: [ 1.042804] CPU2: update cpu_capacity 1024
Jan 18 21:42:08 PI kernel: [ 1.042816] CPU2: thread -1, cpu 2, socket 15, mpidr 80000f02
Jan 18 21:42:08 PI kernel: [ 1.044511] CPU3: update cpu_capacity 1024
Jan 18 21:42:08 PI kernel: [ 1.044524] CPU3: thread -1, cpu 3, socket 15, mpidr 80000f03
Jan 18 21:42:08 PI kernel: [ 1.044740] smp: Brought up 1 node, 3 CPUs
Jan 18 21:42:08 PI kernel: [ 1.044866] SMP: Total of 3 processors activated (115.20 BogoMIPS).

On a successful boot:

Jan 19 17:00:46 PI kernel: [ 0.007643] smp: Bringing up secondary CPUs ...
Jan 19 17:00:46 PI kernel: [ 0.009263] CPU1: update cpu_capacity 1024
Jan 19 17:00:46 PI kernel: [ 0.009276] CPU1: thread -1, cpu 1, socket 15, mpidr 80000f01
Jan 19 17:00:46 PI kernel: [ 0.011320] CPU2: update cpu_capacity 1024
Jan 19 17:00:46 PI kernel: [ 0.011333] CPU2: thread -1, cpu 2, socket 15, mpidr 80000f02
Jan 19 17:00:46 PI kernel: [ 0.012983] CPU3: update cpu_capacity 1024
Jan 19 17:00:46 PI kernel: [ 0.012995] CPU3: thread -1, cpu 3, socket 15, mpidr 80000f03
Jan 19 17:00:46 PI kernel: [ 0.013205] smp: Brought up 1 node, 4 CPUs
Jan 19 17:00:46 PI kernel: [ 0.013333] SMP: Total of 4 processors activated (153.60 BogoMIPS).
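
The two outcomes above differ in only two kinds of kernel message, so they can be told apart mechanically. The following is a minimal Python sketch, not something posted in this thread: the function name `summarize_smp_bringup` and the regular expressions are assumptions written to match the message text quoted above.

```python
import re

# Matches kernel messages such as "CPU1: failed to come online".
FAILED_RE = re.compile(r"CPU(\d+): failed to come online")
# Matches the summary message "smp: Brought up 1 node, 3 CPUs".
BROUGHT_UP_RE = re.compile(r"smp: Brought up \d+ nodes?, (\d+) CPUs?")


def summarize_smp_bringup(log_text):
    """Return (failed_cpu_ids, cpus_brought_up) for one boot's kernel log text.

    cpus_brought_up is None if the summary message is not present.
    """
    failed = [int(m.group(1)) for m in FAILED_RE.finditer(log_text)]
    summary = BROUGHT_UP_RE.search(log_text)
    brought_up = int(summary.group(1)) if summary else None
    return failed, brought_up


# Three messages copied from the unsuccessful boot quoted above.
sample = """\
Jan 18 21:42:08 PI kernel: [ 0.007635] smp: Bringing up secondary CPUs ...
Jan 18 21:42:08 PI kernel: [ 1.040987] CPU1: failed to come online
Jan 18 21:42:08 PI kernel: [ 1.044740] smp: Brought up 1 node, 3 CPUs
"""
print(summarize_smp_bringup(sample))  # prints ([1], 3)
```

Feeding it the text of one boot from kern.log should be enough to see which core, if any, was skipped. Splitting a multi-boot log into individual boots is left out of this sketch.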

I don't know if this issue is related to issue #232 "CPU1: failed to come online with 5.4.51-v7l+", but I had no such problems with kernel 5.4.51.

Thank you in advance and hope you are all well and safe!

clivem commented 3 years ago

I posted about this several times in the forum "Moving Linux Kernel to 5.10" thread. First thought it was something new with the 5.10.x kernel, which I was testing at the time, until I saw it with Pi2 and 5.4.83-v7+ kernel.

It seems this isn't new behaviour with 5.10. Just witnessed it on a Pi2 with "official" stable apt 5.4.83 kernel.

pelwell commented 3 years ago

Have a read through this issue for some history: https://github.com/Hexxeh/rpi-firmware/issues/232

So far it seems like a problem in the CPUs that only appears before the caches are enabled. There is nothing wrong with the code being executed, but sometimes it doesn't work as it should. Code placement might be a factor, otherwise I can think of no explanation why some builds are affected and not others. The fact that the failure is probabilistic rather than guaranteed only makes it harder to diagnose.

antoniosk commented 3 years ago

Hard to diagnose indeed. I am also using a 4GB Pi4 with the official Raspbian Buster and all the updates installed as a secondary desktop without any problem so far.

On my Pi2 B I can confirm that:

1) The problem appears in 1 out of 7 to 9 reboots / cold boots.
2) Only three raspberry images appear on the screen during an unsuccessful boot.
3) Usually CPU1 fails. CPU2 failed only once, on one of my Pi2 Bs.
4) The problem first appeared in December. According to apt history.log, the kernel (raspberrypi-kernel:armhf (1.20201126-1, 1.20201201-1)) was updated on 4th of December 2020. Unfortunately, the oldest kern.log records date from the beginning of January 2021 (log retention has now been changed to cover a longer period).

I can provide any other information (e.g. log files), should you need it.

clivem commented 3 years ago

[    0.000000] Linux version 5.10.11-v7+ (dom@buildbot) (arm-linux-gnueabihf-gcc-8 (Ubuntu/Linaro 8.4.0-3ubuntu1) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #1399 SMP Thu Jan 28 12:06:05 GMT 2021
[    0.000000] OF: fdt: Machine model: Raspberry Pi 2 Model B Rev 1.1
[    1.040915] CPU2: failed to come online

pelwell commented 3 years ago

Oh good. Having just fixed an interesting I2C bug I was looking for another rabbit hole to disappear down.

antoniosk commented 3 years ago

That's interesting. I updated to kernel 5.10.11-v7+ on Thursday and the freeze problem after boot seems to be fixed. I made around 30 reboots from ssh without issue, but I also noticed the following line in kern.log:

Feb 5 13:37:17 PI kernel: [ 1.040913] CPU2: failed to come online

As I am unaware of Raspberry Pi internals such as revision numbers, variants etc that may be relevant to the issue, I am posting some information from /proc/cpuinfo which applies to both devices I own:

Hardware : BCM2835
Revision : a01041
Model : Raspberry Pi 2 Model B Rev 1.1

CPU architecture: 7
CPU variant : 0x0
CPU part : 0xc07
CPU revision : 5
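
For readers who want to interpret the Revision value, the sketch below unpacks it into its bit fields. It is only a sketch under stated assumptions: the field layout and the two lookup tables are reproduced from memory of the Raspberry Pi documentation on new-style revision codes and should be checked against that documentation. The name `decode_revision` is hypothetical.

```python
# Assumed lookup tables (from memory of the Raspberry Pi revision-code
# documentation; verify before relying on them).
PROCESSORS = {0: "BCM2835", 1: "BCM2836", 2: "BCM2837", 3: "BCM2711"}
MEMORY = {0: "256MB", 1: "512MB", 2: "1GB", 3: "2GB", 4: "4GB", 5: "8GB"}


def decode_revision(code_hex):
    """Split a new-style revision code into its (assumed) bit fields."""
    code = int(code_hex, 16)
    if not (code >> 23) & 1:
        # Bit 23 clear means an old-style code, which has no bit fields.
        raise ValueError("old-style revision code; bit fields do not apply")
    memory = (code >> 20) & 0x7
    processor = (code >> 12) & 0xF
    return {
        "memory": MEMORY.get(memory, memory),
        "manufacturer_field": (code >> 16) & 0xF,
        "processor": PROCESSORS.get(processor, processor),
        "type_field": (code >> 4) & 0xFF,
        "board_revision": code & 0xF,
    }


print(decode_revision("a01041"))
```

Under these assumptions, a01041 decodes to processor field 1 (BCM2836 in the table above), 1GB of memory, type field 4 and board revision 1, which is consistent with the Model line "Raspberry Pi 2 Model B Rev 1.1" even though the Hardware line reads BCM2835.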

I will repeat the boot test from the console during the weekend checking kern.log for each reboot.

antoniosk commented 3 years ago

Kernel 5.10.11-v7+ made it harder to reproduce. Here are my results:

1) The failure appeared in 1 out of 30 to 50 reboots/cold boots.
2) If CPU1 fails, network and keyboard do not work, making the Pi appear to "freeze". The term "freeze" is not exactly accurate; I assume that the USB hub depends on CPU1 and if CPU1 is down, the hub does not work.
3) If CPU2 fails, the USB hub (network and keyboard) works and /proc/cpuinfo reports only the 3 working cores 0, 1 and 3.
4) Only CPU1 or CPU2 failed during my tests.

As a workaround for (2) and based on (3), I use a small shell script to check the number of CPU cores in /proc/cpuinfo on every boot. If this number is less than 4, the script reboots the Pi.
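
The script itself was not posted. As an illustration only, the same check could look roughly like the Python sketch below. The names `count_cores` and `check_and_reboot`, the `dry_run` switch, and the use of `systemctl reboot` are all assumptions rather than the author's method.

```python
import subprocess

EXPECTED_CORES = 4  # the thread reports 4 CPUs on a successful Pi 2 B boot


def count_cores(cpuinfo_text):
    """Count the 'processor : N' entries, one per core that came online."""
    return sum(
        1
        for line in cpuinfo_text.splitlines()
        if line.split(":")[0].strip() == "processor"
    )


def check_and_reboot(cpuinfo_path="/proc/cpuinfo", dry_run=True):
    """Return True if a reboot is needed; trigger it unless dry_run is set."""
    with open(cpuinfo_path) as f:
        cores = count_cores(f.read())
    if cores >= EXPECTED_CORES:
        return False
    if not dry_run:
        # Assumes a systemd-based image and root privileges.
        subprocess.run(["systemctl", "reboot"], check=True)
    return True
```

Calling `check_and_reboot(dry_run=False)` once at boot, for example from a root cron `@reboot` entry, would approximate the described behaviour. Because the failure is intermittent, a retry normally succeeds, but a core that never came up would cause a reboot loop, so a retry limit may be worth adding.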