Dasharo / dasharo-issues

The Dasharo issue tracker
https://dasharo.com/

Minnowboard Turbot's CPU soft locks at random, affecting automated tests #1000

Open wiktormowinski opened 3 months ago

wiktormowinski commented 3 months ago

Component

Dasharo firmware

Device

other

Dasharo version

v0.9.0-rc1

Dasharo Tools Suite version

No response

Test case ID

No response

Brief summary

Sometimes the Minnowboard gets really slow or even freezes, as indicated by the watchdog reporting that a CPU is stuck for X seconds.

How reproducible

Rare; about 30% of tests should hit it during a regression run.

How to reproduce

For auto: run absolutely any automated test suite. For a near-100% chance of failure I suggest CPF002.001 from dasharo-performance, because CPF002.001 lasts 1h and the soft lock can occur at any point during that time (most often around the 15 min mark).

For manual:

  1. Try logging in to the OS via serial.
  2. If you manage to do that, just stay idle for a while and the error should pop up eventually.

Expected behavior

the tests should continue uninterrupted

Actual behavior

Instead, one of two things happens:

  1. The test gets lost (80%) while:
    • looking for the OS login prompt (90%)
    • looking for any checkpoint phrase (10%)
  2. The CPU gets stuck and you get a message from the watchdog (20%):
     watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [systemd-udevd:120]

I am almost certain, though, that both of these stem from the very same problem.
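To tell the two failure modes apart in a captured serial log, the watchdog line can be matched with a regular expression. The following is only a minimal sketch: the names `SoftLockup` and `find_soft_lockup` are hypothetical, and the pattern is based solely on the single message quoted in this report, so other kernel versions may format it differently.

```python
import re
from typing import NamedTuple, Optional

# Pattern derived from the one message quoted in this issue:
#   watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [systemd-udevd:120]
SOFT_LOCKUP_RE = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^\]]+):(?P<pid>\d+)\]"
)


class SoftLockup(NamedTuple):
    """Fields extracted from one soft lockup report (hypothetical helper type)."""
    cpu: int
    seconds: int
    task: str
    pid: int


def find_soft_lockup(line: str) -> Optional[SoftLockup]:
    """Return the parsed soft lockup report in `line`, or None if there is none."""
    match = SOFT_LOCKUP_RE.search(line)
    if match is None:
        return None
    return SoftLockup(
        cpu=int(match["cpu"]),
        seconds=int(match["secs"]),
        task=match["task"],
        pid=int(match["pid"]),
    )
```

A test harness could run each serial line through `find_soft_lockup` and report a lockup explicitly, instead of timing out while waiting for a login prompt or checkpoint phrase.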

Screenshots

No response

Additional context

My CPF002.001 attempts are documented in: faile.zip

Solutions you've tried

No response

miczyg1 commented 3 months ago

Hard to say what may be going wrong. It needs further investigation, especially since it can be hard to reproduce, e.g. only showing up after 1h.

wiktormowinski commented 2 months ago

After the initial fixes to the fw, which focused on other issues, it seems like this one got fixed too. You can boot to Ubuntu and it doesn't soft lock after a minute (which used to be a common occurrence, if not quicker).

filipleple commented 2 months ago

It used to be pretty much unbearable, allowing the platform to be used for about 30 s at most. After building from the byt_fsp_parity branch, however, it hasn't occurred during a while of normal use. I will run lengthy performance/stability tests today to confirm whether this has been completely resolved.

EDIT: that's only true for the SB binary. The issue still persists after building the non-SB config.

wiktormowinski commented 2 months ago

thanks for confirming

miczyg1 commented 1 week ago

The issue does not happen in a deterministic way. Sometimes the CPU soft-locks while the system is booting, sometimes after it has been running for a few seconds or minutes. Printing the cbmem console or dmesg on the serial console helps trigger the issue a little faster if it doesn't happen right off the bat. Some platforms were not affected by the issue (mainly quad-core platforms).
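The log-printing trick could be scripted so it runs unattended on the target. This is a rough sketch under assumptions, not a verified reproducer: it assumes Python, `dmesg`, and coreboot's `cbmem` utility (`cbmem -c` dumps the console) are available on the board and that the script is started from the serial login. The function names and the one-second delay are made up for illustration.

```python
import subprocess
import time
from typing import Callable, Sequence, Tuple

Command = Tuple[str, ...]

# Assumed commands: kernel log and coreboot console dump (cbmem usually needs root).
DEFAULT_COMMANDS: Tuple[Command, ...] = (("dmesg",), ("cbmem", "-c"))


def run_command(cmd: Command) -> None:
    """Run one command; its output goes to the current (serial) terminal."""
    try:
        subprocess.run(list(cmd), check=False)
    except FileNotFoundError:
        print(f"{cmd[0]} not found, skipping")


def print_logs_repeatedly(
    rounds: int,
    commands: Sequence[Command] = DEFAULT_COMMANDS,
    run: Callable[[Command], None] = run_command,
    delay_s: float = 1.0,
) -> int:
    """Print every log `rounds` times; return the number of commands issued."""
    issued = 0
    for _ in range(rounds):
        for cmd in commands:
            run(cmd)
            issued += 1
        time.sleep(delay_s)  # brief pause between rounds
    return issued
```

Calling `print_logs_repeatedly(100)` from a serial session would keep the console busy for a while; the `run` parameter exists only so the loop can be exercised without touching a real board.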

I have analyzed the Bay Trail FSP source and compared it against the Bay Trail native silicon init in coreboot, and haven't found any major problems. A couple of things caught my eye regarding CPU P/C states, which I fixed per BWG; however, it didn't help. The work is in a WIP PR: https://github.com/Dasharo/coreboot/pull/575

Now that I am thinking about it, maybe it is some issue with the C6 state and the C6 DRAM which ought to be reserved for it. That would imply some difference between the MRC binary and FSP memory init.

coreboot 4.11 (the latest version which still had FSP Bay Trail support) did not have the problem, so it may be related to the MRC binary not doing something that should be done.