Dasharo / dasharo-issues

The Dasharo issue tracker
https://dasharo.com/

Talos II - 0.5.0 - One CPU : Doesn't boot #80

Closed tlaurion closed 2 years ago

tlaurion commented 2 years ago

Dasharo version 0.5.0 release from https://docs.dasharo.com/variants/talos_2/releases/

Dasharo variant Workstation

Affected component(s) or functionality Memory initialization fails (1x M393A1K43BB0-CRC in B1 memory slot)

Brief summary The coreboot output stops at

coreboot-4.14-517-gc92383f92c Tue Apr 12 12:28:24 UTC 2022 bootblock starting (log level: 7)...
FFS header at 0x80060300ffff7000
PNOR base at 0x80060300fc000000
HBI partition has ECC
HBI is in 0x00426200 through 0x0175f037
FMAP: Found "FLASH" version 1.1 at 0x20000.
FMAP: base = 0x0 size = 0x200000 #areas = 4
FMAP: area COREBOOT found @ 20200 (1965568 bytes)
CBFS: mcache @0xf8231000 built for 10 files, used 0x1f0 of 0x2000 bytes
CBFS: Found 'fallback/romstage' @0x80 size 0x12421 in mcache @0xf823102c
BS: bootblock times (exec / console): total (unknown) / 2 ms

coreboot-4.14-517-gc92383f92c Tue Apr 12 12:28:24 UTC 2022 romstage starting (log level: 7)...
IPMI: romstage PNP BT 0xe4
Get BMC self test result...Function Not Implemented
Initializing IPMI BMC watchdog timer
IPMI BMC watchdog initialized and started.
Initializing FSI...
Initialized FSI (chips mask: 0x01)
Building MVPDs...
starting istep 8.1
ending istep 8.1
starting istep 8.2
ending istep 8.2
starting istep 8.3
ending istep 8.3
starting istep 8.4
ending istep 8.4
starting istep 8.9
ending istep 8.9
starting istep 8.10
ending istep 8.10
starting istep 8.11
ending istep 8.11
starting istep 9.2
ending istep 9.2
starting istep 9.4
ending istep 9.4
starting istep 9.6
ending istep 9.6
starting istep 9.7
ending istep 9.7
starting istep 10.1
ending istep 10.1
starting istep 10.6
Base epsilon values read from table:
 R_T[0] = 10
 R_T[1] = 10
 R_T[2] = 79
 W_T[0] = 0
 W_T[1] = 21
Scaled epsilon values based on +20 percent guardband:
 R_T[0] = 12
 R_T[1] = 12
 R_T[2] = 95
 W_T[0] = 0
 W_T[1] = 26
ending istep 10.6
starting istep 10.10
ending istep 10.10
starting istep 10.12
ending istep 10.12
starting istep 10.13
ending istep 10.13
FFS header at 0x80060300ffff7000
PNOR base at 0x80060300fc000000
MEMD partition has ECC
MEMD is in 0x03cef200 through 0x03cfb917
I2C transfer failed to complete (0x04011f0104000000)
No memory DIMM at address 50
I2C transfer failed to complete (0x04011f0104000000)
No memory DIMM at address 51
I2C transfer failed to complete (0x04011f0104000000)
No memory DIMM at address 53
I2C transfer failed to complete (0x04011f0104000000)
No memory DIMM at address D4
I2C transfer failed to complete (0x04011f0104000000)
No memory DIMM at address D5
I2C transfer failed to complete (0x04011f0104000000)
No memory DIMM at address D6
I2C transfer failed to complete (0x04011f0104000000)
No memory DIMM at address D7
SPD @ 0x52
SPD: module type is DDR4
SPD: module part number is M393A1K43BB0-CRC    
SPD: banks 16, ranks 1, rows 16, columns 10, density 8192 Mb
SPD: device width 8 bits, bus width 64 bits
SPD: module size is 8192 MB (per channel)

coreboot-4.14-517-gc92383f92c Tue Apr 12 12:28:24 UTC 2022 bootblock starting (log level: 7)...
FFS header at 0x80060300ffff7000
PNOR base at 0x80060300fc000000
HBI partition has ECC
HBI is in 0x00426200 through 0x0175f037
FMAP: Found "FLASH" version 1.1 at 0x20000.
FMAP: base = 0x0 size = 0x200000 #areas = 4
FMAP: area COREBOOT found @ 20200 (1965568 bytes)
CBFS: mcache @0xf8231000 built for 10 files, used 0x1f0 of 0x2000 bytes
CBFS: Found 'fallback/romstage' @0x80 size 0x12421 in mcache @0xf823102c
BS: bootblock times (exec / console): total (unknown) / 2 ms

coreboot-4.14-517-gc92383f92c Tue Apr 12 12:28:24 UTC 2022 romstage starting (log level: 7)...
IPMI: romstage PNP BT 0xe4
Get BMC self test result...Function Not Implemented
Initializing IPMI BMC watchdog timer
IPMI BMC watchdog initialized and started.
Initializing FSI...
Initialized FSI (chips mask: 0x01)
Building MVPDs...
starting istep 8.1
ending istep 8.1
starting istep 8.2
ending istep 8.2
starting istep 8.3
ending istep 8.3
starting istep 8.4
ending istep 8.4
starting istep 8.9
ending istep 8.9
starting istep 8.10
ending istep 8.10
starting istep 8.11
ending istep 8.11
starting istep 9.2
ending istep 9.2
starting istep 9.4
ending istep 9.4
starting istep 9.6
ending istep 9.6
starting istep 9.7
ending istep 9.7
starting istep 10.1
ending istep 10.1
starting istep 10.6
Base epsilon values read from table:
 R_T[0] = 10
 R_T[1] = 10
 R_T[2] = 79
 W_T[0] = 0
 W_T[1] = 21
Scaled epsilon values based on +20 percent guardband:
 R_T[0] = 12
 R_T[1] = 12
 R_T[2] = 95
 W_T[0] = 0
 W_T[1] = 26
ending istep 10.6
starting istep 10.10
ending istep 10.10
starting istep 10.12
ending istep 10.12
starting istep 10.13
ending istep 10.13
FFS header at 0x80060300ffff7000
PNOR base at 0x80060300fc000000
MEMD partition has ECC
MEMD is in 0x03cef200 through 0x03cfb917
I2C transfer failed to complete (0x04011f0104000000)
No memory DIMM at address 50
I2C transfer failed to complete (0x04011f0104000000)
No memory DIMM at address 51
I2C transfer failed to complete (0x04011f0104000000)
No memory DIMM at address 53
I2C transfer failed to complete (0x04011f0104000000)
No memory DIMM at address D4
I2C transfer failed to complete (0x04011f0104000000)
No memory DIMM at address D5
I2C transfer failed to complete (0x04011f0104000000)
No memory DIMM at address D6
I2C transfer failed to complete (0x04011f0104000000)
No memory DIMM at address D7
SPD @ 0x52
SPD: module type is DDR4
SPD: module part number is M393A1K43BB0-CRC    
SPD: banks 16, ranks 1, rows 16, columns 10, density 8192 Mb
SPD: device width 8 bits, bus width 64 bits
SPD: module size is 8192 MB (per channel)
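
The scaled epsilon values that istep 10.6 prints in the log above can be checked by hand. The sketch below is only an illustration, not the coreboot code: it assumes the +20 percent guardband is applied with rounding up, which is the rule that reproduces all five logged values. The function name `scale_epsilon` is made up for this example.

```python
def scale_epsilon(base: int, guardband_percent: int = 20) -> int:
    """Scale a base epsilon value by a percentage guardband, rounding up.

    Assumption: rounding up (ceiling) is inferred from the logged values,
    not taken from the firmware source.
    """
    scaled_hundredths = base * (100 + guardband_percent)
    # Integer ceiling division, so no floating-point rounding is involved.
    return -(-scaled_hundredths // 100)


# Base values exactly as printed by istep 10.6 in the log.
base_read = [10, 10, 79]
base_write = [0, 21]

scaled_read = [scale_epsilon(v) for v in base_read]
scaled_write = [scale_epsilon(v) for v in base_write]

print(scaled_read)   # [12, 12, 95]
print(scaled_write)  # [0, 26]
```

Plain rounding to nearest would give 25 for `W_T[1]` (21 * 1.2 = 25.2), so the logged 26 suggests the value is rounded up.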

How reproducible Always (single 16-core CPU, one 8 GB RAM module). More info on the changes that need to be documented in the non-flashing instructions is under https://github.com/Dasharo/dasharo-issues/issues/79

How to reproduce

Steps to reproduce the behavior:

On the laptop:

user@captive-portal:~$ sha256sum -c raptor-cs_talos-2_coreboot_v0.5.0.rom.signed.ecc.sha256
raptor-cs_talos-2_coreboot_v0.5.0.rom.signed.ecc: OK
user@captive-portal:~$ sha256sum -c raptor-cs_talos-2_bootblock_v0.5.0.signed.ecc.sha256
raptor-cs_talos-2_bootblock_v0.5.0.signed.ecc: OK
user@captive-portal:~$ scp *v0.5.0*.ecc root@talos:/tmp/
raptor-cs_talos-2_bootblock_v0.5.0.signed.ecc 100%   28KB   1.3MB/s   00:00    
raptor-cs_talos-2_coreboot_v0.5.0.rom.signed.ecc 100% 2309KB   3.1MB/s   00:00    

On a separate SSH connection to the BMC:

pflash -r /tmp/talos.pnor
pflash -P HBB -p /tmp/raptor-cs_talos-2_bootblock_v0.5.0.signed.ecc -F /tmp/talos.pnor
pflash -P HBI -p /tmp/raptor-cs_talos-2_coreboot_v0.5.0.rom.signed.ecc -F /tmp/talos.pnor
systemctl stop mboxd
mboxd -f 64M -w 1M -b file:/tmp/talos.pnor -v

On another separate SSH connection to the BMC:

mboxctl --lpc-state
    “LPC Bus Maps: BMC Memory”
obmcutil poweron

Expected behavior RAM init succeeds and the next steps are engaged

Actual behavior Stops at

SPD: module type is DDR4
SPD: module part number is M393A1K43BB0-CRC    
SPD: banks 16, ranks 1, rows 16, columns 10, density 8192 Mb
SPD: device width 8 bits, bus width 64 bits
SPD: module size is 8192 MB (per channel)
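
For reports like this one, a small script can pull the key facts out of a long serial log: how many times the bootblock started, the last istep that finished, and which SPD addresses responded. The following is a minimal sketch that assumes the log line formats shown above. `summarize_boot_log` is a hypothetical helper, not an existing Dasharo or coreboot tool, and the sample is an abridged excerpt of the log from this report.

```python
import re


def summarize_boot_log(log_text: str) -> dict:
    """Collect a few facts from a coreboot serial log for a bug report."""
    lines = [line.strip() for line in log_text.splitlines() if line.strip()]

    last_istep = None
    for line in lines:
        match = re.fullmatch(r"ending istep (\d+\.\d+)", line)
        if match:
            last_istep = match.group(1)

    return {
        # Every "bootblock starting" banner marks one (re)boot cycle.
        "boot_cycles": sum("bootblock starting" in line for line in lines),
        "last_completed_istep": last_istep,
        "dimm_missing": sorted(set(
            re.findall(r"No memory DIMM at address ([0-9A-Fa-f]+)", log_text))),
        "spd_found": sorted(set(
            re.findall(r"SPD @ 0x([0-9A-Fa-f]+)", log_text))),
        "last_line": lines[-1] if lines else "",
    }


# Abridged excerpt of the log in this report (two boot cycles).
sample = """\
coreboot-4.14-517-gc92383f92c Tue Apr 12 12:28:24 UTC 2022 bootblock starting (log level: 7)...
starting istep 10.13
ending istep 10.13
No memory DIMM at address 50
No memory DIMM at address D4
SPD @ 0x52
SPD: module size is 8192 MB (per channel)
coreboot-4.14-517-gc92383f92c Tue Apr 12 12:28:24 UTC 2022 bootblock starting (log level: 7)...
ending istep 10.13
SPD @ 0x52
SPD: module size is 8192 MB (per channel)
"""

summary = summarize_boot_log(sample)
for key, value in summary.items():
    print(f"{key}: {value}")
```

On the excerpt this reports two boot cycles, istep 10.13 as the last one completed, and an SPD response only at address 52, which matches the single DIMM described in the report.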
krystian-hebel commented 2 years ago

Sorry for the delay, I was busy with other projects and then with reading the provided log (almost 200k lines). Based on it, and on additional SCOM dumps performed on the platform, I have a rough idea of what happens. There were actually multiple errors (11 to be exact) reported in FIR, but some of them were probably caused by trying to load an exception handler that does not exist.

We messed up a bit during the simplification done to various isteps, like this one. As a result, the bit that says the current CPU is the only one in the system is not set. Because of that, when code tries to write to RAM, a message is sent to the other CPU telling it to flush and invalidate its L3 cache. That other CPU does not exist, so the main CPU never receives an acknowledgement for that message. With a quick and dirty fix in place (hardcoding the 1 CPU version), "only" 4 errors were reported.

Further issues were caused by misunderstanding what ATTR_PROC_FABRIC_X_LINKS_CNFG is: we thought it was the total number of X links, but it is the number of X links in use. For Nimbus this is 1 when there are 2 CPUs, 0 otherwise. With some additional hacks we got down to 2 reported errors and a slightly different manifestation of the problem: the platform no longer reboots but gets into an infinite loop instead, though this may be caused by a different layout of code with regard to cache lines.
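
The two configuration mistakes described above can be summarized in a few lines. This is only a sketch of the intended values, not the coreboot implementation: `FabricConfig`, `fabric_config` and the field names are hypothetical, and the only facts taken from this thread are that the "only CPU in the system" bit must be set on a 1 CPU platform and that the X-links-in-use count on Nimbus is 1 with 2 CPUs and 0 otherwise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FabricConfig:
    # Hypothetical name for the bit saying "this CPU is the only one".
    single_cpu: bool
    # Number of X links in use (not the total number of X links),
    # which is what ATTR_PROC_FABRIC_X_LINKS_CNFG describes.
    x_links_in_use: int


def fabric_config(installed_cpus: int) -> FabricConfig:
    """Derive both values from the number of installed CPUs (1 or 2)."""
    if installed_cpus not in (1, 2):
        raise ValueError("this sketch only covers 1 or 2 CPUs")
    return FabricConfig(
        single_cpu=(installed_cpus == 1),
        x_links_in_use=1 if installed_cpus == 2 else 0,
    )


print(fabric_config(1))
print(fabric_config(2))
```

With both values hardcoded for the 2 CPU case, a 1 CPU platform would wait for a cache-flush acknowledgement from a CPU that is not there, which is the failure described above.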

I also noticed some differences that will become important when we get to OCC initialization, but we're not that far yet. Also, if OCC were (partially) started earlier, it should be able to gather FIR SCOM data, basically what I tried to get with the scripts a few comments above. This would however require starting OCC (at least) twice, and as I hadn't thought it would be necessary, we decided to skip it in coreboot. AFAIK there is no ready-to-use tool for parsing those dumps.

Right now we're about to remove one CPU from our platform and see if we get exactly the same issue, as at least some of these problems should happen on every 1 CPU platform, if I understood it correctly.

krystian-hebel commented 2 years ago

Update: with additional changes coreboot is able to boot to ramstage; it now stops at the "No WOF table match found" issue.

The final cause was a wrong Power Bus frequency. Hostboot reads it from MVPD, which we mimicked in coreboot, but what we hadn't noticed is that the value is later overwritten with a hardcoded one. The hardcoded value is identical to the value from MVPD for our CPU, which is why it worked in the first place.
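
Why this went unnoticed on the developers' CPU can be shown with a toy model. It is a sketch under stated assumptions: the function names are hypothetical and the frequencies are made-up placeholders, not real MVPD or Hostboot values.

```python
# Placeholder number for illustration only, NOT the real hardcoded value.
HARDCODED_PB_MHZ = 2000


def pb_freq_mvpd_only(mvpd_mhz: int) -> int:
    """Mimics only the first step: use the frequency read from MVPD."""
    return mvpd_mhz


def pb_freq_with_override(mvpd_mhz: int,
                          hardcoded_mhz: int = HARDCODED_PB_MHZ) -> int:
    """Read from MVPD first, then overwrite with the hardcoded value."""
    freq = mvpd_mhz
    freq = hardcoded_mhz  # the later overwrite that was missed
    return freq


# A CPU whose MVPD value equals the hardcoded one: both variants agree.
print(pb_freq_mvpd_only(2000) == pb_freq_with_override(2000))  # True

# A CPU whose MVPD value differs: the MVPD-only variant is wrong.
print(pb_freq_mvpd_only(1866) == pb_freq_with_override(1866))  # False
```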

WOF is another issue altogether, but I think it shouldn't be too hard to fix. We just didn't think that there are processors which aren't included in the WOF table, but this is what community testing is all about: catching such corner cases.

tlaurion commented 2 years ago

The latest tests with an unreleased version supported my single CPU!

The current issue is getting the Heads payload to output to the VGA console, which seems to be missing AST+DRM in the kernel config and/or proper skiboot passed arguments.

Off-channel notes:

tlaurion commented 1 year ago

@krystian-hebel I'm searching for the tool that was developed to sit on the BMC and collect the logs without nohup. Can you tag me and point me to where it is? That should be added to a debugging page for the Talos II board.

macpijan commented 1 year ago

I think it was here: https://github.com/3mdeb/openpower-coreboot-docs/pull/74/files

Please let me know what you believe would be the best place to put it.

tlaurion commented 1 year ago

@macpijan : https://docs.dasharo.com/variants/talos_2/overview/ should at minimum have a link to https://github.com/3mdeb/openpower-coreboot-docs

But a debugging page draft would be useful, pointing directly to https://github.com/3mdeb/openpower-coreboot-docs/blob/main/devnotes/scat/README.md to facilitate bug reporting?

krystian-hebel commented 1 year ago

Maybe we should move all user-level documentation to docs.dasharo.com and leave openpower-coreboot-docs just for developer stuff like logs, istep analysis, early design considerations etc? There is some overlap between those two repositories, e.g. release info or flashing instructions. What's worse, they are slightly different...