Dasharo / dasharo-issues

The Dasharo issue tracker
https://dasharo.com/

Talos II: one CPU, one RAM module, 0.4.1 release - SCOM stopped working #81

Closed: tlaurion closed this issue 2 years ago

tlaurion commented 2 years ago

Dasharo version: 0.4.1

Dasharo variant: Workstation, one CPU, one RAM module. Bootblock + coreboot 0.4.1 release, booted using the no-flashing instructions per #79.

Affected component(s) or functionality: SCOM stopped working

Brief summary: SCOM stops working after istep 14.5

How reproducible: Always, when booting without flashing (testing per #79)

How to reproduce:

On the laptop:

user@captive-portal:~$ wget https://3mdeb.com/open-source-firmware/Dasharo/raptor-cs_talos-2/dasharo_talos_2_bootblock_v0.4.1.signed.ecc
--2022-04-26 16:54:29--  https://3mdeb.com/open-source-firmware/Dasharo/raptor-cs_talos-2/dasharo_talos_2_bootblock_v0.4.1.signed.ecc
Resolving 3mdeb.com (3mdeb.com)... 178.32.205.96
Connecting to 3mdeb.com (3mdeb.com)|178.32.205.96|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28674 (28K)
Saving to: ‘dasharo_talos_2_bootblock_v0.4.1.signed.ecc’

dasharo_talos_2_bootblock_v0.4.1.sign 100%[=======================================================================>]  28.00K  --.-KB/s    in 0.1s    

2022-04-26 16:54:29 (234 KB/s) - ‘dasharo_talos_2_bootblock_v0.4.1.signed.ecc’ saved [28674/28674]

user@captive-portal:~$ wget https://3mdeb.com/open-source-firmware/Dasharo/raptor-cs_talos-2/dasharo_talos_2_bootblock_v0.4.1.signed.ecc.sha256
--2022-04-26 16:54:43--  https://3mdeb.com/open-source-firmware/Dasharo/raptor-cs_talos-2/dasharo_talos_2_bootblock_v0.4.1.signed.ecc.sha256
Resolving 3mdeb.com (3mdeb.com)... 178.32.205.96
Connecting to 3mdeb.com (3mdeb.com)|178.32.205.96|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 110
Saving to: ‘dasharo_talos_2_bootblock_v0.4.1.signed.ecc.sha256’

dasharo_talos_2_bootblock_v0.4.1.sign 100%[=======================================================================>]     110  --.-KB/s    in 0s      

2022-04-26 16:54:44 (9.53 MB/s) - ‘dasharo_talos_2_bootblock_v0.4.1.signed.ecc.sha256’ saved [110/110]

user@captive-portal:~$ wget https://3mdeb.com/open-source-firmware/Dasharo/raptor-cs_talos-2/dasharo_talos_2_coreboot_v0.4.1.rom.signed.ecc
--2022-04-26 16:54:57--  https://3mdeb.com/open-source-firmware/Dasharo/raptor-cs_talos-2/dasharo_talos_2_coreboot_v0.4.1.rom.signed.ecc
Resolving 3mdeb.com (3mdeb.com)... 178.32.205.96
Connecting to 3mdeb.com (3mdeb.com)|178.32.205.96|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2363904 (2.3M)
Saving to: ‘dasharo_talos_2_coreboot_v0.4.1.rom.signed.ecc’

dasharo_talos_2_coreboot_v0.4.1.rom.s 100%[=======================================================================>]   2.25M   509KB/s    in 4.6s    

2022-04-26 16:55:02 (503 KB/s) - ‘dasharo_talos_2_coreboot_v0.4.1.rom.signed.ecc’ saved [2363904/2363904]

user@captive-portal:~$ wget https://3mdeb.com/open-source-firmware/Dasharo/raptor-cs_talos-2/dasharo_talos_2_coreboot_v0.4.1.rom.signed.ecc.sha256
--2022-04-26 16:55:08--  https://3mdeb.com/open-source-firmware/Dasharo/raptor-cs_talos-2/dasharo_talos_2_coreboot_v0.4.1.rom.signed.ecc.sha256
Resolving 3mdeb.com (3mdeb.com)... 178.32.205.96
Connecting to 3mdeb.com (3mdeb.com)|178.32.205.96|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 113
Saving to: ‘dasharo_talos_2_coreboot_v0.4.1.rom.signed.ecc.sha256’

dasharo_talos_2_coreboot_v0.4.1.rom.s 100%[=======================================================================>]     113  --.-KB/s    in 0s      

2022-04-26 16:55:09 (29.2 MB/s) - ‘dasharo_talos_2_coreboot_v0.4.1.rom.signed.ecc.sha256’ saved [113/113]

user@captive-portal:~$ sha256sum -c *0.4.1*.sha256
dasharo_talos_2_bootblock_v0.4.1.signed.ecc: OK
dasharo_talos_2_coreboot_v0.4.1.rom.signed.ecc: OK
user@captive-portal:~$ scp  *0.4.1*.ecc root@talos:/tmp/
root@talos's password: 
dasharo_talos_2_bootblock_v0.4.1.signed.ecc                                                                         100%   28KB   1.1MB/s   00:00    
dasharo_talos_2_coreboot_v0.4.1.rom.signed.ecc                                                                      100% 2309KB   3.5MB/s   00:00

First SSH session to BMC:

root@talos:~# systemctl start mboxd   
root@talos:~# mboxctl --lpc-state
LPC Bus Maps: Flash Device
root@talos:~# pflash -r /tmp/talos.pnor
Reading to "/tmp/talos.pnor" from 0x00000000..0x04000000 !
[==================================================] 100% ETA:0s    
root@talos:~# pflash -P HBI -p /tmp/dasharo_talos_2_coreboot_v0.4.1.rom.signed.ecc -F /tmp/talos.pnor
About to program "/tmp/dasharo_talos_2_coreboot_v0.4.1.rom.signed.ecc" at 0x00425000..0x00666200 !
WARNING ! This will modify your HOST flash chip content !
Enter "yes" to confirm:yes
Programming & Verifying...
[==================================================] 100%
Updating actual size in partition header...
(reverse-i-search)`pfl': ^Clash -r /tmp/talos.pnor
root@talos:~# pflash -P HBB -p /tmp/dasharo_talos_2_ -F /tmp/talos.pnor
dasharo_talos_2_bootblock_v0.4.1.signed.ecc     dasharo_talos_2_coreboot_v0.4.1.rom.signed.ecc  
root@talos:~# pflash -P HBB -p /tmp/dasharo_talos_2_bootblock_v0.4.1.signed.ecc -F /tmp/talos.pnor
About to program "/tmp/dasharo_talos_2_bootblock_v0.4.1.signed.ecc" at 0x00205000..0x0020c002 !
WARNING ! This will modify your HOST flash chip content !
Enter "yes" to confirm:yes
Programming & Verifying...
[==================================================] 100%
root@talos:~# obmcutil poweroff
root@talos:~# systemctl stop mboxd
root@talos:~# mboxd -f 64M -w 1M -b file:/tmp/talos.pnor -v
[ 1651006813.456922643] Flash size: 0x04000000
[ 1651006813.459255206] Verbose logging
[ 1651006813.460769547] Starting Daemon
[ 1651006813.465250471] Window size: 0x00100000
[ 1651006813.467833741] Number of windows: 64
[ 1651006814.106675328] Pointing HOST LPC bus at memory region 0x72f55000 of size 0x04000000
[ 1651006814.108474801] LPC address 0x0c000000
[ 1651006814.119347272] Entering Polling Loop

Second SSH session to BMC:

root@talos:~# mboxctl --lpc-state
LPC Bus Maps: BMC Memory
root@talos:~# obmcutil poweron

Expected behavior: SCOM keeps working and the boot continues with the next isteps.

Actual behavior:

coreboot-4.14-387-g7258fa59c0 Thu Dec  9 13:44:40 UTC 2021 bootblock starting (log level: 7)...
FFS header at 0x80060300ffff7000
PNOR base at 0x80060300fc000000
HBI partition has ECC
HBI is in 0x00426200 through 0x0175f037
FMAP: Found "FLASH" version 1.1 at 0x20000.
FMAP: base = 0x0 size = 0x200000 #areas = 4
FMAP: area COREBOOT found @ 20200 (1965568 bytes)
CBFS: mcache @0x00031000 built for 8 files, used 0x1a8 of 0x2000 bytes
CBFS: Found 'fallback/romstage' @0x80 size 0xe6e6 in mcache @0x0003102c
BS: bootblock times (exec / console): total (unknown) / 2 ms

coreboot-4.14-387-g7258fa59c0 Thu Dec  9 13:44:40 UTC 2021 romstage starting (log level: 7)...
IPMI: romstage PNP BT 0xe4
Get BMC self test result...Function Not Implemented
Initializing IPMI BMC watchdog timer
IPMI BMC watchdog initialized and started.
starting istep 10.10
ending istep 10.10
starting istep 10.12
ending istep 10.12
starting istep 10.13
ending istep 10.13
FFS header at 0x80060300ffff7000
PNOR base at 0x80060300fc000000
MEMD partition has ECC
MEMD is in 0x03cef200 through 0x03cfb917
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address 50
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address 51
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address 53
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address D4
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address D5
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address D6
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address D7
SPD @ 0x52
SPD: module type is DDR4
SPD: module part number is M393A1K43BB0-CRC    
SPD: banks 16, ranks 1, rows 16, columns 10, density 8192 Mb
SPD: device width 8 bits, bus width 64 bits
SPD: module size is 8192 MB (per channel)
starting istep 13.2
ending istep 13.2
starting istep 13.3
putRing took 3 ms
ending istep 13.3
starting istep 13.4
ending istep 13.4
starting istep 13.6
ending istep 13.6
starting istep 13.8
Base epsilon values read from table:
 R_T[0] = 10
 R_T[1] = 10
 R_T[2] = 79
 W_T[0] = 0
 W_T[1] = 21
Scaled epsilon values based on +20 percent guardband:
 R_T[0] = 12
 R_T[1] = 12
 R_T[2] = 95
 W_T[0] = 0
 W_T[1] = 26
Please FIXME: ATTR_MSS_RUNTIME_MEM_THROTTLED_N_COMMANDS_PER_SLOT
ending istep 13.8
starting istep 13.9
Error detected in IOM_PHY0_DDRPHY_FIR_REG: 0x80
ending istep 13.9
starting istep 13.10
CCS took 2 us (3 us timeout), 1 instruction(s)
CCS took 2 us (2 us timeout), 14 instruction(s)
RCD dump for I2C address 0x5a:
0x0000faa0: 80 b3 40 42 30 00 00 00 02 01 00 03 cb e3 c0 0d  ..@B0...........
0x0000fab0: 00 00 39 00 00 00 00 00 00 00 07 00 00 00 00 00  ..9.............
ending istep 13.10
starting istep 13.11
CCS took 2 us (7 us timeout), 2 instruction(s)
Write Leveling starting
CCS took 13 us (92 us timeout), 5 instruction(s)
Write Leveling done
Initial Pattern Write starting
CCS took 6 us (38 us timeout), 5 instruction(s)
Initial Pattern Write done
DQS alignment starting
CCS took 11 us (44 us timeout), 1 instruction(s)
DQS alignment done
Read Clock Alignment starting
CCS took 8 us (82 us timeout), 1 instruction(s)
Read Clock Alignment done
Read Centering starting
CCS took 36 us (120 us timeout), 1 instruction(s)
Read Centering done
Write Centering starting
CCS took 8336 us (11314 us timeout), 7 instruction(s)
Write Centering done
Coarse write/read starting
CCS took 5 us (24 us timeout), 1 instruction(s)
Coarse write/read done
MCS0 MCA1 DIMM0 has 0 bad nibble(s) and 0 bad bit(s), but can be recovered
ending istep 13.11
starting istep 13.13
ending istep 13.13
starting istep 14.1
MCBIST0 took 616785 us
ending istep 14.1
starting istep 14.3
Initializing PEC0...
Initializing PEC1...
Initializing PEC2...
Initializing PHB0...
Initializing PHB1...
Initializing PHB2...
Initializing PHB3...
Initializing PHB4...
Initializing PHB5...
ending istep 14.3
starting istep 14.5
ending istep 14.5
0xF000F = 221d104900008040
SCOM stopped working, check FIRs, halting now

tlaurion commented 2 years ago

Will redo this one; I might not have done this part correctly:

root@talos:~# pflash -P HBI -p /tmp/dasharo_talos_2_coreboot_v0.4.1.rom.signed.ecc -F /tmp/talos.pnor
About to program "/tmp/dasharo_talos_2_coreboot_v0.4.1.rom.signed.ecc" at 0x00425000..0x00666200 !
WARNING ! This will modify your HOST flash chip content !
Enter "yes" to confirm:yes
Programming & Verifying...
[==================================================] 100%
Updating actual size in partition header...
(reverse-i-search)`pfl': ^Clash -r /tmp/talos.pnor
root@talos:~# pflash -P HBB -p /tmp/dasharo_talos_2_ -F /tmp/talos.pnor
dasharo_talos_2_bootblock_v0.4.1.signed.ecc     dasharo_talos_2_coreboot_v0.4.1.rom.signed.ecc  
root@talos:~# pflash -P HBB -p /tmp/dasharo_talos_2_bootblock_v0.4.1.signed.ecc -F /tmp/talos.pnor
About to program "/tmp/dasharo_talos_2_bootblock_v0.4.1.signed.ecc" at 0x00205000..0x0020c002 !
WARNING ! This will modify your HOST flash chip content !
Enter "yes" to confirm:yes
Programming & Verifying...
[==================================================] 100%
Updating actual size in partition header...

krystian-hebel commented 2 years ago

This should be fixed in v0.5.0. Here we forgot to remove the hardcoded CPU version, sorry about that...

And by the way, thanks for testing!

tlaurion commented 2 years ago

Just for clarity:


2. We need to manipulate instructions from “Flash the binaries by replacing HBB partition” above:
    1. HBB partition is bootblock per instructions
    2. HBI partition is coreboot per instructions
    3. BOOTKERNEL partition is either Petitboot or Heads per instructions
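
A dry-run sketch of that mapping, assuming the file names and `pflash` flags used in the sessions in this thread. `print_flash_plan` is a hypothetical helper: it only prints the commands and does not run them. BOOTKERNEL is left out because no image name for it appears here.

```shell
#!/bin/sh
# Dry run: print (do not execute) the pflash commands for the two
# partitions discussed above. print_flash_plan is a hypothetical name.
print_flash_plan() {
    dir=$1      # directory holding the downloaded *.ecc files
    pnor=$2     # PNOR image previously read with "pflash -r"
    # HBB holds the bootblock, HBI holds the coreboot ROM.
    echo "pflash -P HBB -p $dir/dasharo_talos_2_bootblock_v0.4.1.signed.ecc -F $pnor"
    echo "pflash -P HBI -p $dir/dasharo_talos_2_coreboot_v0.4.1.rom.signed.ecc -F $pnor"
}

print_flash_plan /tmp /tmp/talos.pnor
```

Printing the plan first makes it easy to confirm that each partition name is paired with the intended file before anything is written to the image.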

Redid test:

user@captive-portal:~$ ssh -l root talos
root@talos's password: 
root@talos:~# mboxctl --lpc-state
LPC Bus Maps: BMC Memory
root@talos:~# systemctl stop mboxd
root@talos:~# systemctl start mboxd
root@talos:~# mboxctl --lpc-state
LPC Bus Maps: Flash Device
root@talos:~# systemctl stop mboxd
root@talos:~# mboxctl --lpc-state
root@talos:~# systemctl start mboxd
root@talos:~# mboxctl --lpc-state
LPC Bus Maps: Flash Device
root@talos:~# pflash -r /tmp/talos.pnor
Reading to "/tmp/talos.pnor" from 0x00000000..0x04000000 !
[==================================================] 100% ETA:0s     
root@talos:~# pflash -P HBB -p /tmp/dasharo_talos_2_bootblock_v0.4.1.signed.ecc -F /tmp/talos.pnor
About to program "/tmp/dasharo_talos_2_bootblock_v0.4.1.signed.ecc" at 0x00205000..0x0020c002 !
WARNING ! This will modify your HOST flash chip content !
Enter "yes" to confirm:yes
Programming & Verifying...
[==================================================] 100%
Updating actual size in partition header...
root@talos:~# pflash -P HBI -p /tmp/dasharo_talos_2_coreboot_v0.4.1.rom.signed.ecc -F /tmp/talos.pnor
About to program "/tmp/dasharo_talos_2_coreboot_v0.4.1.rom.signed.ecc" at 0x00425000..0x00666200 !
WARNING ! This will modify your HOST flash chip content !
Enter "yes" to confirm:yes
Programming & Verifying...
[==================================================] 100%
Updating actual size in partition header...
root@talos:~# mboxctl --lpc-state
LPC Bus Maps: Flash Device
root@talos:~# mboxctl --backend file:/tmp/talos.pnor
SetBackend: Success
root@talos:~# mboxctl --lpc-state
LPC Bus Maps: BMC Memory
root@talos:~# obmcutil poweron

Same result

miczyg1 commented 2 years ago

We will clarify the regions in the documentation as part of https://github.com/Dasharo/dasharo-issues/issues/79. According to @krystian-hebel's comment, the 0.4.1 release has no chance of working with a DD2.1 CPU:

This should be fixed in v0.5.0. Here we forgot to remove hardcoded CPU version, sorry about that...

And by the way, thanks for testing!

So this qualifies to be closed. @krystian-hebel let's backport the fix and confirm, then close. If anything similar appears again, @tlaurion, please open a new issue with the new affected version.

krystian-hebel commented 2 years ago

Interestingly enough, the fix is already present in 0.4.1: https://github.com/Dasharo/coreboot/blame/raptor-cs_talos-2/rel_v0.4.1/src/soc/ibm/power9/romstage.c#L399

However, the commit hash reported by the binaries ("coreboot-4.14-387-g7258fa59c0") doesn't match anything in the tree, so whoever produced those did something strange... @pietrushnic @macpijan @IgorBagnucki CC
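
One way to check a console log against a source tree is to pull the abbreviated hashes out of the stage banners and ask git whether each one names a commit. This is a minimal sketch, not the procedure used by the maintainers: `banner_hashes` and `check_hashes` are hypothetical helper names, and the banner layout is assumed to match the lines quoted in this thread.

```shell
#!/bin/sh
# List the distinct abbreviated commit hashes announced in coreboot stage
# banners ("coreboot-<tag>-<n>-g<hash> <date> <stage> starting ...").
banner_hashes() {
    sed -n 's/^coreboot-[0-9.]*-[0-9]*-g\([0-9a-f]*\) .* starting .*/\1/p' "$1" | sort -u
}

# For each hash found in log file $1, report whether it names a commit in
# the git checkout at $2.
check_hashes() {
    for h in $(banner_hashes "$1"); do
        if git -C "$2" rev-parse --verify --quiet "$h^{commit}" >/dev/null 2>&1; then
            echo "$h found"
        else
            echo "$h missing"
        fi
    done
}
```

For example, `check_hashes obmc-console.log /path/to/coreboot` would print one `found` or `missing` line per distinct hash; the checkout path is a placeholder.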

@tlaurion please try with these: https://cloud.3mdeb.com/index.php/s/MSLKxazwKsCoi68

tlaurion commented 2 years ago

@krystian-hebel: The boot loop seems to stop after the 5th manual obmcutil poweron. The Talos is still hanging there as I write this.
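
To put a number on the boot loop, the bootblock banners in a saved console log can be counted. A rough sketch, assuming each boot attempt prints exactly one `bootblock starting` banner as in the excerpt that follows; the helper names are hypothetical.

```shell
#!/bin/sh
# Count boot attempts in a saved console log by counting bootblock banners.
count_boot_attempts() {
    grep -c 'bootblock starting' "$1"
}

# Print the stage name (bootblock, romstage or ramstage) from the last
# coreboot banner in the log, i.e. the stage the final attempt reached.
last_stage() {
    sed -n 's/^coreboot-.* \([a-z]*\) starting (log level.*/\1/p' "$1" | tail -n 1
}
```

Running both against a copy of `/var/log/obmc-console.log` gives the number of restarts and the stage where the last attempt stopped producing banners.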

With the ROM currently under test, from https://cloud.3mdeb.com/index.php/s/MSLKxazwKsCoi68:

root@talos:~# cat /var/log/obmc-console.log (excerpt):

C2...
Initializing PHB0...
Initializing PHB1...
Initializing PHB2...
Initializing PHB3...
Initializing PHB4...
Initializing PHB5...
ending istep 14.3
starting istep 14.5
ending istep 14.5
0xF000F = 221d104900008040
CBMEM:
IMD: root @ 0xffeff000 254 entries.
IMD: root @ 0xffefec00 62 entries.
FMAP: area COREBOOT found @ 20200 (1965568 bytes)
FFS header at 0x80060300ffff7000
PNOR base at 0x80060300fc000000
HBI partition has ECC
HBI is in 0x00426200 through 0x0175f037
CBFS: Found 'fallback/ramstage' @0xe7c0 size 0xd34c in mcache @0xf8231080
BS: romstage times (exec / console): total (unknown) / 16 ms

coreboot-4.14-400-g372ee3d300 Thu Dec  9 13:01:48 UTC 2021 ramstage starting (log level: 7)...
Enumerating buses...
Root Device scanning...
DD21, boot core

coreboot-4.14-400-g372ee3d300 Thu Dec  9 13:01:48 UTC 2021 bootblock starting (log level: 7)...
FFS header at 0x80060300ffff7000
PNOR base at 0x80060300fc000000
HBI partition has ECC
HBI is in 0x00426200 through 0x0175f037
FMAP: Found "FLASH" version 1.1 at 0x20000.
FMAP: base = 0x0 size = 0x200000 #areas = 4
FMAP: area COREBOOT found @ 20200 (1965568 bytes)
CBFS: mcache @0xf8231000 built for 8 files, used 0x1a8 of 0x2000 bytes
CBFS: Found 'fallback/romstage' @0x80 size 0xe6c3 in mcache @0xf823102c
BS: bootblock times (exec / console): total (unknown) / 2 ms

coreboot-4.14-400-g372ee3d300 Thu Dec  9 13:01:48 UTC 2021 romstage starting (log level: 7)...
IPMI: romstage PNP BT 0xe4
Get BMC self test result...Function Not Implemented
Initializing IPMI BMC watchdog timer
IPMI BMC watchdog initialized and started.
starting istep 10.10
ending istep 10.10
starting istep 10.12
ending istep 10.12
starting istep 10.13
ending istep 10.13
FFS header at 0x80060300ffff7000
PNOR base at 0x80060300fc000000
MEMD partition has ECC
MEMD is in 0x03cef200 through 0x03cfb917
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address 50
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address 51
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address 53
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address D4
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address D5
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address D6
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address D7
SPD @ 0x52
SPD: module type is DDR4
SPD: module part number is M393A1K43BB0-CRC    
SPD: banks 16, ranks 1, rows 16, columns 10, density 8192 Mb
SPD: device width 8 bits, bus width 64 bits
SPD: module size is 8192 MB (per channel)
starting istep 13.2
ending istep 13.2
starting istep 13.3
putRing took 3 ms
ending istep 13.3
starting istep 13.4
ending istep 13.4
starting istep 13.6
ending istep 13.6
starting istep 13.8
Base epsilon values read from table:
 R_T[0] = 10
 R_T[1] = 10
 R_T[2] = 79
 W_T[0] = 0
 W_T[1] = 21
Scaled epsilon values based on +20 percent guardband:
 R_T[0] = 12
 R_T[1] = 12
 R_T[2] = 95
 W_T[0] = 0
 W_T[1] = 26
Please FIXME: ATTR_MSS_RUNTIME_MEM_THROTTLED_N_COMMANDS_PER_SLOT
ending istep 13.8
starting istep 13.9
Error detected in IOM_PHY0_DDRPHY_FIR_REG: 0x80
ending istep 13.9
starting istep 13.10
CCS took 2 us (3 us timeout), 1 instruction(s)
CCS took 2 us (2 us timeout), 14 instruction(s)
RCD dump for I2C address 0x5a:
0xf820faa0: 80 b3 40 42 30 00 00 00 02 01 00 03 cb e3 c0 0d  ..@B0...........
0xf820fab0: 00 00 39 00 00 00 00 00 00 00 07 00 00 00 00 00  ..9.............
ending istep 13.10
starting istep 13.11
CCS took 2 us (7 us timeout), 2 instruction(s)
Write Leveling starting
CCS took 13 us (92 us timeout), 5 instruction(s)
Write Leveling done
Initial Pattern Write starting
CCS took 6 us (38 us timeout), 5 instruction(s)
Initial Pattern Write done
DQS alignment starting
CCS took 9 us (44 us timeout), 1 instruction(s)
DQS alignment done
Read Clock Alignment starting
CCS took 8 us (82 us timeout), 1 instruction(s)
Read Clock Alignment done
Read Centering starting
CCS took 36 us (120 us timeout), 1 instruction(s)
Read Centering done
Write Centering starting
CCS took 9627 us (11314 us timeout), 7 instruction(s)
Write Centering done
Coarse write/read starting
CCS took 5 us (24 us timeout), 1 instruction(s)
Coarse write/read done
MCS0 MCA1 DIMM0 has 0 bad nibble(s) and 0 bad bit(s), but can be recovered
ending istep 13.11
starting istep 13.13
ending istep 13.13
starting istep 14.1
MCBIST0 took 613672 us
ending istep 14.1
starting istep 14.3
Initializing PEC0...
Initializing PEC1...
Initializing PEC2...
Initializing PHB0...
Initializing PHB1...
Initializing PHB2...
Initializing PHB3...
Initializing PHB4...
Initializing PHB5...
ending istep 14.3
starting istep 14.5
ending istep 14.5
0xF000F = 221d104900008040
CBMEM:
IMD: root @ 0xffeff000 254 entries.
IMD: root @ 0xffefec00 62 entries.
FMAP: area COREBOOT found @ 20200 (1965568 bytes)
FFS header at 0x80060300ffff7000
PNOR base at 0x80060300fc000000
HBI partition has ECC
HBI is in 0x00426200 through 0x0175f037
CBFS: Found 'fallback/ramstage' @0xe7c0 size 0xd34c in mcache @0xf8231080
BS: romstage times (exec / console): total (unknown) / 16 ms

coreboot-4.14-400-g372ee3d300 Thu Dec  9 13:01:48 UTC 2021 ramstage starting (log level: 7)...
Enumerating buses...
Root Device scanning...
DD21, boot core:

coreboot-4.14-400-g372ee3d300 Thu Dec  9 13:01:48 UTC 2021 bootblock starting (log level: 7)...
FFS header at 0x80060300ffff7000
PNOR base at 0x80060300fc000000
HBI partition has ECC
HBI is in 0x00426200 through 0x0175f037
FMAP: Found "FLASH" version 1.1 at 0x20000.
FMAP: base = 0x0 size = 0x200000 #areas = 4
FMAP: area COREBOOT found @ 20200 (1965568 bytes)
CBFS: mcache @0xf8231000 built for 8 files, used 0x1a8 of 0x2000 bytes
CBFS: Found 'fallback/romstage' @0x80 size 0xe6c3 in mcache @0xf823102c
BS: bootblock times (exec / console): total (unknown) / 2 ms

coreboot-4.14-400-g372ee3d300 Thu Dec  9 13:01:48 UTC 2021 romstage starting (log level: 7)...
IPMI: romstage PNP BT 0xe4
Get BMC self test result...Function Not Implemented
Initializing IPMI BMC watchdog timer
IPMI BMC watchdog initialized and started.
starting istep 10.10
ending istep 10.10
starting istep 10.12
ending istep 10.12
starting istep 10.13
ending istep 10.13
FFS header at 0x80060300ffff7000
PNOR base at 0x80060300fc000000
MEMD partition has ECC
MEMD is in 0x03cef200 through 0x03cfb917
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address 50
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address 51
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address 53
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address D4
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address D5
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address D6
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address D7
SPD @ 0x52
SPD: module type is DDR4
SPD: module part number is M393A1K43BB0-CRC    
SPD: banks 16, ranks 1, rows 16, columns 10, density 8192 Mb
SPD: device width 8 bits, bus width 64 bits
SPD: module size is 8192 MB (per channel)
starting istep 13.2
ending istep 13.2
starting istep 13.3
putRing took 3 ms
ending istep 13.3
starting istep 13.4
ending istep 13.4
starting istep 13.6
ending istep 13.6
starting istep 13.8
Base epsilon values read from table:
 R_T[0] = 10
 R_T[1] = 10
 R_T[2] = 79
 W_T[0] = 0
 W_T[1] = 21
Scaled epsilon values based on +20 percent guardband:
 R_T[0] = 12
 R_T[1] = 12
 R_T[2] = 95
 W_T[0] = 0
 W_T[1] = 26
Please FIXME: ATTR_MSS_RUNTIME_MEM_THROTTLED_N_COMMANDS_PER_SLOT
ending istep 13.8
starting istep 13.9
Error detected in IOM_PHY0_DDRPHY_FIR_REG: 0x80
ending istep 13.9
starting istep 13.10
CCS took 2 us (3 us timeout), 1 instruction(s)
CCS took 2 us (2 us timeout), 14 instruction(s)
RCD dump for I2C address 0x5a:
0xf820faa0: 80 b3 40 42 30 00 00 00 02 01 00 03 cb e3 c0 0d  ..@B0...........
0xf820fab0: 00 00 39 00 00 00 00 00 00 00 07 00 00 00 00 00  ..9.............
ending istep 13.10
starting istep 13.11
CCS took 2 us (7 us timeout), 2 instruction(s)
Write Leveling starting
CCS took 13 us (92 us timeout), 5 instruction(s)
Write Leveling done
Initial Pattern Write starting
CCS took 6 us (38 us timeout), 5 instruction(s)
Initial Pattern Write done
DQS alignment starting
CCS took 10 us (44 us timeout), 1 instruction(s)
DQS alignment done
Read Clock Alignment starting
CCS took 8 us (82 us timeout), 1 instruction(s)
Read Clock Alignment done
Read Centering starting
CCS took 36 us (120 us timeout), 1 instruction(s)
Read Centering done
Write Centering starting
CCS took 8527 us (11314 us timeout), 7 instruction(s)
Write Centering done
Coarse write/read starting
CCS took 5 us (24 us timeout), 1 instruction(s)
Coarse write/read done
MCS0 MCA1 DIMM0 has 0 bad nibble(s) and 0 bad bit(s), but can be recovered
ending istep 13.11
starting istep 13.13
ending istep 13.13
starting istep 14.1
MCBIST0 took 613670 us
ending istep 14.1
starting istep 14.3
Initializing PEC0...
Initializing PEC1...
Initializing PEC2...
Initializing PHB0...
Initializing PHB1...
Initializing PHB2...
Initializing PHB3...
Initializing PHB4...
Initializing PHB5...
ending istep 14.3
starting istep 14.5
ending istep 14.5
0xF000F = 221d104900008040
CBMEM:
IMD: root @ 0xffeff000 254 entries.
IMD: root @ 0xffefec00 62 entries.
FMAP: area COREBOOT found @ 20200 (1965568 bytes)
FFS header at 0x80060300ffff7000
PNOR base at 0x80060300fc000000
HBI partition has ECC
HBI is in 0x00426200 through 0x0175f037
CBFS: Found 'fallback/ramstage' @0xe7c0 size 0xd34c in mcache @0xf8231080
BS: romstage times (exec / console): total (unknown) / 16 ms

coreboot-4.14-400-g372ee3d300 Thu Dec  9 13:01:48 UTC 2021 ramstage starting (log level: 7)...
Enumerating buses...
Root Device scanning...
DD21, boot core: 18
 boot core

coreboot-4.14-400-g372ee3d300 Thu Dec  9 13:01:48 UTC 2021 bootblock starting (log level: 7)...
FFS header at 0x80060300ffff7000
PNOR base at 0x80060300fc000000
HBI partition has ECC
HBI is in 0x00426200 through 0x0175f037
FMAP: Found "FLASH" version 1.1 at 0x20000.
FMAP: base = 0x0 size = 0x200000 #areas = 4
FMAP: area COREBOOT found @ 20200 (1965568 bytes)
CBFS: mcache @0xf8231000 built for 8 files, used 0x1a8 of 0x2000 bytes
CBFS: Found 'fallback/romstage' @0x80 size 0xe6c3 in mcache @0xf823102c
BS: bootblock times (exec / console): total (unknown) / 2 ms

coreboot-4.14-400-g372ee3d300 Thu Dec  9 13:01:48 UTC 2021 romstage starting (log level: 7)...
IPMI: romstage PNP BT 0xe4
Get BMC self test result...Function Not Implemented
Initializing IPMI BMC watchdog timer
IPMI BMC watchdog initialized and started.
starting istep 10.10
ending istep 10.10
starting istep 10.12
ending istep 10.12
starting istep 10.13
ending istep 10.13
FFS header at 0x80060300ffff7000
PNOR base at 0x80060300fc000000
MEMD partition has ECC
MEMD is in 0x03cef200 through 0x03cfb917
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address 50
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address 51
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address 53
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address D4
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address D5
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address D6
I2C transfer failed (0x04011f0104000000)
No memory DIMM at address D7
SPD @ 0x52
SPD: module type is DDR4
SPD: module part number is M393A1K43BB0-CRC    
SPD: banks 16, ranks 1, rows 16, columns 10, density 8192 Mb
SPD: device width 8 bits, bus width 64 bits
SPD: module size is 8192 MB (per channel)
starting istep 13.2
ending istep 13.2
starting istep 13.3
putRing took 3 ms
ending istep 13.3
starting istep 13.4
ending istep 13.4
starting istep 13.6
ending istep 13.6
starting istep 13.8
Base epsilon values read from table:
 R_T[0] = 10
 R_T[1] = 10
 R_T[2] = 79
 W_T[0] = 0
 W_T[1] = 21
Scaled epsilon values based on +20 percent guardband:
 R_T[0] = 12
 R_T[1] = 12
 R_T[2] = 95
 W_T[0] = 0
 W_T[1] = 26
Please FIXME: ATTR_MSS_RUNTIME_MEM_THROTTLED_N_COMMANDS_PER_SLOT
ending istep 13.8
starting istep 13.9
Error detected in IOM_PHY0_DDRPHY_FIR_REG: 0x80
ending istep 13.9
starting istep 13.10
CCS took 2 us (3 us timeout), 1 instruction(s)
CCS took 2 us (2 us timeout), 14 instruction(s)
RCD dump for I2C address 0x5a:
0xf820faa0: 80 b3 40 42 30 00 00 00 02 01 00 03 cb e3 c0 0d  ..@B0...........
0xf820fab0: 00 00 39 00 00 00 00 00 00 00 07 00 00 00 00 00  ..9.............
ending istep 13.10
starting istep 13.11
CCS took 2 us (7 us timeout), 2 instruction(s)
Write Leveling starting
CCS took 13 us (92 us timeout), 5 instruction(s)
Write Leveling done
Initial Pattern Write starting
CCS took 6 us (38 us timeout), 5 instruction(s)
Initial Pattern Write done
DQS alignment starting
CCS took 10 us (44 us timeout), 1 instruction(s)
DQS alignment done
Read Clock Alignment starting
CCS took 8 us (82 us timeout), 1 instruction(s)
Read Clock Alignment done
Read Centering starting
CCS took 37 us (120 us timeout), 1 instruction(s)
Read Centering done
Write Centering starting
CCS took 8423 us (11314 us timeout), 7 instruction(s)
Write Centering done
Coarse write/read starting
CCS took 5 us (24 us timeout), 1 instruction(s)
Coarse write/read done
MCS0 MCA1 DIMM0 has 0 bad nibble(s) and 0 bad bit(s), but can be recovered
ending istep 13.11
starting istep 13.13
ending istep 13.13
starting istep 14.1
MCBIST0 took 613633 us
ending istep 14.1
starting istep 14.3
Initializing PEC0...
Initializing PEC1...
Initializing PEC2...
Initializing PHB0...
Initializing PHB1...
Initializing PHB2...
Initializing PHB3...
Initializing PHB4...
Initializing PHB5...
ending istep 14.3
starting istep 14.5
ending istep 14.5
0xF000F = 221d104900008040
CBMEM:
IMD: root @ 0xffeff000 254 entries.
IMD: root @ 0xffefec00 62 entries.
FMAP: area COREBOOT found @ 20200 (1965568 bytes)
FFS header at 0x80060300ffff7000
PNOR base at 0x80060300fc000000
HBI partition has ECC
HBI is in 0x00426200 through 0x0175f037
CBFS: Found 'fallback/ramstage' @0xe7c0 size 0xd34c in mcache @0xf8231080
BS: romstage times (exec / console): total (unknown) / 16 ms

coreboot-4.14-400-g372ee3d300 Thu Dec  9 13:01:48 UTC 2021 ramstage starting (log level: 7)...
Enumerating buses...
Root Device scanning...

Was able to run:

root@talos:~# while read scom; do if [[ "$scom" == "0x"* ]]; then pdbg -P pib0 getscom $scom; else echo "$scom"; fi; done < /tmp/fir_scoms.txt > /tmp/scom_dump.log

Which may or may not be helpful here.
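
The same loop in a more readable form, as a sketch. `dump_scoms` and the `SCOM_READER` override are hypothetical names added for illustration; the default reader is the `pdbg -P pib0 getscom` call from the one-liner above.

```shell
#!/bin/sh
# Lines starting with "0x" are treated as SCOM addresses and passed to the
# reader command; every other line is copied through unchanged.
# SCOM_READER is a hypothetical override, handy for dry runs off the BMC.
dump_scoms() {
    reader=${SCOM_READER:-pdbg -P pib0 getscom}
    while read -r line; do
        case $line in
            0x*) $reader "$line" ;;          # reader is word-split on purpose
            *)   printf '%s\n' "$line" ;;
        esac
    done < "$1"
}
```

On the BMC this would be called as `dump_scoms /tmp/fir_scoms.txt > /tmp/scom_dump.log`. Setting `SCOM_READER=echo` first shows which addresses would be read without touching the hardware.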

user@talos-tests:~$ scp root@talos:/var/log/obmc-console.log obmc-console.log
root@talos's password: 
obmc-console.log                                                                                                    100%   15KB 886.6KB/s   00:00    
user@talos-tests:~$ scp root@talos:/tmp/scom_dump.log scom_dump.log
root@talos's password: 
scom_dump.log                                                                                                       100%  114KB   2.0MB/s   00:00 

Attachments: obmc-console.log, scom_dump.log

krystian-hebel commented 2 years ago

Will take a closer look later, but now it definitely is a different issue than before, and different from #80. The one from the previous comment reports a recoverable error in a cache chiplet, while #80 reported a checkstop for a core chiplet, although in a core that is connected to the same cache chiplet.

For now I'll wait for info from my supervisors as to what to do with the bad 0.4.1 binaries that were released, then we will decide whether to continue debugging here or open a new issue.

macpijan commented 2 years ago

We believe the issue observed in https://github.com/Dasharo/dasharo-issues/issues/80 is the same as this one.

The problem is probably not the dual CPU itself; in such a case, it would work just fine on the v0.4.1 version provided by @krystian-hebel.

The reported problem is most likely related to the memory (and/or the CPU? You've got a slightly different, older revision).

Any chance you've got more memory modules to try out, or can use different slots, as suggested here: https://github.com/Dasharo/dasharo-issues/issues/80#issuecomment-1121042277
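
When swapping modules or slots, the SPD probe results in the console log show which addresses the firmware saw as populated. A small sketch that summarizes them, assuming the `No memory DIMM at address` and `SPD @` line formats from the logs in this issue; `spd_summary` is a hypothetical helper name.

```shell
#!/bin/sh
# Summarize SPD probe results from a saved console log: one line per
# address, marked "absent" or "present". Repeated boots are de-duplicated.
spd_summary() {
    sed -n \
        -e 's/^No memory DIMM at address \([0-9A-Fa-f]*\).*/\1 absent/p' \
        -e 's/^SPD @ 0x\([0-9A-Fa-f]*\).*/\1 present/p' \
        "$1" | LC_ALL=C sort -u
}
```

Comparing the output before and after moving a module makes it clear whether the firmware detects it in the new slot.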

Of course, hostboot deals with this setup, so this should be fixable on the firmware level. It is a matter of finding out the root cause.

tlaurion commented 2 years ago

Basically, I think this issue can be closed, provided an HCL is published on Dasharo Universe specifying the platform that was tested (CPUs, memory and board revision).

Otherwise, #80, in my test case, would be a ~duplicate and that release will continue to not work.

We cannot change the past (0.4.1), whereas 0.5 tests will lead to a newer release.

tlaurion commented 2 years ago

@macpijan this issue should be closed