Closed tlaurion closed 2 years ago
Sorry for the delay, I was busy with other projects and then with reading the provided log (almost 200k lines). Based on it, and on additional SCOM dumps performed on the platform, I have a rough idea of what happens. There were actually multiple errors (11 to be exact) reported in FIR, but some of them were probably caused by trying to load an exception handler that does not exist.
We messed up a bit during the simplification done to various isteps, like this one. As a result, the bit that says the current CPU is the only one in the system is not set. Because of that, when code tries to write to RAM, a message is sent to the other CPU telling it to flush and invalidate its L3 cache. That other CPU does not exist, so the main CPU never receives an acknowledgement for that message. With a quick and dirty fix in place (hardcoding the 1 CPU version), "only" 4 errors were reported.
Further issues were caused by misunderstanding what ATTR_PROC_FABRIC_X_LINKS_CNFG is: we thought it was the total number of X links, but it is the number of X links in use. For Nimbus this is 1 when there are 2 CPUs, 0 otherwise. With some additional hacks we got down to 2 reported errors and a slightly different manifestation of the problem: now the platform doesn't reboot but gets into an infinite loop instead, although this may be caused by a different layout of the code with regard to cache lines.
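To make the two fixes above concrete, here is a minimal sketch of how both values follow from the CPU count. The struct and function names are made up for illustration and are not coreboot or Hostboot APIs; the only facts taken from this thread are that the "only CPU in the system" bit must be set on 1 CPU platforms and that the Nimbus X-links-in-use value is 1 with 2 CPUs and 0 otherwise.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the two values discussed above. */
struct fabric_config {
	bool single_chip;       /* "this CPU is the only one in the system" */
	uint8_t x_links_in_use; /* X links in use, NOT the total available */
};

/* Hypothetical helper: derive both values from the number of CPUs present. */
static struct fabric_config fabric_config_for(unsigned int num_cpus)
{
	struct fabric_config cfg;

	/* Without this, cache flush/invalidate requests wait for a CPU that isn't there. */
	cfg.single_chip = (num_cpus == 1);

	/* Nimbus: 1 when there are 2 CPUs, 0 otherwise. */
	cfg.x_links_in_use = (num_cpus == 2) ? 1 : 0;

	return cfg;
}
```

Deriving both from a single input keeps them from drifting apart, which is roughly what went wrong when the value was treated as a constant link count.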
I also noticed some differences that will become important when we get to OCC initialization, but we're not that far just yet. Also, if OCC was (partially) started earlier, it should be able to gather FIR SCOM, basically what I tried to get with the scripts a few comments above. This would however require starting OCC (at least) twice, and as I didn't think it would be necessary, we decided to skip it in coreboot. AFAIK there is no ready-to-use tool for parsing those dumps.
Right now we're about to remove one CPU from our platform and see if we get exactly the same issue, as at least some of these problems should happen on every 1 CPU platform, if I understood it correctly.
Update: with additional changes coreboot is able to boot to ramstage, now it stops at the "No WOF table match found" issue.
The final cause was a wrong Power Bus frequency. Hostboot read it from MVPD, which we mimicked in coreboot, but what we hadn't noticed is that the value is later overwritten with a hardcoded one. The hardcoded value is identical to the value from MVPD for our CPU, which is why it worked in the first place.
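As a rough illustration of why this went unnoticed on our hardware, the sketch below models the two sources of the Power Bus frequency. All names are hypothetical and no real frequency values are implied; the only behavior taken from this thread is that the later hardcoded write wins over the MVPD value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the two Power Bus frequency sources. */
struct pb_freq {
	uint32_t mvpd_mhz;      /* value read from MVPD */
	uint32_t hardcoded_mhz; /* value written later, overriding MVPD */
};

/* The frequency actually in effect: the later hardcoded write wins. */
static uint32_t pb_freq_effective(const struct pb_freq *f)
{
	return f->hardcoded_mhz;
}

/* True when code that stops at the MVPD value would use the wrong frequency. */
static bool pb_freq_mvpd_diverges(const struct pb_freq *f)
{
	return f->mvpd_mhz != pb_freq_effective(f);
}
```

On a CPU where both values match, `pb_freq_mvpd_diverges()` is false and the MVPD-only code path happens to work; on any other CPU it is true and the bug shows up.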
WOF is another issue altogether, but I think it shouldn't be too hard to fix. We just didn't think that there are processors which aren't included in the WOF table, but this is what community testing is all about: catching such corner cases.
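For illustration only, a table lookup that treats "no match" as a normal result could look like the sketch below. The entry layout and the single `key` field are assumptions made for this example; the real WOF tables are matched on other parameters, and this is not the actual coreboot code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry; the real match criteria are not modeled here. */
struct wof_entry {
	uint32_t key;     /* stand-in for whatever identifies a processor */
	const char *name; /* label for the entry, for logging */
};

/* Return the matching entry, or NULL when the processor is not in the table. */
static const struct wof_entry *wof_find(const struct wof_entry *tbl, size_t n,
					uint32_t key)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].key == key)
			return &tbl[i];
	}
	return NULL; /* caller decides how to proceed without a match */
}
```

Whether a missing entry should stop the boot or just be logged is a design decision for the caller; the point is only that the lookup itself reports the corner case instead of assuming a match exists.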
Latest tests with the unreleased version supported my single CPU!
The current issue is getting the Heads payload to output to the VGA console; it seems to be missing either AST+DRM in the kernel config and/or properly passed skiboot arguments.
Off-channel notes:
@krystian-hebel I'm searching for the tool that was developed to sit on the BMC and collect the logs without nohup. Can you tag me and point me to where it is? That should be added to a debugging page for the Talos II board.
I think it was here: https://github.com/3mdeb/openpower-coreboot-docs/pull/74/files
Please let me know what you believe would be the best place to put it
@macpijan : https://docs.dasharo.com/variants/talos_2/overview/ should have minimally a link to https://github.com/3mdeb/openpower-coreboot-docs
But a debugging page draft would be useful, pointing directly to https://github.com/3mdeb/openpower-coreboot-docs/blob/main/devnotes/scat/README.md to facilitate bug reporting?
Maybe we should move all user-level documentation to docs.dasharo.com and leave openpower-coreboot-docs just for developer stuff like logs, istep analysis, early design considerations etc.? There is some overlap between those two repositories, e.g. release info or flashing instructions. What's worse, they are slightly different...
Dasharo version: 0.5.0 release from https://docs.dasharo.com/variants/talos_2/releases/
Dasharo variant: Workstation
Affected component(s) or functionality: Memory initialization fails (1x M393A1K43BB0-CRC in B1 memory slot)
Brief summary: The coreboot output stops at
How reproducible: Always (single 16-core CPU, one RAM module: 8GB; more info on changes that need to be documented in non-flashing instructions is under https://github.com/Dasharo/dasharo-issues/issues/79)
How to reproduce:
Steps to reproduce the behavior: On laptop:
On a separate SSH connection to BMC:
On another separate SSH connection to BMC:
Expected behavior: RAM init succeeds and the next steps are engaged
Actual behavior: Stops at