Dasharo / dasharo-issues

The Dasharo issue tracker
https://dasharo.com/

Dasharo does not work on KGPE-D16 with two CPUs #90

Open miczyg1 opened 2 years ago

miczyg1 commented 2 years ago

Dasharo version Dasharo for KGPE-D16

Dasharo variant KGPE-D16

Affected component(s) or functionality coreboot boot process

Brief summary coreboot resets during ECC memory initialization when two CPU sockets are populated

How reproducible Always

How to reproduce

Steps to reproduce the behavior:

  1. Flash Dasharo on KGPE-D16 platform with two CPUs populated
  2. Power on the platform
  3. Observe the reset loop during ECC memory initialization on the serial console.

Expected behavior The platform can boot to OS

Actual behavior The platform does not boot

Screenshots None

Additional context Maybe check the CMOS options and their default values? Some scrubber settings may be off, etc.

Solutions you've tried None

mrothfuss commented 2 years ago

this fixed it in my tests

https://github.com/Dasharo/coreboot/pull/116/commits/65ed5e1d015cbc0f7729f6103fc87b4d03a63b64

krystian-hebel commented 2 years ago

@mrothfuss unfortunately your fix doesn't work for us; the platform still reboots on the first access to any of the PCI devices created by the second CPU after scrubbing is enabled. We also haven't seen "negative DQS recovery delay detected!" on our platform, so it is probably a different issue.

With scrubbing disabled, it reboots soon after starting Linux, before anything is printed by it.

mrothfuss commented 2 years ago

Darn. I was hoping to contribute something.

In case it helps, I was testing with dual 6386s using the latest ucode. Dasharo would loop on RAM training, similar to what miczyg reported, but could eventually succeed. Dasharo+patch has worked fine under this setup (no boot issues or runtime instability).

mrothfuss commented 2 years ago

I was able to boot a D16 ROM provided by 3mdeb with dual CPUs without issue (hardware). I'm betting it has something to do with memory training. I did a partial audit of the memory initialization code, comparing it to the BKDG, and there are deviations. Looking at DQS timing results across many boots, some lanes were consistent while others had a wide distribution of values.
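
One way to quantify that per-lane spread is sketched below. This is only an illustration, not the script used for the observation above: it assumes the DQS delay values have already been pulled out of the serial logs into one list per boot, with one integer per byte lane. The function names, the input format, and the sample numbers are all made up for the example.

```python
from statistics import pstdev


def lane_spread(boots):
    """Return a (min, max, population std dev) tuple for each byte lane.

    `boots` is a list of boots; each boot is a list of delay values,
    one per byte lane. This input format is an assumption for the
    sketch, not a coreboot log format.
    """
    if not boots:
        return []
    lanes = len(boots[0])
    if any(len(b) != lanes for b in boots):
        raise ValueError("every boot must report the same number of lanes")
    result = []
    for lane in range(lanes):
        values = [b[lane] for b in boots]
        result.append((min(values), max(values), pstdev(values)))
    return result


def unstable_lanes(boots, max_range):
    """Indices of lanes whose (max - min) range exceeds `max_range`."""
    return [i for i, (lo, hi, _) in enumerate(lane_spread(boots))
            if hi - lo > max_range]


# Made-up sample: lane 0 is stable, lane 1 wanders between boots.
sample = [[0x30, 0x2A], [0x30, 0x41], [0x31, 0x1C]]
print(unstable_lanes(sample, max_range=4))
```

The threshold passed as `max_range` is arbitrary here; what counts as "too wide" would depend on the delay granularity of the register being logged.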

mrothfuss commented 1 year ago

This is probably related to #47, with faulty raminit as the underlying problem.

AGESA Fam15 code suggests that seeds for DQS Receiver Enable Training should be extensively determined for each motherboard. Seeds can be configured uniquely for every possible socket, channel, DIMM, and byte lane combination. The raptor raminit code is only using the recommended seeds from Table 99 of the BKDG.
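
As a rough illustration of the difference, a per-board seed table can be modeled as a lookup keyed by (socket, channel, DIMM, byte lane) that falls back to a single recommended default. The sketch below is in Python for brevity rather than the C used by coreboot and AGESA. The names are hypothetical and every numeric value is a placeholder, not a figure from the BKDG or from any board.

```python
def make_seed_lookup(board_overrides, default_seed):
    """Build a seed lookup function.

    `board_overrides` maps (socket, channel, dimm, lane) tuples to a
    board-specific seed. Any combination without an entry falls back
    to `default_seed`, which stands in for a single recommended value.
    """
    def lookup(socket, channel, dimm, lane):
        return board_overrides.get((socket, channel, dimm, lane), default_seed)
    return lookup


# Placeholder values only; real seeds would have to be characterized per board.
overrides = {(1, 0, 0, 3): 0x45}
seed_for = make_seed_lookup(overrides, default_seed=0x3E)

print(hex(seed_for(0, 0, 0, 3)))  # no override: falls back to the default
print(hex(seed_for(1, 0, 0, 3)))  # board-specific override applies
```

With an empty `board_overrides`, every lane gets the same default, which corresponds to using only the recommended seeds.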

I am using an alternate AGESA algorithm: a "Seedless" training method that does not require configuration. So far it has performed flawlessly on two boards (1xC32 + 4x32GB, 2xG34 + 16x32GB). It looks like this algorithm is designed to determine the seeds to allow proper configuration of the "normal" training method -- which I assume is faster at runtime.

See MemTRdPosWithRxEnDlySeeds3() in vendorcode/amd/agesa/f15/Proc/Mem/Tech/mtthrcSeedTrain.c for details.

Another detail to be aware of: the raptor raminit code deviates from the BKDG and does not perform multiple passes of memory training according to the specification.