Open jian265 opened 6 years ago
When I trigger a corrected memory error, mcelog produces a record that includes the socket and channel of the failing memory. However, the dimm-error-trigger receives a result in which the socket number is 0, the channel number is 1, and the DIMM number is -1, which means it could not find the right DIMM number.
Another question: how does mce get the DIMM number? Thank you!
In the file skylake_xeon.c, skylake_s_decode_model decodes the memory error using mc13, but Intel replied that this has been changed to CSR and that mc13 is reserved.

    case 16:
    case 17:
    case 18:
        Wprintf("MemCtrl: ");
        if (EXTRACT(status, 27, 27))
            decode_bitfield(status, memctrl_mc13);
        else
            decode_bitfield(status, mc_bits);
        break;
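The quoted branch depends on a single bit of the status value. Here is a minimal sketch of that test, assuming `EXTRACT(status, 27, 27)` returns the bits from position 27 through 27 inclusive. The names `extract_bits` and `uses_mc13_table` are hypothetical and are not part of mcelog.

```c
#include <stdint.h>

/* Hypothetical helper (not mcelog's own macro): return bits lo..hi
 * (inclusive, 0-based) of a 64-bit value. Assumes 0 <= lo <= hi <= 63. */
static uint64_t extract_bits(uint64_t value, unsigned lo, unsigned hi)
{
	unsigned width = hi - lo + 1;
	uint64_t mask = (width == 64) ? ~0ULL : ((1ULL << width) - 1);

	return (value >> lo) & mask;
}

/* Illustrative only: returns 1 when bit 27 of the status value is set,
 * i.e. when the quoted code would pick memctrl_mc13 over mc_bits. */
static int uses_mc13_table(uint64_t status)
{
	return extract_bits(status, 27, 27) != 0;
}
```

Calling `uses_mc13_table` on a raw status value from a log shows which of the two tables the quoted code would use for that record.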
On Skylake, machine check banks 13-18 are used to report errors from the memory controllers (2 memory controllers on each socket, 3 channels on each memory controller, so 6 banks are needed in total).
mcelog hasn't been able to convert addresses to DIMMs for a few generations of CPUs because interleaving of addresses between memory controllers and channels isn't reported in the machine check bank.
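To illustrate why the address alone is not enough, here is a toy model, which is not Skylake's real address decoding. If consecutive blocks of physical addresses are spread round-robin across channels, the channel an address lands on depends on the interleave granularity and the number of ways. The function name, the granularity, and the way counts below are made up for illustration.

```c
#include <stdint.h>

/* Toy model only: round-robin interleave of fixed-size blocks across
 * `ways` channels. Real memory controllers use more complex decoding.
 * Assumes granularity > 0 and ways > 0. */
static unsigned toy_channel_for_addr(uint64_t addr, uint64_t granularity,
				     unsigned ways)
{
	return (unsigned)((addr / granularity) % ways);
}
```

In this model, address 0x1000 with a 256-byte granularity maps to channel 0 under 2-way interleave but to channel 1 under 3-way interleave. A decoder that does not know the interleave configuration cannot choose between them.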
Load the skx_edac.ko Linux EDAC driver if you need errors decoded to a specific DIMM.
More information please. e.g. a log?