andikleen / mcelog

Linux kernel machine check handling middleware
http://www.mcelog.org
GNU General Public License v2.0

MCE can't trigger dimm error when error occurs on the purley platform #67

Open jian265 opened 6 years ago

andikleen commented 6 years ago

More information please. e.g. a log?

jian265 commented 6 years ago

When I trigger a corrected memory error, mcelog produces a log entry that includes the socket and channel of the failing memory. But the dimm-error-trigger gets a result whose socket number is 0, channel number is 1, and DIMM number is -1. That means it can't find the right DIMM number.

jian265 commented 6 years ago

Another question: how does mce get the DIMM number? Thank you!

jian265 commented 6 years ago

In the file skylake_xeon.c, skylake_s_decode_model decodes the memory error using mc13, but Intel replied that this has been changed to CSR and that mc13 is reserved:

    case 16:
    case 17:
    case 18:
        Wprintf("MemCtrl: ");
        if (EXTRACT(status, 27, 27))
            decode_bitfield(status, memctrl_mc13);
        else
            decode_bitfield(status, mc_bits);
        break;
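For readers unfamiliar with the `EXTRACT(status, 27, 27)` test quoted above, here is a minimal sketch of what such a bit-range extraction does. It assumes that `EXTRACT(v, lo, hi)` returns bits `lo` through `hi` inclusive, as the call suggests. The names `extract_bits` and `uses_mc13_table` are hypothetical and are not part of mcelog.

```c
#include <stdint.h>

/* Hypothetical helper: return bits lo..hi (inclusive) of v.
 * Assumed to mirror what EXTRACT(status, lo, hi) does in the quoted code. */
static inline uint64_t extract_bits(uint64_t v, unsigned lo, unsigned hi)
{
    unsigned width = hi - lo + 1;
    /* Avoid an undefined 64-bit shift when the whole word is requested. */
    uint64_t mask = (width >= 64) ? ~0ULL : ((1ULL << width) - 1);
    return (v >> lo) & mask;
}

/* Hypothetical helper: returns 1 when bit 27 of status is set, which is the
 * branch the quoted code sends to decode_bitfield(status, memctrl_mc13). */
static inline int uses_mc13_table(uint64_t status)
{
    return extract_bits(status, 27, 27) != 0;
}
```

With this sketch, a status word with only bit 27 set takes the `memctrl_mc13` branch, and a status of 0 falls through to `mc_bits`.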

aegl commented 6 years ago

On Skylake, machine check banks 13-18 are used to report errors from the memory controllers (2 memory controllers on each socket, 3 channels on each memory controller ... so 6 banks needed in total).
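The bank arithmetic in that comment can be written down as a small sketch. All identifiers here are hypothetical. The code only encodes the figures given above (banks 13-18, 2 controllers, 3 channels each). It deliberately does not map a bank to a specific controller or channel, because that ordering is not stated in this thread.

```c
#include <stdbool.h>

/* Figures taken from the comment above: 2 memory controllers per socket,
 * 3 channels per controller, reported through banks 13-18. */
enum {
    SKX_IMC_PER_SOCKET = 2,
    SKX_CHAN_PER_IMC   = 3,
    SKX_FIRST_IMC_BANK = 13,
    SKX_LAST_IMC_BANK  = 18
};

/* Hypothetical helper: true if the bank reports memory controller errors. */
static inline bool skx_is_memctrl_bank(int bank)
{
    return bank >= SKX_FIRST_IMC_BANK && bank <= SKX_LAST_IMC_BANK;
}

/* Banks needed per socket: one per (controller, channel) pair. */
static inline int skx_memctrl_bank_count(void)
{
    return SKX_IMC_PER_SOCKET * SKX_CHAN_PER_IMC;
}
```

The count returned by `skx_memctrl_bank_count()` matches the size of the 13-18 range, which is the "6 banks needed in total" from the comment.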

mcelog hasn't been able to convert addresses to DIMMs for a few generations of CPUs because interleaving of addresses between memory controllers and channels isn't reported in the machine check bank.

Load the skx_edac.ko Linux EDAC driver if you need errors decoded to a specific DIMM.