andikleen / mcelog

Linux kernel machine check handling middleware
http://www.mcelog.org
GNU General Public License v2.0

mcelog didn't catch the mce memory #91

Open mysnoopy opened 3 years ago

mysnoopy commented 3 years ago

I found an MCE error in dmesg, but mcelog didn't catch it and /var/log/mcelog is empty:

[root@test ~]# dmesg -T | grep mce
[Tue Apr 21 16:02:26 2020] mce: Using 22 MCE banks
[Sat May 1 08:56:53 2021] mce: [Hardware Error]: Machine check events logged

[root@test ~]# mcelog --client
[root@test ~]# cat /var/log/mcelog
[root@test ~]#

[root@test ~]# cat /etc/mcelog/mcelog.conf
#
# config file for mcelog
# For further options, see the mcelog manpage and documentation
#
# by default, disable extended error logging on newer Intel processors
syslog = yes
logfile = /var/log/mcelog
no-imc-log = yes
# Filter out known broken events by default
filter = yes
# don't log memory errors individually
filter-memory-errors = yes
# output in undecoded raw format to be easier machine readable
raw = yes

[server]
# An upstream bug prevents this from being disabled
# Only allow root to connect by default
client-user = root
# Path to socket client uses to connect
socket-path = /var/run/mcelog-client

[dimm]
# Enable DIMM-tracking
dimm-tracking-enabled = yes
# Disable DIMM DMI pre-population unless supported on your system
dmi-prepopulate = no
# execute these triggers when the rate of corrected or uncorrected
# errors per DIMM exceeds the threshold
uc-error-trigger = dimm-error-trigger
uc-error-threshold = 1 / 24h
ce-error-trigger = dimm-error-trigger
ce-error-threshold = 10 / 24h

[socket]
# Memory error accounting per socket
socket-tracing-enabled = yes
mem-uc-error-threshold = 100 / 24h
mem-ce-error-trigger = socket-memory-error-trigger
mem-ce-error-threshold = 100 / 24h
mem-ce-error-log = yes

[cache]
# Attempt to off-line CPUs causing cache errors
cache-threshold-trigger = cache-error-trigger
cache-threshold-log = yes

[page]
# Try to soft-offline a 4K page if it exceeds the threshold
memory-ce-threshold = 10 / 24h
memory-ce-trigger = page-error-trigger
memory-ce-log = yes
memory-ce-action = soft

[trigger]
# Maximum number of running triggers
children-max = 2
directory = /etc/mcelog/triggers
[root@test ~]#

aegl commented 3 years ago

Was your kernel built with CONFIG_RAS_CEC=y? If so, you may have some corrected memory errors that were handled by the RAS_CEC code and not passed through to mcelog.
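One way to check this is to look for the option in the config file shipped with the running kernel. The sketch below is only an illustration: `ras_cec_enabled` is a hypothetical helper name, and the `/boot/config-<release>` path is an assumption that holds on many distributions but not all (some expose `/proc/config.gz` instead).

```python
import platform
import re
from pathlib import Path


def ras_cec_enabled(config_text: str) -> bool:
    """Return True if the kernel config text contains CONFIG_RAS_CEC=y."""
    # A disabled option appears as "# CONFIG_RAS_CEC is not set", which
    # does not match because the line must start with the option name.
    pattern = r"^CONFIG_RAS_CEC=y\s*$"
    return re.search(pattern, config_text, re.MULTILINE) is not None


if __name__ == "__main__":
    # Assumed location of the config file for the running kernel.
    path = Path("/boot") / f"config-{platform.release()}"
    if path.is_file():
        print(f"CONFIG_RAS_CEC enabled: {ras_cec_enabled(path.read_text())}")
    else:
        print(f"{path} not found; check /proc/config.gz or your distro's docs")
```

If this reports that the option is enabled, corrected memory errors may be absorbed by the kernel before mcelog ever sees them.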

dimaslv commented 2 years ago

Same problem here.

# grep CONFIG_RAS_CEC /boot/config-5.4.84-2.el7.x86_64 
# CONFIG_RAS_CEC is not set
# lsmod|grep -c edac
0
# mcelog --client
# mcelog --version
mcelog mcelog-144-9.94d853b2ea81.el7
# grep -vE "^#|^$" /etc/mcelog/mcelog.conf 
no-imc-log = yes
filter = yes
filter-memory-errors = yes
[server]
client-user = root
[dimm]
dimm-tracking-enabled = yes
dmi-prepopulate = yes
uc-error-trigger = dimm-error-trigger
ce-error-trigger = dimm-error-trigger
uc-error-threshold = 1 / 24h
ce-error-threshold = 1000 / 24h
[socket]
socket-tracking-enabled = yes
mem-uc-error-threshold = 1000 / 24h
mem-ce-error-threshold = 1000 / 24h
mem-ce-error-log = yes
[cache]
cache-threshold-log = yes
[page]
memory-ce-threshold = 10 / 24h
memory-ce-log = no
memory-ce-action = off
[trigger]
children-max = 2
directory = /etc/mcelog/triggers

The CPU is an Intel E5-2680 v3. /var/log/mcelog contains only "failed to prefill DIMM database from DMI data", and there are still a lot of errors in the kernel log:

Dec  7 16:01:31 srv kernel: [    6.653750] mce: [Hardware Error]: Machine check events logged
Dec  7 16:01:31 srv kernel: [    6.653820] mce: CMCI storm detected: switching to poll mode
Dec  7 16:01:31 srv kernel: [    6.654715] mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 7: cc163b8000010090
Dec  7 16:01:31 srv kernel: [    6.657713] mce: [Hardware Error]: TSC 0 ADDR 3f7f308140 MISC 42363686 
Dec  7 16:01:31 srv kernel: [    6.658714] mce: [Hardware Error]: PROCESSOR 0:306f2 TIME 1638882045 SOCKET 1 APIC 20 microcode 36
...
Dec  7 16:01:31 srv kernel: [    6.713718] mce: MCE records pool full!

All of those MCE errors in the log appeared only at boot (probably before mcelog started). Could it be that an early "CMCI storm detected: switching to poll mode" during boot effectively turns off passing MCEs to mcelog?
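For reference, the Bank 7 status word in the kernel log above can be decoded by hand when mcelog has not logged it. The following is a minimal sketch, assuming the architectural IA32_MCi_STATUS bit layout from the Intel SDM as I understand it. The function names are hypothetical, the corrected-error count field is only meaningful on CPUs that support CMCI, and mcelog's own decoder remains the authoritative one.

```python
def decode_mci_status(status: int) -> dict:
    """Split an IA32_MCi_STATUS value into its architectural fields."""
    return {
        "valid": bool((status >> 63) & 1),            # VAL: register holds an error
        "overflow": bool((status >> 62) & 1),         # OVER: earlier error overwritten
        "uncorrected": bool((status >> 61) & 1),      # UC: clear means corrected
        "enabled": bool((status >> 60) & 1),          # EN
        "misc_valid": bool((status >> 59) & 1),       # MISCV: MISC register is valid
        "addr_valid": bool((status >> 58) & 1),       # ADDRV: ADDR register is valid
        "context_corrupt": bool((status >> 57) & 1),  # PCC
        "corrected_count": (status >> 38) & 0x7FFF,   # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


def describe_memory_controller_code(code: int) -> str:
    """Describe an MCA error code of the form 0000 0000 1MMM CCCC."""
    if (code & 0xFF80) != 0x0080:
        return "not a memory controller error code"
    kinds = {0: "generic", 1: "memory read", 2: "memory write",
             3: "address/command", 4: "memory scrubbing"}
    kind = kinds.get((code >> 4) & 0x7, "reserved")
    channel = code & 0xF
    where = "unspecified channel" if channel == 0xF else f"channel {channel}"
    return f"{kind} error, {where}"


if __name__ == "__main__":
    fields = decode_mci_status(0xCC163B8000010090)  # value from the log above
    for name, value in fields.items():
        print(f"{name}: {value}")
    print(describe_memory_controller_code(fields["mca_error_code"]))
```

For the value in the log, this decodes to a valid, corrected (UC clear) memory read error with the overflow bit set, which is consistent with a storm of corrected memory errors.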

aegl commented 2 years ago

Old kernels didn't pass early errors to mcelog.

There is a fix in v5.15

See this commit: 3bff147b187d ("x86/mce: Defer processing of early errors")
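A rough way to tell whether a machine is likely to have this fix is to compare the running kernel release against 5.15. This is a sketch under stated assumptions: `kernel_at_least` is a hypothetical helper, and the version comparison is only a heuristic, since distribution kernels may backport the commit to older releases.

```python
import platform
import re


def kernel_at_least(release: str, minimum: tuple = (5, 15)) -> bool:
    """Return True if a release string such as '5.4.84-2.el7.x86_64'
    has a major.minor version greater than or equal to `minimum`."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


if __name__ == "__main__":
    release = platform.release()
    if kernel_at_least(release):
        print(f"{release}: mainline version includes the early-error fix")
    else:
        print(f"{release}: older than 5.15; early MCEs may not reach mcelog "
              "unless the distribution backported the fix")
```

The 5.4.84 kernel reported earlier in this thread falls below that threshold.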