Open mysnoopy opened 3 years ago
Was your kernel built with CONFIG_RAS_CEC=y? If so, you may have some corrected memory errors that were handled by the RAS_CEC code and not passed through to mcelog.
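For anyone who wants to script that check, here is a minimal sketch that reads a kernel config file and reports how an option is set. The function name `kconfig_value` is made up for illustration, and the `/boot/config-<release>` path is an assumption that holds on many distributions but not all.

```python
import os
import platform


def kconfig_value(config_text, option):
    """Return the value of a Kconfig option ("y", "m", ...), "n" if the
    file says it is not set, or None if the option is not mentioned."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {option} is not set":
            return "n"
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None


if __name__ == "__main__":
    # Assumed location of the running kernel's config; adjust as needed.
    path = f"/boot/config-{platform.release()}"
    if os.path.exists(path):
        with open(path) as fh:
            print("CONFIG_RAS_CEC:", kconfig_value(fh.read(), "CONFIG_RAS_CEC"))
    else:
        print(f"{path} not found")
```

A result of "y" means the corrected-error collector is built in and may be absorbing corrected memory errors before mcelog sees them.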
Same problem here.
# grep CONFIG_RAS_CEC /boot/config-5.4.84-2.el7.x86_64
# CONFIG_RAS_CEC is not set
# lsmod|grep -c edac
0
# mcelog --client
# mcelog --version
mcelog mcelog-144-9.94d853b2ea81.el7
# grep -vE "^#|^$" /etc/mcelog/mcelog.conf
no-imc-log = yes
filter = yes
filter-memory-errors = yes
[server]
client-user = root
[dimm]
dimm-tracking-enabled = yes
dmi-prepopulate = yes
uc-error-trigger = dimm-error-trigger
ce-error-trigger = dimm-error-trigger
uc-error-threshold = 1 / 24h
ce-error-threshold = 1000 / 24h
[socket]
socket-tracking-enabled = yes
mem-uc-error-threshold = 1000 / 24h
mem-ce-error-threshold = 1000 / 24h
mem-ce-error-log = yes
[cache]
cache-threshold-log = yes
[page]
memory-ce-threshold = 10 / 24h
memory-ce-log = no
memory-ce-action = off
[trigger]
children-max = 2
directory = /etc/mcelog/triggers
CPU: Intel E5-2680 v3.
/var/log/mcelog contains only "failed to prefill DIMM database from DMI data", and there are still a lot of errors in the kernel log:
Dec 7 16:01:31 srv kernel: [ 6.653750] mce: [Hardware Error]: Machine check events logged
Dec 7 16:01:31 srv kernel: [ 6.653820] mce: CMCI storm detected: switching to poll mode
Dec 7 16:01:31 srv kernel: [ 6.654715] mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 7: cc163b8000010090
Dec 7 16:01:31 srv kernel: [ 6.657713] mce: [Hardware Error]: TSC 0 ADDR 3f7f308140 MISC 42363686
Dec 7 16:01:31 srv kernel: [ 6.658714] mce: [Hardware Error]: PROCESSOR 0:306f2 TIME 1638882045 SOCKET 1 APIC 20 microcode 36
...
Dec 7 16:01:31 srv kernel: [ 6.713718] mce: MCE records pool full!
All those MCE errors in the log appeared only at boot (probably before mcelog started). Could it be that the early "CMCI storm detected: switching to poll mode" during boot effectively stops MCEs from being passed to mcelog?
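Since mcelog did not decode these events, the Bank 7 status word from the kernel log can be decoded by hand. The sketch below extracts the architectural fields of an IA32_MCi_STATUS value using the bit layout from the Intel SDM. The function name `decode_mci_status` is made up for illustration, and the corrected-error count field (bits 52:38) is assumed to be meaningful, which is only the case on CPUs that support CMCI.

```python
def decode_mci_status(status):
    """Split a 64-bit IA32_MCi_STATUS value into its architectural fields."""

    def bit(n):
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),            # VAL: register holds a logged error
        "overflow": bit(62),         # OVER: earlier errors were overwritten
        "uncorrected": bit(61),      # UC: error was not corrected
        "enabled": bit(60),          # EN: error reporting enabled
        "misc_valid": bit(59),       # MISCV: MISC register is valid
        "addr_valid": bit(58),       # ADDRV: ADDR register is valid
        "context_corrupt": bit(57),  # PCC: processor context corrupt
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    # Status value taken from the "CPU 12 ... Bank 7" line in the log above.
    for name, value in decode_mci_status(0xCC163B8000010090).items():
        print(f"{name:20} {value}")
```

For the value in the log, UC is clear and OVER is set, so these are corrected errors arriving faster than they were read out, which fits the "CMCI storm" message. The MCA error code 0x0090 matches the memory-controller error pattern (a memory read on channel 0).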
Old kernels didn't pass early errors to mcelog.
There is a fix in v5.15.
See commit 3bff147b187d ("x86/mce: Defer processing of early errors").
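A rough way to tell whether a given kernel is new enough is to compare its release string against 5.15. This is only a sketch: the helper name `kernel_at_least` is made up for illustration, and a plain version comparison cannot tell whether a distribution kernel (such as an el7 build) has the commit backported.

```python
import platform
import re


def kernel_at_least(release, minimum):
    """Compare the major.minor part of a kernel release string
    (e.g. "5.4.84-2.el7.x86_64") against a (major, minor) tuple."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


if __name__ == "__main__":
    release = platform.release()
    if kernel_at_least(release, (5, 15)):
        print(f"{release}: upstream version includes the early-error fix")
    else:
        print(f"{release}: older than v5.15, early errors may not reach mcelog")
```

The 5.4.84 kernel shown earlier in this thread falls in the second case.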
Found the MCE error in dmesg, but mcelog didn't catch it and /var/log/mcelog is empty:
[root@test ~]# dmesg -T | grep mce
[Tue Apr 21 16:02:26 2020] mce: Using 22 MCE banks
[Sat May 1 08:56:53 2021] mce: [Hardware Error]: Machine check events logged
[root@test ~]# mcelog --client
[root@test ~]# cat /var/log/mcelog
[root@test ~]#
[root@test ~]# cat /etc/mcelog/mcelog.conf
#
# config file for mcelog
# For further options, see the mcelog manpage and documentation
#
# by default, disable extended error logging on newer Intel processors
syslog = yes
logfile = /var/log/mcelog
no-imc-log = yes
# Filter out known broken events by default
filter = yes
# don't log memory errors individually
filter-memory-errors = yes
# output in undecoded raw format to be easier machine readable
raw = yes
[server]
# An upstream bug prevents this from being disabled
# Only allow root to connect by default
client-user = root
# Path to socket client uses to connect
socket-path = /var/run/mcelog-client
[dimm]
# Enable DIMM-tracking
dimm-tracking-enabled = yes
# Disable DIMM DMI pre-population unless supported on your system
dmi-prepopulate = no
# execute these triggers when the rate of corrected or uncorrected
# errors per DIMM exceeds the threshold
uc-error-trigger = dimm-error-trigger
uc-error-threshold = 1 / 24h
ce-error-trigger = dimm-error-trigger
ce-error-threshold = 10 / 24h
[socket]
# Memory error accounting per socket
socket-tracing-enabled = yes
mem-uc-error-threshold = 100 / 24h
mem-ce-error-trigger = socket-memory-error-trigger
mem-ce-error-threshold = 100 / 24h
mem-ce-error-log = yes
[cache]
# Attempt to off-line CPUs causing cache errors
cache-threshold-trigger = cache-error-trigger
cache-threshold-log = yes
[page]
# Try to soft-offline a 4K page if it exceeds the threshold
memory-ce-threshold = 10 / 24h
memory-ce-trigger = page-error-trigger
memory-ce-log = yes
memory-ce-action = soft
[trigger]
# Maximum number of running triggers
children-max = 2
directory = /etc/mcelog/triggers
[root@test ~]#