andikleen / mcelog

Linux kernel machine check handling middleware
http://www.mcelog.org
GNU General Public License v2.0

mcelog test failures #8

Open LarryBaker opened 11 years ago

LarryBaker commented 11 years ago

I have run the mcelog tests on two test machines in my lab: an Intel Atom and an Intel Xeon. Both have a couple of failures (not bad).

Atom:

    [root@atompc tests]# make test
    ./test cache ""
    ++++++++++++ running cache test +++++++++++++++++++
    mcelog: cache.c:92: parse_cpumap: Assertion `len == c * sizeof(unsigned)' failed.
    ./test: line 42: 3198 Aborted $D ../../mcelog --foreground --daemon --debug-numerrors --config $conf --logfile $log >> result

    [root@atompc tests]# cat */results
    cache.conf: no triggers at all
    cache.conf: triggers did not trigger as expected: 2 != 0

Xeon:

    [root@rincon1-ew tests]# cat */results
    socket-1.conf: triggers did not trigger as expected: 2 != 4
    socket-2.conf: triggers did not trigger as expected: 1 != 2
    socket-memdb.conf: triggers did not trigger as expected: 4 != 6

The O/S on both systems is CentOS 6.4 x86_64, and I wrote these instructions to run the mcelog tests:

Testing mcelog triggers requires mce-inject from the ras-utils package and page-types.c from the kernel-doc package.

    # yum install ras-utils kernel-doc
    # cd /usr/share/doc/kernel-doc-2.6.32/Documentation/vm
    # gcc -o page-types page-types.c
    # mv page-types /usr/bin/

Run the mcelog package test suite.

    # cd /root/rpmbuild/BUILD/mcelog-1.0pre3_20120814_2/tests
    # service mcelogd stop
    # ln -s /usr/sbin/mcelog ../mcelog
    # modprobe mce-inject
    # make clean
    # make test

The results are below.

Thanks,
Larry Baker

----- Atom -----

    [root@atompc tests]# rpm -q -a | grep mce
    mcelog-1.0pre3_20120814_2-0.6.el6.x86_64

    [root@atompc tests]# more /proc/cpuinfo
    processor : 0
    vendor_id : GenuineIntel
    cpu family : 6
    model : 28
    model name : Intel(R) Atom(TM) CPU D525 @ 1.80GHz
    stepping : 10
    cpu MHz : 1799.899
    cache size : 512 KB
    physical id : 0
    siblings : 2
    core id : 0
    cpu cores : 2
    apicid : 0
    initial apicid : 0
    fpu : yes
    fpu_exception : yes
    cpuid level : 10
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl tm2 ssse3 cx16 xtpr pdcm movbe lahf_lm dts
    bogomips : 3599.79
    clflush size : 64
    cache_alignment : 64
    address sizes : 36 bits physical, 48 bits virtual
    power management:

    processor : 1
    --More--(0%)

    [root@atompc tests]# make clean
    rm -f */*log
    rm -f */results

    [root@atompc tests]# make test
    ./test cache ""
    ++++++++++++ running cache test +++++++++++++++++++
    mcelog: no process killed
    mcelog: cache.c:92: parse_cpumap: Assertion `len == c * sizeof(unsigned)' failed.
    ./test: line 42: 3198 Aborted $D ../../mcelog --foreground --daemon --debug-numerrors --config $conf --logfile $log >> result
    ./test page ""
    ++++++++++++ running page test +++++++++++++++++++
    mcelog: no process killed
    ./test memdb ""
    ++++++++++++ running memdb test +++++++++++++++++++
    mcelog: no process killed
    ./test socket ""
    ++++++++++++ running socket test +++++++++++++++++++
    mcelog: no process killed
    ./test pfa ""
    ++++++++++++ running pfa test +++++++++++++++++++
    mcelog: no process killed
    +++ start the injection for page-account.conf +++
    inject for page type slab at physical address 0x13ca44000 [ NO. 0 ]
    inject for page type slab at physical address 0x13ca44000 [ NO. 1 ]
    +++ start the injection for page-hard.conf +++
    inject for page type slab at physical address 0x137d77000 [ NO. 0 ]
    inject for page type slab at physical address 0x137d77000 [ NO. 1 ]
    +++ start the injection for page-soft.conf +++
    inject for page type slab at physical address 0x12a355000 [ NO. 0 ]
    inject for page type slab at physical address 0x12a355000 [ NO. 1 ]
    +++ start the injection for page-soft-then-hard.conf +++
    inject for page type slab at physical address 0x1255a7000 [ NO. 0 ]

    [root@atompc tests]# cat */results
    cache.conf: no triggers at all
    cache.conf: triggers did not trigger as expected: 2 != 0
    memdb-1.conf: triggers trigger as expected
    memdb-2.conf: triggers trigger as expected
    page-account.conf: triggers trigger as expected
    page-hard.conf: triggers trigger as expected
    page-memdb.conf: triggers trigger as expected
    page-off.conf: triggers trigger as expected
    page-soft.conf: triggers trigger as expected
    page-soft-then-hard.conf: triggers trigger as expected
    page-account.conf: triggers trigger as expected
    page-hard.conf: triggers trigger as expected
    page-soft.conf: triggers trigger as expected
    page-soft-then-hard.conf: triggers trigger as expected
    socket-1.conf: triggers trigger as expected
    socket-2.conf: triggers trigger as expected
    socket-memdb.conf: triggers trigger as expected

----- Xeon -----

    [root@rincon1-ew tests]# rpm -q -a | grep mce
    mcelog-1.0pre3_20120814_2-0.6.el6.x86_64

    [root@rincon1-ew tests]# more /proc/cpuinfo
    processor : 0
    vendor_id : GenuineIntel
    cpu family : 6
    model : 44
    model name : Intel(R) Xeon(R) CPU L5630 @ 2.13GHz
    stepping : 2
    cpu MHz : 1596.000
    cache size : 12288 KB
    physical id : 0
    siblings : 8
    core id : 0
    cpu cores : 4
    apicid : 0
    initial apicid : 0
    fpu : yes
    fpu_exception : yes
    cpuid level : 11
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dts tpr_shadow vnmi flexpriority ept vpid
    bogomips : 4267.09
    clflush size : 64
    cache_alignment : 64
    address sizes : 40 bits physical, 48 bits virtual
    power management:

    processor : 1
    --More--(0%)

    [root@rincon1-ew tests]# make clean
    rm -f */*log
    rm -f */results

    [root@rincon1-ew tests]# make test
    ./test cache ""
    ++++++++++++ running cache test +++++++++++++++++++
    mcelog: no process killed
    ./test page ""
    ++++++++++++ running page test +++++++++++++++++++
    mcelog: no process killed
    ./test memdb ""
    ++++++++++++ running memdb test +++++++++++++++++++
    mcelog: no process killed
    ./test socket ""
    ++++++++++++ running socket test +++++++++++++++++++
    mcelog: no process killed
    ./test pfa ""
    ++++++++++++ running pfa test +++++++++++++++++++
    mcelog: no process killed
    +++ start the injection for page-account.conf +++
    inject for page type slab at physical address 0x160096000 [ NO. 0 ]
    inject for page type slab at physical address 0x160096000 [ NO. 1 ]
    +++ start the injection for page-hard.conf +++
    inject for page type slab at physical address 0x177742000 [ NO. 0 ]
    inject for page type slab at physical address 0x177742000 [ NO. 1 ]
    +++ start the injection for page-soft.conf +++
    inject for page type slab at physical address 0x15556e000 [ NO. 0 ]
    inject for page type slab at physical address 0x15556e000 [ NO. 1 ]
    +++ start the injection for page-soft-then-hard.conf +++
    inject for page type slab at physical address 0x179bfa000 [ NO. 0 ]

    [root@rincon1-ew tests]# cat */results
    cache.conf: triggers trigger as expected
    memdb-1.conf: triggers trigger as expected
    memdb-2.conf: triggers trigger as expected
    page-account.conf: triggers trigger as expected
    page-hard.conf: triggers trigger as expected
    page-memdb.conf: triggers trigger as expected
    page-off.conf: triggers trigger as expected
    page-soft.conf: triggers trigger as expected
    page-soft-then-hard.conf: triggers trigger as expected
    page-account.conf: triggers trigger as expected
    page-hard.conf: triggers trigger as expected
    page-soft.conf: triggers trigger as expected
    page-soft-then-hard.conf: triggers trigger as expected
    socket-1.conf: triggers did not trigger as expected: 2 != 4
    socket-2.conf: triggers did not trigger as expected: 1 != 2
    socket-memdb.conf: triggers did not trigger as expected: 4 != 6

LarryBaker commented 11 years ago

I found that the test errors on the Atom were caused by Hyper-Threading being disabled. mcelog assumes that every /sys/devices/system/cpu/cpuN directory has a cache entry, which does not hold for the offline hyper-threaded CPUs. On CentOS (Red Hat) 6.4, writing 1 to /sys/devices/system/cpu/cpuN/online brings the hyper-threaded processors online, and the mcelog tests then pass.
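To check which CPUs are affected by the assumption described above, something like the following sketch could be used. It is not part of mcelog: the function name `cpu_has_cache_dir` and the idea of passing the sysfs root as a parameter are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Hypothetical helper, not taken from mcelog.
 * Returns 1 if <root>/cpu<N>/cache exists and is a directory,
 * 0 if it does not, and -1 if the path does not fit the buffer.
 */
int cpu_has_cache_dir(const char *root, unsigned cpu)
{
    char path[256];
    struct stat st;
    int n;

    assert(root != NULL);

    n = snprintf(path, sizeof(path), "%s/cpu%u/cache", root, cpu);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    /* A failing stat() means there is no cache entry for this CPU. */
    if (stat(path, &st) != 0)
        return 0;

    return S_ISDIR(st.st_mode) ? 1 : 0;
}
```

On a live system one would call it as `cpu_has_cache_dir("/sys/devices/system/cpu", 1)`. A return value of 0 for a cpuN directory that exists would match the situation reported here.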

I do not know if this is a bug in the tests or a bug in mcelog.

LarryBaker commented 11 years ago

The bug is in cache.c in mcelog. The unchecked call stat(fn, &st); at line 121 should be turned into if (!stat(fn, &st)) {...} free(fn);, so that st is only used when stat() succeeds and fn is freed either way.
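Since the patch itself did not survive posting, here is a minimal sketch of the guard pattern being proposed. It is not the actual mcelog code: the function `cache_dir_links`, its return convention, and the use of the link count as the "work" done inside the guard are all hypothetical stand-ins.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

/*
 * Hypothetical helper, not mcelog's real function.
 * Returns the link count of <cpu_dir>/cache, or -1 if that
 * directory cannot be stat()ed or memory allocation fails.
 */
long cache_dir_links(const char *cpu_dir)
{
    struct stat st;
    long links = -1;
    char *fn;
    int len;

    assert(cpu_dir != NULL);

    /* Build "<cpu_dir>/cache" in a heap buffer of the exact size. */
    len = snprintf(NULL, 0, "%s/cache", cpu_dir);
    if (len < 0)
        return -1;
    fn = malloc((size_t)len + 1);
    if (fn == NULL)
        return -1;
    snprintf(fn, (size_t)len + 1, "%s/cache", cpu_dir);

    /* Use st only when stat() succeeded; otherwise it is indeterminate. */
    if (!stat(fn, &st)) {
        links = (long)st.st_nlink;
    }
    free(fn); /* released on both the success and the failure path */

    return links;
}
```

With this shape, a cpuN directory without a cache entry yields -1 and can be skipped by the caller instead of feeding garbage into later parsing.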

I tried including the patch, but this HTML markup completely screws it up. At least allow for attachments!

zhijianli88 commented 7 years ago

I also got these failures on my Haswell platform, but I haven't figured out why:

    socket-1.conf: triggers did not trigger as expected: 2 != 4
    socket-2.conf: triggers did not trigger as expected: 1 != 2
    socket-memdb.conf: triggers did not trigger as expected: 4 != 6