andikleen / mcelog

Linux kernel machine check handling middleware
http://www.mcelog.org
GNU General Public License v2.0
132 stars 62 forks source link

mcelog --client output nothing when inject error to memory,But the messages can catch the error log. #74

Open xiaochunlee opened 5 years ago

xiaochunlee commented 5 years ago

Lately, We tested RAS function about memory inject error on the Purley platform of lenovo SR630, The OS is RHEL8,kernel version kernel-4.18.0-67.el8, mcelog version is 159. The test steps list as below:

  1. Mount the einj module linux-1rz0:~ # modprobe einj param_extension=1 linux-1rz0:~ #

  2. Start the mcelog daemon linux-1rz0:~ # mcelog --daemon linux-1rz0:~ #

  3. Check whether the einj module loaded successfully linux-1rz0:~ # cd /sys/kernel/debug/apei/einj/ linux-1rz0:/sys/kernel/debug/apei/einj # linux-1rz0:/sys/kernel/debug/apei/einj # ls available_error_type error_inject error_type flags notrigger param1 param2 param3 param4 vendor vendor_flags linux-1rz0:/sys/kernel/debug/apei/einj #

4.Inject uncorrectable error to memory mirror range linux-1rz0:/sys/kernel/debug/apei/einj # echo 0x10 > error_type linux-1rz0:/sys/kernel/debug/apei/einj # echo 0x12345 > param1 linux-1rz0:/sys/kernel/debug/apei/einj # linux-1rz0:/sys/kernel/debug/apei/einj # echo 0xfffffffffffff000 > param2 linux-1rz0:/sys/kernel/debug/apei/einj # linux-1rz0:/sys/kernel/debug/apei/einj # echo 1 > error_inject linux-1rz0:/sys/kernel/debug/apei/einj # linux-1rz0:/sys/kernel/debug/apei/einj # echo 1 > notrigger linux-1rz0:/sys/kernel/debug/apei/einj #

Below is some informations about the outcome:

[root@rhel8-ose-test rastools]# systemctl status mcelog ● mcelog.service - Machine Check Exception Logging Daemon Loaded: loaded (/usr/lib/systemd/system/mcelog.service; enabled; vendor preset: enabled) Active: active (running) since Wed 2019-01-30 02:22:29 EST; 51min ago Main PID: 1177 (mcelog) Tasks: 1 (limit: 26213) Memory: 856.0K CGroup: /system.slice/mcelog.service └─1177 /usr/sbin/mcelog --ignorenodev --daemon --foreground Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: Error enabled Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi_MISC register valid Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi_ADDR register valid Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: SRAR Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCA: Data CACHE Level-0 Data-Read Error Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: STATUS bd80000000100134 MCGSTATUS f Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCGCAP f000c14 APICID 17 SOCKETID 0 Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: PPIN 2f5f92f94c7e6989 Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MICROCODE 2000055 Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: CPUID Vendor Intel Family 6 Model 85

[root@rhel8-ose-test rastools]#tail -f /var/log/dmesg Jan 30 02:29:26 rhel8-ose-test kernel: mce: Uncorrected hardware memory error in user-access at 6696d1040 Jan 30 02:29:26 rhel8-ose-test kernel: mce: [Hardware Error]: Machine check events logged Jan 30 02:29:26 rhel8-ose-test kernel: Memory failure: 0x6696d1: Killing einj_mem_uc:8974 due to hardware memory corruption Jan 30 02:29:26 rhel8-ose-test kernel: Memory failure: 0x6696d1: recovery action for dirty LRU page: Recovered Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: Hardware event. This is not a software error. Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCE 0 Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: CPU 21 BANK 1 TSC 8a4ce5d5aa0 Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: RIP 33:403c4b Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MISC 86 ADDR 6696d1040 Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: TIME 1548833366 Wed Jan 30 02:29:26 2019 Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCG status:RIPV EIPV MCIP LMCE Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi status: Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: Uncorrected error Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: Error enabled Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi_MISC register valid Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi_ADDR register valid Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: SRAR Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCA: Data CACHE Level-0 Data-Read Error Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: STATUS bd80000000100134 MCGSTATUS f Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCGCAP f000c14 APICID 17 SOCKETID 0 Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: PPIN 2f5f92f94c7e6989 Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MICROCODE 2000055 Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: CPUID Vendor Intel Family 6 Model 85

But when we execute the mcelog --client ,there is nothing output. So I research the code about the mcelog and found that it was blocked by mce status bit settings,The partial resource code of the mcelog tool list as below:

125 static int intel_memory_error(struct mce *m, unsigned recordlen) 126 { 127 u32 mca = m->status & 0xffff; 128 if ((mca >> 7) == 1) { 129 unsigned corr_err_cnt = 0; 130 int channel[2] = { (mca & 0xf) == 0xf ? -1 : (int)(mca & 0xf), -1 }; 131 int dimm[2] = { -1, -1 };

xiaochunlee commented 5 years ago

For investigating by UEFI expert, They got a conclusion that the Firmware always handle the error first, So that the OS cant capture that. Such as the handling of PFA threshold and mirror area uncorrected errors. So, I have a question, does the mcelog need to handle that while enabled the firmware first on UEFI? Looking forward to your reply, many thanks!

xiaochunlee commented 5 years ago

Because if the BIOS enable the Firmware First function, the OS can't get the error count correctly, So that the mcelog can't trigger the threshold events that defined by '/etc/mcelog/mcelog.conf ' as usual. The UEFI engineers only want to know how the OS will do when the UEFI handle error first, or whether the UEFI should enable the Firmware First, if enable it, whether it is reasonable. Is there anyone give a help? Appreciate it!

xiaochunlee commented 5 years ago

As we know, The mcelog daemon accounts memory errors, mcelog --client can be used to query the count of corrected and uncorrected errors. The mcelog daemon can also execute triggers when configurable error thresholds are exceeded. This is used to implement a range of automatic predictive failure analysis algorithms: including bad page offlining and automatic cache error handling. User defined actions can be also configured. If the UEFI enable the FIRMWARE_FIRST function(defined by ACPI spec), UEFI handles errors first, not even notifying OS, it result in the thresholds event which defined in ‘/etc/mcelog/mcelog.conf’ can’t be triggered, So I think it's an abnormal behavior.
Anyway, the mcelog daemon accounts inaccurate memory error while test RAS features after enable FIRMWARE_FIRST in UEFI.So if enable FIRMWARE_FIRST in UEFI, when we run the command ‘mcelog –client’ , there is nothing output. After researching the code about the mcelog, it was blocked at MCE status bit checking,The partial resource code of the mcelog tool list as below:

125 static int intel_memory_error(struct mce *m, unsigned recordlen) 126 { 127 u32 mca = m->status & 0xffff; 128 if ((mca >> 7) == 1) { 129 unsigned corr_err_cnt = 0; 130 int channel[2] = { (mca & 0xf) == 0xf ? -1 : (int)(mca & 0xf), -1 }; 131 int dimm[2] = { -1, -1 };

It seems that after enable the FIRMWARE_FIRST, UEFI handle errors first when errors occurred, But This behavior causes a change in the status of the MCA register, so mcelog can’t get the correct information, Ultimately, the user-defined event could not be triggered.

ZSN666 commented 2 years ago

hi [xiaochunlee]: I ran into the same problem,How do you solve it now ?mcelog --client shows nothing but /var/log/mcelog records mce event,finally triggers not invoked。@xiaochunlee

xiaochunlee commented 2 months ago

hi [xiaochunlee]: I ran into the same problem,How do you solve it now ?mcelog --client shows nothing but /var/log/mcelog records mce event,finally triggers not invoked。@xiaochunlee

Command "mcelog --client" shows logs huge possibility due to the MCA status register bit 7(start from bit 0). Eg. In the latest Intel Birch Stream, Sierra Forest CPU, when UEFI don't gather MCA data first into GHES (WHEA table) to notify OS, the MCA status register always is 0xbd80000000070134 in SRAR testing, the bit 7 is 0, so nothing shows up while running "mcelog --client", follow "SFR Registers Specification reversion 1.2", chapter "Machine Check Architecture (MCA) Encoding". bit 0-15 is MCACOD, the 0x0134 means "[CE/UC/SRAR] Data Cache Level 0 - Data Read Error", so it looks everything good, but "mcelog --client" show log prerequisite is bit 7 of MCA status must be 1, I don't know why have this design, @Tony Zhu tony.zhu@intel.com, Could you please help to let me know the really reason or the background? Many thanks! When UEFI does fill the WHEA (Windows Hardware Error Architecture) table and notify OS via GHES (Generic Hardware Error Source) notify type, such as NMI or SCI, the MCA register always be 0xbc0000000000009f, obvious, the bit 7 is 1, so the "mcelog --client" have message output under this situation. Now, In Lenovo ThinkSystem SR650V4 with Intel Birch Stream, Sierra Forest CPU, Lenovo UEFI designs don't follow Intel new design to store error info to Extend log, and drop original design that store error info to WHEA as well, so the MCA status is 0xbd80000000070134 after SRAR testing and no output while running command "mcelog --client". @Tony Zhu tony.zhu@intel.com Is that a normal behavior, do you have any idea on that? Appreciate it!

xiaochunlee commented 2 months ago

I've got more information about those, update as follows. Intel Architecture spec "Table 16-10. IA32_MCi_Status [15:0] Compound Error Code Encoding" says that only Memory Controller Errors and "Extended Memory Errors" (memory of 1-level memory, slow far memory of 2-level memory and fast near memory of 2-level memory) has bit#7 be set to 1. So both EDAC driver in kernel and mcelog in APP have logical judgement of MCi status bit#7, they just parse memory error. In my case, I use SRAR (Software Recoverable Action Required) test case, that triggered DCU error is a "Data CACHE Level-0 Data-Read Error", it belongs to the category of "Cache Hierarchy Errors", not the "Memory Controller Errors" we mentioned. That's why EDAC driver and mcelog show nothing while SRAR injected in. The kernel annotation listed as below. But I'm still confuse that why WHEA be filled by UEFI in FFM (Firmware First Mode) expose to OS MCi Status have changed to 0xbc0000000000009f from 0xbd80000000070134 while SRAR tested, Maybe your UEFI guys or Intel could tell us. -->> /*

aegl commented 2 months ago

The 0xbc0000000000009f status is provided by the Linux function apei_mce_report_mem_error() because the error record sent to Linux in the firmware first case does not include information from the machine check bank.

                   mce_setup(&m);
    m.bank = -1;
    /* Fake a memory read error with unknown channel */
    m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | MCI_STATUS_MISCV | 0x9f;
    m.misc = (MCI_MISC_ADDR_PHYS << 6) | lsb;

    if (severity >= GHES_SEV_RECOVERABLE)
            m.status |= MCI_STATUS_UC;

    if (severity >= GHES_SEV_PANIC) {
            m.status |= MCI_STATUS_PCC;
            m.tsc = rdtsc();
    }

    m.addr = mem_err->physical_addr;

-Tony

From: Xavier Lee @.> Sent: Friday, June 14, 2024 4:15 AM To: andikleen/mcelog @.> Cc: Subscribed @.***> Subject: Re: [andikleen/mcelog] mcelog --client output nothing when inject error to memory,But the messages can catch the error log. (#74)

I've got more information about those, update as follows. Intel Architecture spec "Table 15-10. IA32_MCi_Status [15:0] Compound Error Code Encoding" says that only Memory Controller Errors (memory of 1-level memory, slow far memory of 2-level memory and fast near memory of 2-level memory) has bit#7 be set to 1. So both EDAC driver in kernel and mcelog in APP have logical judgement of MCi status bit#7, they just parse memory error. In my case, I use SRAR (Software Recoverable Action Required) test case, that triggered DCU error is a "Data CACHE Level-0 Data-Read Error", it belongs to the category of "Cache Hierarchy Errors", not the "Memory Controller Errors" we mentioned. That's why EDAC driver and mcelog show nothing while SRAR injected in. The kernel annotation listed as below. But I'm still confuse that why WHEA be filled by UEFI in FFM (Firmware First Mode) expose to OS MCi Status have changed to 0xbc0000000000009f from 0xbd80000000070134 while SRAR tested, Maybe your UEFI guys or Intel could tell us. -->> /*

· 000f 0000 1mmm cccc (binary)

*

· 000f 0010 1mmm cccc (binary) [RAM used as cache]

*

· f = Correction Report Filtering Bit. If 1, subsequent errors

*

· won't be shown

*

· mmm = error type

*

· cccc = channel

*

/

define MCACOD_MEM_ERR_MASK 0xef80

/

— Reply to this email directly, view it on GitHubhttps://github.com/andikleen/mcelog/issues/74#issuecomment-2167803157, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABJRTBMSJJZH23GWMJGAAK3ZHLGBZAVCNFSM6AAAAABJIAJJW6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNRXHAYDGMJVG4. You are receiving this because you are subscribed to this thread.Message ID: @.**@.>>

xiaochunlee commented 2 months ago

@aegl, Thanks for your clarification. Yes, is a fake mca bank number here, and just the CPER section in the Generic Error Status Block sent to OS. But I was just wondering in function apei_mce_report_mem_error(), if set the mca status register MCACOD to 0x009f, and then it notify the decoders that registered on "x86_mce_decoder_chain", include those EDAC notifier and mcelog notifier, so the EDAC and mcelog find that it is a valid memory error via bit#7 in mca status register and start to decode the mce information, but in fact, it is a cache error, such as SRAR DCU test in access application address, the output looks like below. The MCA status register is 0xbd80000000100134 is a "Data CACHE Level-0 Data-Read Error" but become 0xb40000000000009f via GHES notify OS, due to EDAC driver and mcelog be notified and start decoding to "MEMORY CONTROLLER RD_CHANNELunspecified_ERR" for mcelog, and "EDAC MC0: 1 UE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xe16677 offset:0x0 grain:32 - err_code:0000:009f socket:0 imc:0 rank:0 bg:0 ba:3 row:dad3 col:3c0)" for EDAC. Are those interpreted output right or not? I mean other interpreted results except bank number (-1). Actually, it is a Cache Hierarchy Errors, but in the UEFI FFM scennario, kernel treat is as a Memory error, is that correct, do we need to enhance them? Many thanks! @aegl

-->> 2019-08-19T04:17:35.088436-04:00 linux-dgwd mcelog: mcelog server already running 2019-08-19T04:17:35.096032-04:00 linux-dgwd kernel: [ 198.093202] EINJ: Error INJection is initialized. 2019-08-19T04:18:34.201005-04:00 linux-dgwd kernel: [ 257.199597] Disabling lock debugging due to kernel taint 2019-08-19T04:18:34.201256-04:00 linux-dgwd mcelog[2340]: Hardware event. This is not a software error. 2019-08-19T04:18:34.201302-04:00 linux-dgwd kernel: [ 257.199615] mce: Uncorrected hardware memory error in user-access at e16677000 2019-08-19T04:18:34.201304-04:00 linux-dgwd kernel: [ 257.199687] mce: [Hardware Error]: Machine check events logged 2019-08-19T04:18:34.201304-04:00 linux-dgwd kernel: [ 257.199882] Memory failure: 0xe16677: Killing mca-recover:4801 due to hardware memory corruption 2019-08-19T04:18:34.201305-04:00 linux-dgwd kernel: [ 257.199885] Memory failure: 0xe16677: recovery action for dirty LRU page: Recovered 2019-08-19T04:18:34.201505-04:00 linux-dgwd mcelog[2340]: MCE 0 2019-08-19T04:18:34.201595-04:00 linux-dgwd mcelog[2340]: CPU 3 BANK 1 TSC c76107cf84 2019-08-19T04:18:34.201677-04:00 linux-dgwd mcelog[2340]: RIP 33:404c36 2019-08-19T04:18:34.201753-04:00 linux-dgwd mcelog[2340]: MISC 86 ADDR e16677000 2019-08-19T04:18:34.201832-04:00 linux-dgwd mcelog[2340]: TIME 1566202714 Mon Aug 19 04:18:34 2019 2019-08-19T04:18:34.201913-04:00 linux-dgwd mcelog[2340]: MCG status:RIPV EIPV MCIP LMCE 2019-08-19T04:18:34.201990-04:00 linux-dgwd mcelog[2340]: MCi status: 2019-08-19T04:18:34.202066-04:00 linux-dgwd mcelog[2340]: Uncorrected error 2019-08-19T04:18:34.202143-04:00 linux-dgwd mcelog[2340]: Error enabled 2019-08-19T04:18:34.202220-04:00 linux-dgwd mcelog[2340]: MCi_MISC register valid 2019-08-19T04:18:34.202295-04:00 linux-dgwd mcelog[2340]: MCi_ADDR register valid 2019-08-19T04:18:34.202371-04:00 linux-dgwd mcelog[2340]: SRAR 2019-08-19T04:18:34.202449-04:00 linux-dgwd mcelog[2340]: MCA: Data CACHE Level-0 Data-Read Error 2019-08-19T04:18:34.202525-04:00 linux-dgwd mcelog[2340]: STATUS bd80000000100134 MCGSTATUS f 2019-08-19T04:18:34.202599-04:00 linux-dgwd mcelog[2340]: MCGCAP f000c14 APICID 6 SOCKETID 0 2019-08-19T04:18:34.202673-04:00 linux-dgwd mcelog[2340]: PPIN 2e01f1dd8b3e4923 2019-08-19T04:18:34.202748-04:00 linux-dgwd mcelog[2340]: MICROCODE 3000010 2019-08-19T04:18:34.202826-04:00 linux-dgwd mcelog[2340]: CPUID Vendor Intel Family 6 Model 85

From log, we know that we got two interrupts in a row when SRAR injected, the above one is from MCE (int 18) interrupt. EDAC doesn't decode mca info, because it is not a memory error (bit#7 is 0), and "mcelog --client" doesn't either. The below one is NMI interrupt that defined in GHES, there are EDAC decode output, and command "mcelog --client" did show up something.

2019-08-19T04:18:38.274652-04:00 linux-dgwd kernel: [ 261.273226] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 2019-08-19T04:18:38.274661-04:00 linux-dgwd kernel: [ 261.273227] {1}[Hardware Error]: event severity: recoverable 2019-08-19T04:18:38.274663-04:00 linux-dgwd kernel: [ 261.273228] {1}[Hardware Error]: Error 0, type: recoverable 2019-08-19T04:18:38.274664-04:00 linux-dgwd kernel: [ 261.273229] {1}[Hardware Error]: fru_text: Card01, ChnA, DIMM0 2019-08-19T04:18:38.274690-04:00 linux-dgwd kernel: [ 261.273229] {1}[Hardware Error]: section_type: memory error 2019-08-19T04:18:38.274692-04:00 linux-dgwd kernel: [ 261.273230] {1}[Hardware Error]: error_status: 0x0000000000000000 2019-08-19T04:18:38.274694-04:00 linux-dgwd kernel: [ 261.273230] {1}[Hardware Error]: physical_address: 0x0000000e16677000 2019-08-19T04:18:38.274695-04:00 linux-dgwd kernel: [ 261.273232] {1}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 2 device: 0 row: 56265 column: 696 2019-08-19T04:18:38.274698-04:00 linux-dgwd kernel: [ 261.273233] {1}[Hardware Error]: DIMM location: CPU 1 DIMM 5 2019-08-19T04:18:38.274699-04:00 linux-dgwd kernel: [ 261.273240] Memory failure: 0xe16677: already hardware poisoned 2019-08-19T04:18:38.274702-04:00 linux-dgwd kernel: [ 261.273244] mce: [Hardware Error]: Machine check events logged 2019-08-19T04:18:38.274705-04:00 linux-dgwd kernel: [ 261.273256] EDAC skx MC0: HANDLING MCE MEMORY ERROR 2019-08-19T04:18:38.274715-04:00 linux-dgwd kernel: [ 261.273262] EDAC skx MC0: CPU 4: Machine Check Event: 0 Bank 255: b40000000000009f 2019-08-19T04:18:38.274716-04:00 linux-dgwd kernel: [ 261.273262] EDAC skx MC0: TSC 0 2019-08-19T04:18:38.274718-04:00 linux-dgwd kernel: [ 261.273263] EDAC skx MC0: ADDR e16677000 2019-08-19T04:18:38.274719-04:00 linux-dgwd kernel: [ 261.273263] EDAC skx MC0: MISC 0 2019-08-19T04:18:38.274721-04:00 linux-dgwd kernel: [ 261.273264] EDAC skx MC0: PROCESSOR 0:50655 TIME 1566202718 SOCKET 0 APIC 8 2019-08-19T04:18:38.274722-04:00 linux-dgwd kernel: [ 261.273269] EDAC MC0: 1 UE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xe16677 offset:0x0 grain:32 - err_code:0000:009f socket:0 imc:0 rank:0 bg:0 ba:3 row:dad3 col:3c0) 2019-08-19T04:18:38.275994-04:00 linux-dgwd mcelog[2340]: Hardware event. This is not a software error. 2019-08-19T04:18:38.276274-04:00 linux-dgwd mcelog[2340]: MCE 0 2019-08-19T04:18:38.276500-04:00 linux-dgwd mcelog[2340]: CPU 4 ACPI/APEI reported error 2019-08-19T04:18:38.276722-04:00 linux-dgwd mcelog[2340]: ADDR e16677000 2019-08-19T04:18:38.276946-04:00 linux-dgwd mcelog[2340]: TIME 1566202718 Mon Aug 19 04:18:38 2019 2019-08-19T04:18:38.277172-04:00 linux-dgwd mcelog[2340]: MCG status: 2019-08-19T04:18:38.277393-04:00 linux-dgwd mcelog[2340]: MCi status: 2019-08-19T04:18:38.277612-04:00 linux-dgwd mcelog[2340]: Uncorrected error 2019-08-19T04:18:38.277831-04:00 linux-dgwd mcelog[2340]: Error enabled 2019-08-19T04:18:38.278048-04:00 linux-dgwd mcelog[2340]: MCi_ADDR register valid 2019-08-19T04:18:38.278267-04:00 linux-dgwd mcelog[2340]: MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR 2019-08-19T04:18:38.278485-04:00 linux-dgwd mcelog[2340]: Transaction: Memory read error 2019-08-19T04:18:38.278709-04:00 linux-dgwd mcelog[2340]: STATUS b40000000000009f MCGSTATUS 0 2019-08-19T04:18:38.278928-04:00 linux-dgwd mcelog[2340]: MCGCAP f000c14 APICID 8 SOCKETID 0 2019-08-19T04:18:38.279147-04:00 linux-dgwd mcelog[2340]: PPIN 2e01f1dd8b3e4923 2019-08-19T04:18:38.279363-04:00 linux-dgwd mcelog[2340]: MICROCODE 3000010 2019-08-19T04:18:38.279580-04:00 linux-dgwd mcelog[2340]: CPUID Vendor Intel Family 6 Model 85 <<--

aegl commented 2 months ago

It is complicated.

When there is an error from memory (e.g. 2-bit ECC fail). The memory controller logs a UCNA signature in one of the machine check banks, signals CMCI, and returns “Poisoned” data to the requestor.

That poisoned data can be held in different levels of the cache. Sometimes it was just the result of a prefetch or speculative execution., in which case there is no further error action.

If a PCU core tries to consume data marked as poisoned, then the SRAR signature is logged and a machine check signaled. That happens in the L1 caches (instruction or data) so you end up with a “cache” coding in the MCi_STATUS.MCACOD field (0x0150 for instruction, 0x0134 for data).

-Tony

From: Xavier Lee @.> Sent: Friday, June 14, 2024 11:35 PM To: andikleen/mcelog @.> Cc: Luck, Tony @.>; Mention @.> Subject: Re: [andikleen/mcelog] mcelog --client output nothing when inject error to memory,But the messages can catch the error log. (#74)

@aeglhttps://github.com/aegl, Thanks for your clarification. Yes, is a fake mca bank number here, and just the CPER section in the Generic Error Status Block sent to OS. But I was just wondering in function apei_mce_report_mem_error(), if set the mca status register MCACOD to 0x009f, and then it notify the decoders that registered on "x86_mce_decoder_chain", include those EDAC notifier and mcelog notifier, so the EDAC and mcelog find that it is a valid memory error via bit#7 in mca status register and start to decode the mce information, but in fact, it is a cache error, such as SRAR DCU test in access application address, the output looks like below. The MCA status register is 0xbd80000000100134 is a "Data CACHE Level-0 Data-Read Error" but become 0xb40000000000009f via GHES notify OS, due to EDAC driver and mcelog be notified and start decoding to "MEMORY CONTROLLER RD_CHANNELunspecified_ERR" for mcelog, and "DAC MC0: 1 UE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xe16677 offset:0x0 grain:32 - err_code:0000:009f socket:0 imc:0 rank:0 bg:0 ba:3 row:dad3 col:3c0)" for EDAC. do those decode is right or not? Actually, it is a Cache Hierarchy Errors, but in the UEFI FFM scennario, kernel treat is as a Memory error, is that right again, do we need to improve it? Many thanks!

-->> 2019-08-19T04:17:35.088436-04:00 linux-dgwd mcelog: mcelog server already running 2019-08-19T04:17:35.096032-04:00 linux-dgwd kernel: [ 198.093202] EINJ: Error INJection is initialized. 2019-08-19T04:18:34.201005-04:00 linux-dgwd kernel: [ 257.199597] Disabling lock debugging due to kernel taint 2019-08-19T04:18:34.201256-04:00 linux-dgwd mcelog[2340]: Hardware event. This is not a software error. 2019-08-19T04:18:34.201302-04:00 linux-dgwd kernel: [ 257.199615] mce: Uncorrected hardware memory error in user-access at e16677000 2019-08-19T04:18:34.201304-04:00 linux-dgwd kernel: [ 257.199687] mce: [Hardware Error]: Machine check events logged 2019-08-19T04:18:34.201304-04:00 linux-dgwd kernel: [ 257.199882] Memory failure: 0xe16677: Killing mca-recover:4801 due to hardware memory corruption 2019-08-19T04:18:34.201305-04:00 linux-dgwd kernel: [ 257.199885] Memory failure: 0xe16677: recovery action for dirty LRU page: Recovered 2019-08-19T04:18:34.201505-04:00 linux-dgwd mcelog[2340]: MCE 0 2019-08-19T04:18:34.201595-04:00 linux-dgwd mcelog[2340]: CPU 3 BANK 1 TSC c76107cf84 2019-08-19T04:18:34.201677-04:00 linux-dgwd mcelog[2340]: RIP 33:404c36 2019-08-19T04:18:34.201753-04:00 linux-dgwd mcelog[2340]: MISC 86 ADDR e16677000 2019-08-19T04:18:34.201832-04:00 linux-dgwd mcelog[2340]: TIME 1566202714 Mon Aug 19 04:18:34 2019 2019-08-19T04:18:34.201913-04:00 linux-dgwd mcelog[2340]: MCG status:RIPV EIPV MCIP LMCE 2019-08-19T04:18:34.201990-04:00 linux-dgwd mcelog[2340]: MCi status: 2019-08-19T04:18:34.202066-04:00 linux-dgwd mcelog[2340]: Uncorrected error 2019-08-19T04:18:34.202143-04:00 linux-dgwd mcelog[2340]: Error enabled 2019-08-19T04:18:34.202220-04:00 linux-dgwd mcelog[2340]: MCi_MISC register valid 2019-08-19T04:18:34.202295-04:00 linux-dgwd mcelog[2340]: MCi_ADDR register valid 2019-08-19T04:18:34.202371-04:00 linux-dgwd mcelog[2340]: SRAR 2019-08-19T04:18:34.202449-04:00 linux-dgwd mcelog[2340]: MCA: Data CACHE Level-0 Data-Read Error 2019-08-19T04:18:34.202525-04:00 linux-dgwd mcelog[2340]: STATUS bd80000000100134 MCGSTATUS f 2019-08-19T04:18:34.202599-04:00 linux-dgwd mcelog[2340]: MCGCAP f000c14 APICID 6 SOCKETID 0 2019-08-19T04:18:34.202673-04:00 linux-dgwd mcelog[2340]: PPIN 2e01f1dd8b3e4923 2019-08-19T04:18:34.202748-04:00 linux-dgwd mcelog[2340]: MICROCODE 3000010 2019-08-19T04:18:34.202826-04:00 linux-dgwd mcelog[2340]: CPUID Vendor Intel Family 6 Model 85

From log, we know that we got two interrupts in a row when SRAR injected, the above one is from MCE (int 18) interrupt. EDAC doesn't decode mca info, because it is not a memory error (bit#7 is 0), and "mcelog --client" doesn't either. The below one is NMI interrupt that defined in GHES, there are EDAC decode output, and command "mcelog --client" show up something.

2019-08-19T04:18:38.274652-04:00 linux-dgwd kernel: [ 261.273226] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 2019-08-19T04:18:38.274661-04:00 linux-dgwd kernel: [ 261.273227] {1}[Hardware Error]: event severity: recoverable 2019-08-19T04:18:38.274663-04:00 linux-dgwd kernel: [ 261.273228] {1}[Hardware Error]: Error 0, type: recoverable 2019-08-19T04:18:38.274664-04:00 linux-dgwd kernel: [ 261.273229] {1}[Hardware Error]: fru_text: Card01, ChnA, DIMM0 2019-08-19T04:18:38.274690-04:00 linux-dgwd kernel: [ 261.273229] {1}[Hardware Error]: section_type: memory error 2019-08-19T04:18:38.274692-04:00 linux-dgwd kernel: [ 261.273230] {1}[Hardware Error]: error_status: 0x0000000000000000 2019-08-19T04:18:38.274694-04:00 linux-dgwd kernel: [ 261.273230] {1}[Hardware Error]: physical_address: 0x0000000e16677000 2019-08-19T04:18:38.274695-04:00 linux-dgwd kernel: [ 261.273232] {1}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 2 device: 0 row: 56265 column: 696 2019-08-19T04:18:38.274698-04:00 linux-dgwd kernel: [ 261.273233] {1}[Hardware Error]: DIMM location: CPU 1 DIMM 5 2019-08-19T04:18:38.274699-04:00 linux-dgwd kernel: [ 261.273240] Memory failure: 0xe16677: already hardware poisoned 2019-08-19T04:18:38.274702-04:00 linux-dgwd kernel: [ 261.273244] mce: [Hardware Error]: Machine check events logged 2019-08-19T04:18:38.274705-04:00 linux-dgwd kernel: [ 261.273256] EDAC skx MC0: HANDLING MCE MEMORY ERROR 2019-08-19T04:18:38.274715-04:00 linux-dgwd kernel: [ 261.273262] EDAC skx MC0: CPU 4: Machine Check Event: 0 Bank 255: b40000000000009f 2019-08-19T04:18:38.274716-04:00 linux-dgwd kernel: [ 261.273262] EDAC skx MC0: TSC 0 2019-08-19T04:18:38.274718-04:00 linux-dgwd kernel: [ 261.273263] EDAC skx MC0: ADDR e16677000 2019-08-19T04:18:38.274719-04:00 linux-dgwd kernel: [ 261.273263] EDAC skx MC0: MISC 0 2019-08-19T04:18:38.274721-04:00 linux-dgwd kernel: [ 261.273264] EDAC skx MC0: PROCESSOR 0:50655 TIME 1566202718 SOCKET 0 APIC 8 2019-08-19T04:18:38.274722-04:00 linux-dgwd kernel: [ 261.273269] EDAC MC0: 1 UE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xe16677 offset:0x0 grain:32 - err_code:0000:009f socket:0 imc:0 rank:0 bg:0 ba:3 row:dad3 col:3c0) 2019-08-19T04:18:38.275994-04:00 linux-dgwd mcelog[2340]: Hardware event. This is not a software error. 2019-08-19T04:18:38.276274-04:00 linux-dgwd mcelog[2340]: MCE 0 2019-08-19T04:18:38.276500-04:00 linux-dgwd mcelog[2340]: CPU 4 ACPI/APEI reported error 2019-08-19T04:18:38.276722-04:00 linux-dgwd mcelog[2340]: ADDR e16677000 2019-08-19T04:18:38.276946-04:00 linux-dgwd mcelog[2340]: TIME 1566202718 Mon Aug 19 04:18:38 2019 2019-08-19T04:18:38.277172-04:00 linux-dgwd mcelog[2340]: MCG status: 2019-08-19T04:18:38.277393-04:00 linux-dgwd mcelog[2340]: MCi status: 2019-08-19T04:18:38.277612-04:00 linux-dgwd mcelog[2340]: Uncorrected error 2019-08-19T04:18:38.277831-04:00 linux-dgwd mcelog[2340]: Error enabled 2019-08-19T04:18:38.278048-04:00 linux-dgwd mcelog[2340]: MCi_ADDR register valid 2019-08-19T04:18:38.278267-04:00 linux-dgwd mcelog[2340]: MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR 2019-08-19T04:18:38.278485-04:00 linux-dgwd mcelog[2340]: Transaction: Memory read error 2019-08-19T04:18:38.278709-04:00 linux-dgwd mcelog[2340]: STATUS b40000000000009f MCGSTATUS 0 2019-08-19T04:18:38.278928-04:00 linux-dgwd mcelog[2340]: MCGCAP f000c14 APICID 8 SOCKETID 0 2019-08-19T04:18:38.279147-04:00 linux-dgwd mcelog[2340]: PPIN 2e01f1dd8b3e4923 2019-08-19T04:18:38.279363-04:00 linux-dgwd mcelog[2340]: MICROCODE 3000010 2019-08-19T04:18:38.279580-04:00 linux-dgwd mcelog[2340]: CPUID Vendor Intel Family 6 Model 85 <<--

— Reply to this email directly, view it on GitHubhttps://github.com/andikleen/mcelog/issues/74#issuecomment-2169161939, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABJRTBN6H5UXF7NA2GFAOYDZHPOBBAVCNFSM6AAAAABJIAJJW6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNRZGE3DCOJTHE. You are receiving this because you were mentioned.Message ID: @.**@.>>

xiaochunlee commented 2 months ago

Thanks for your patient explanation. so it sounds normal of EDAC decoder and "mcelog --client" show nothing in SRAR test case with no UEFI involved. Whereas, with UEFI help to collect data and notify OS via GHES table with a fake mca bank , the EDAC decoder and "mcelog --client" recorded data and output. it is normal as well. Is my understanding right? many thanks!