NearNodeFlash / NearNodeFlash.github.io


EAS1 Rabbit Parity Error after heavy Lustre IO job #30

Open ajfloeder opened 1 year ago

ajfloeder commented 1 year ago

Issue seen during one of a handful of IOR runs to a Lustre ephemeral filesystem.

16:34:56 [ 5242.566919] EDAC PCI: Signaled System Error on 0000:0d:00.0
16:34:56 [ 5242.572795] EDAC PCI: Master Data Parity Error on 0000:0d:00.0
16:34:56 [ 5242.577377] EDAC PCI: Detected Parity Error on 0000:0d:00.0
16:34:56 [ 5242.581573] Kernel panic - not syncing: EDAC: PCI Parity Error
16:34:56 [ 5242.585351] CPU: 27 PID: 58520 Comm: kworker/u256:1 Kdump: loaded Tainted: P        W  OE  X --------- -  - 4.18.0-425.10.1.1toss.t4.x86_64 #1
16:34:56 [ 5242.594470] Hardware name: HPE HPE_Cray_EX4nnn/HPE Cray EX4nnn, BIOS 0.1.1_with-setup-menu-access 11-09-2021
16:34:56 [ 5242.600730] Workqueue: edac-poller edac_pci_workq_function
16:34:56 [ 5242.604559] Call Trace:
16:34:56 [ 5242.606405]  dump_stack+0x41/0x60
16:34:56 [ 5242.608885]  panic+0xe7/0x2ac
16:34:56 [ 5242.611003]  edac_pci_do_parity_check.part.5.cold.7+0xc/0xc
16:34:56 [ 5242.614956]  edac_pci_workq_function+0x62/0x80
16:34:56 [ 5242.618047]  process_one_work+0x1ae/0x3a0
16:34:56 [ 5242.621363]  worker_thread+0x3c/0x3c0
16:34:56 [ 5242.624014]  ? create_worker+0x1a0/0x1a0
16:34:56 [ 5242.626833]  kthread+0x124/0x140
16:34:56 [ 5242.629251]  ? set_kthread_struct+0x50/0x50
16:34:56 [ 5242.632301]  ret_from_fork+0x35/0x40
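
For context on the three EDAC PCI lines above: in the Linux EDAC PCI poller, each message corresponds to one bit of the device's 16-bit PCI Status register. The following is a minimal sketch of that decoding, assuming the mask values from the kernel header `include/uapi/linux/pci_regs.h`. The function name `decode_edac_status` and the example status value are illustrative and do not come from this report.

```python
# Bit masks as defined in the Linux kernel header include/uapi/linux/pci_regs.h.
PCI_STATUS_PARITY = 0x0100            # reported as "Master Data Parity Error"
PCI_STATUS_SIG_SYSTEM_ERROR = 0x4000  # reported as "Signaled System Error"
PCI_STATUS_DETECTED_PARITY = 0x8000   # reported as "Detected Parity Error"

# Order mirrors the order in which the messages appear in the console log.
EDAC_MESSAGES = [
    (PCI_STATUS_SIG_SYSTEM_ERROR, "Signaled System Error"),
    (PCI_STATUS_PARITY, "Master Data Parity Error"),
    (PCI_STATUS_DETECTED_PARITY, "Detected Parity Error"),
]


def decode_edac_status(status):
    """Return the EDAC PCI messages implied by a 16-bit PCI Status value."""
    return [text for mask, text in EDAC_MESSAGES if status & mask]


if __name__ == "__main__":
    # Hypothetical status value with all three error bits set.
    for message in decode_edac_status(0xC100):
        print(message)
```

Under this reading, the first panic had all three bits set on 0000:0d:00.0.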

It may be relevant that new memory was recently installed in the machine.

ajfloeder commented 1 year ago

@behlendorf can you tell me what device 0000:0d:00.0 is on the Rabbit? I believe lspci would show that information.
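
A minimal sketch of that lookup, assuming the `lspci` utility from pciutils is installed on the node. The helper names `lspci_command` and `describe_device` are illustrative. Some capability fields are typically readable only as root.

```python
import shutil
import subprocess


def lspci_command(bdf):
    """Build the lspci argument list for one device address."""
    # -vvv requests the most verbose output; -s selects a single device
    # by its [domain:]bus:device.function address.
    return ["lspci", "-vvv", "-s", bdf]


def describe_device(bdf):
    """Run lspci for bdf and return its output, or None if lspci is missing."""
    if shutil.which("lspci") is None:
        return None
    result = subprocess.run(
        lspci_command(bdf), capture_output=True, text=True, check=False
    )
    return result.stdout


if __name__ == "__main__":
    # Address taken from the first panic in this issue.
    print(describe_device("0000:0d:00.0"))
```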

behlendorf commented 1 year ago

0d:00.0 Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller CD7 (prog-if 02 [NVM Express])
        Subsystem: KIOXIA Corporation Device 0110
        Physical Slot: 9-1
        Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        NUMA node: 3
        IOMMU group: 64
        Region 0: Memory at f9000000 (64-bit, non-prefetchable) [disabled] [size=32K]
        Capabilities: [80] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 <8us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 32GT/s, Width x4, ASPM not supported
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed unknown (downgraded), Width x0 (downgraded)
                        TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [d0] MSI-X: Enable+ Count=32 Masked-
                Vector table: BAR=0 offset=00002000
                PBA: BAR=0 offset=00004000
        Capabilities: [100 v1] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
        Kernel driver in use: nvme
        Kernel modules: nvme

ajfloeder commented 1 year ago

Similar kernel panic seen again:

2023-02-22 17:31:12 [16408.832385] EDAC PCI: Master Data Parity Error on 0000:06:00.0
2023-02-22 17:31:12 [16408.836974] EDAC PCI: Detected Parity Error on 0000:06:00.0
2023-02-22 17:31:12 [16408.844290] Kernel panic - not syncing: EDAC: PCI Parity Error
2023-02-22 17:31:12 [16408.847965] CPU: 62 PID: 463190 Comm: kworker/u256:1 Kdump: loaded Tainted: P        W  OE  X --------- -  - 4.18.0-425.10.1.1toss.t4.x86_64 #1
2023-02-22 17:31:12 [16408.856029] Hardware name: HPE HPE_Cray_EX4nnn/HPE Cray EX4nnn, BIOS 0.1.1_with-setup-menu-access 11-09-2021
2023-02-22 17:31:12 [16408.862621] Workqueue: edac-poller edac_pci_workq_function
2023-02-22 17:31:12 [16408.868494] Call Trace:
2023-02-22 17:31:12 [16408.870403]  dump_stack+0x41/0x60
2023-02-22 17:31:12 [16408.873407]  panic+0xe7/0x2ac
2023-02-22 17:31:12 [16408.876508]  edac_pci_do_parity_check.part.5.cold.7+0xc/0xc
2023-02-22 17:31:12 [16408.881147]  edac_pci_workq_function+0x62/0x80
2023-02-22 17:31:12 [16408.883979]  process_one_work+0x1ae/0x3a0
2023-02-22 17:31:12 [16408.886761]  worker_thread+0x3c/0x3c0
2023-02-22 17:31:12 [16408.889327]  ? create_worker+0x1a0/0x1a0
2023-02-22 17:31:12 [16408.891740]  kthread+0x124/0x140
2023-02-22 17:31:12 [16408.893929]  ? set_kthread_struct+0x50/0x50
2023-02-22 17:31:12 [16408.896922]  ret_from_fork+0x35/0x40

behlendorf commented 1 year ago

And again,

2023-02-23 20:23:35 [26111.179143] EDAC PCI: Signaled System Error on 0000:0b:00.0
2023-02-23 20:23:35 [26111.182950] EDAC PCI: Master Data Parity Error on 0000:0b:00.0
2023-02-23 20:23:35 [26111.187277] EDAC PCI: Detected Parity Error on 0000:0b:00.0
2023-02-23 20:23:35 [26111.192107] Kernel panic - not syncing: EDAC: PCI Parity Error
2023-02-23 20:23:35 [26111.193094] CPU: 65 PID: 1088687 Comm: kworker/u256:1 Kdump: loaded Tainted: P        W  OE  X --------- -  - 4.18.0-425.10.1.1toss.t4.x86_64 #1
2023-02-23 20:23:35 [26111.193094] Hardware name: HPE HPE_Cray_EX4nnn/HPE Cray EX4nnn, BIOS 0.1.1_with-setup-menu-access 11-09-2021
2023-02-23 20:23:35 [26111.193094] Workqueue: edac-poller edac_pci_workq_function
2023-02-23 20:23:35 [26111.193094] Call Trace:
2023-02-23 20:23:35 [26111.193094]  dump_stack+0x41/0x60
2023-02-23 20:23:35 [26111.193094]  panic+0xe7/0x2ac
2023-02-23 20:23:35 [26111.193094]  edac_pci_do_parity_check.part.5.cold.7+0xc/0xc
2023-02-23 20:23:35 [26111.193094]  edac_pci_workq_function+0x62/0x80
2023-02-23 20:23:35 [26111.193094]  process_one_work+0x1ae/0x3a0
2023-02-23 20:23:35 [26111.193094]  worker_thread+0x3c/0x3c0
2023-02-23 20:23:35 [26111.193094]  ? create_worker+0x1a0/0x1a0
2023-02-23 20:23:35 [26111.193094]  kthread+0x124/0x140
2023-02-23 20:23:35 [26111.193094]  ? set_kthread_struct+0x50/0x50
2023-02-23 20:23:35 [26111.193094]  ret_from_fork+0x35/0x40
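
The three panics in this thread name three different devices: 0000:0d:00.0, 0000:06:00.0, and 0000:0b:00.0. A minimal sketch for tallying EDAC PCI messages per device from a captured console log, assuming the line format shown above. The function name `tally_edac_errors` and the regular expression are illustrative, not part of any existing tool.

```python
import re
from collections import defaultdict

# Matches e.g. "EDAC PCI: Master Data Parity Error on 0000:0d:00.0".
# The "Kernel panic ... EDAC: PCI Parity Error" line is intentionally not
# matched because it carries no device address.
EDAC_LINE = re.compile(
    r"EDAC PCI: (?P<error>.+?) on "
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
)


def tally_edac_errors(lines):
    """Group EDAC PCI error messages by PCI address (domain:bus:dev.fn)."""
    errors = defaultdict(list)
    for line in lines:
        match = EDAC_LINE.search(line)
        if match:
            errors[match.group("bdf")].append(match.group("error"))
    return dict(errors)


if __name__ == "__main__":
    # Sample lines copied from the console output in this issue.
    sample = [
        "16:34:56 [ 5242.566919] EDAC PCI: Signaled System Error on 0000:0d:00.0",
        "16:34:56 [ 5242.581573] Kernel panic - not syncing: EDAC: PCI Parity Error",
        "2023-02-22 17:31:12 [16408.832385] EDAC PCI: Master Data Parity Error on 0000:06:00.0",
        "2023-02-23 20:23:35 [26111.187277] EDAC PCI: Detected Parity Error on 0000:0b:00.0",
    ]
    for bdf, messages in sorted(tally_edac_errors(sample).items()):
        print(bdf, messages)
```

Feeding the full console log through such a tally would show whether the errors cluster on a single drive or are spread across several.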