Open ajfloeder opened 1 year ago
@behlendorf can you tell me what device 0000:0d:00.0 is on the Rabbit? I believe lspci would show that information.
0d:00.0 Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller CD7 (prog-if 02 [NVM Express])
Subsystem: KIOXIA Corporation Device 0110
Physical Slot: 9-1
Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
NUMA node: 3
IOMMU group: 64
Region 0: Memory at f9000000 (64-bit, non-prefetchable) [disabled] [size=32K]
Capabilities: [80] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 <8us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 32GT/s, Width x4, ASPM not supported
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed unknown (downgraded), Width x0 (downgraded)
TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [d0] MSI-X: Enable+ Count=32 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00004000
Capabilities: [100 v1] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
Kernel driver in use: nvme
Kernel modules: nvme
Similar kernel panic seen again:
2023-02-22 17:31:12 [16408.832385] EDAC PCI: Master Data Parity Error on 0000:06:00.0
2023-02-22 17:31:12 [16408.836974] EDAC PCI: Detected Parity Error on 0000:06:00.0
2023-02-22 17:31:12 [16408.844290] Kernel panic - not syncing: EDAC: PCI Parity Error
2023-02-22 17:31:12 [16408.847965] CPU: 62 PID: 463190 Comm: kworker/u256:1 Kdump: loaded Tainted: P W OE X --------- - - 4.18.0-425.10.1.1toss.t4.x86_64 #1
2023-02-22 17:31:12 [16408.856029] Hardware name: HPE HPE_Cray_EX4nnn/HPE Cray EX4nnn, BIOS 0.1.1_with-setup-menu-access 11-09-2021
2023-02-22 17:31:12 [16408.862621] Workqueue: edac-poller edac_pci_workq_function
2023-02-22 17:31:12 [16408.868494] Call Trace:
2023-02-22 17:31:12 [16408.870403] dump_stack+0x41/0x60
2023-02-22 17:31:12 [16408.873407] panic+0xe7/0x2ac
2023-02-22 17:31:12 [16408.876508] edac_pci_do_parity_check.part.5.cold.7+0xc/0xc
2023-02-22 17:31:12 [16408.881147] edac_pci_workq_function+0x62/0x80
2023-02-22 17:31:12 [16408.883979] process_one_work+0x1ae/0x3a0
2023-02-22 17:31:12 [16408.886761] worker_thread+0x3c/0x3c0
2023-02-22 17:31:12 [16408.889327] ? create_worker+0x1a0/0x1a0
2023-02-22 17:31:12 [16408.891740] kthread+0x124/0x140
2023-02-22 17:31:12 [16408.893929] ? set_kthread_struct+0x50/0x50
2023-02-22 17:31:12 [16408.896922] ret_from_fork+0x35/0x40
And again,
2023-02-23 20:23:35 [26111.179143] EDAC PCI: Signaled System Error on 0000:0b:00.0
2023-02-23 20:23:35 [26111.182950] EDAC PCI: Master Data Parity Error on 0000:0b:00.0
2023-02-23 20:23:35 [26111.187277] EDAC PCI: Detected Parity Error on 0000:0b:00.0
2023-02-23 20:23:35 [26111.192107] Kernel panic - not syncing: EDAC: PCI Parity Error
2023-02-23 20:23:35 [26111.193094] CPU: 65 PID: 1088687 Comm: kworker/u256:1 Kdump: loaded Tainted: P W OE X --------- - - 4.18.0-425.10.1.1toss.t4.x86_64 #1
2023-02-23 20:23:35 [26111.193094] Hardware name: HPE HPE_Cray_EX4nnn/HPE Cray EX4nnn, BIOS 0.1.1_with-setup-menu-access 11-09-2021
2023-02-23 20:23:35 [26111.193094] Workqueue: edac-poller edac_pci_workq_function
2023-02-23 20:23:35 [26111.193094] Call Trace:
2023-02-23 20:23:35 [26111.193094] dump_stack+0x41/0x60
2023-02-23 20:23:35 [26111.193094] panic+0xe7/0x2ac
2023-02-23 20:23:35 [26111.193094] edac_pci_do_parity_check.part.5.cold.7+0xc/0xc
2023-02-23 20:23:35 [26111.193094] edac_pci_workq_function+0x62/0x80
2023-02-23 20:23:35 [26111.193094] process_one_work+0x1ae/0x3a0
2023-02-23 20:23:35 [26111.193094] worker_thread+0x3c/0x3c0
2023-02-23 20:23:35 [26111.193094] ? create_worker+0x1a0/0x1a0
2023-02-23 20:23:35 [26111.193094] kthread+0x124/0x140
2023-02-23 20:23:35 [26111.193094] ? set_kthread_struct+0x50/0x50
2023-02-23 20:23:35 [26111.193094] ret_from_fork+0x35/0x40
The issue was seen during one of a handful of IOR runs against an ephemeral Lustre filesystem.
Note that new memory was recently installed in the machine.
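Since the panics so far name different devices (0000:06:00.0 and 0000:0b:00.0), it may be useful to tally the EDAC messages per PCI address across the console logs. A minimal sketch, assuming the log lines keep the `EDAC PCI: <error> on <address>` shape shown above; the function name `tally_edac_pci_errors` is hypothetical and the sample lines are copied from this thread.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... EDAC PCI: Master Data Parity Error on 0000:06:00.0"
# The "Kernel panic - not syncing: EDAC: PCI Parity Error" line is not matched,
# because it carries no device address.
EDAC_PCI_RE = re.compile(
    r"EDAC PCI: (?P<error>.+?) on "
    r"(?P<addr>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
)

def tally_edac_pci_errors(lines):
    """Count (PCI address, error type) pairs reported by the EDAC PCI poller."""
    counts = Counter()
    for line in lines:
        match = EDAC_PCI_RE.search(line)
        if match:
            counts[(match.group("addr"), match.group("error"))] += 1
    return counts

sample = [
    "2023-02-22 17:31:12 [16408.832385] EDAC PCI: Master Data Parity Error on 0000:06:00.0",
    "2023-02-22 17:31:12 [16408.836974] EDAC PCI: Detected Parity Error on 0000:06:00.0",
    "2023-02-22 17:31:12 [16408.844290] Kernel panic - not syncing: EDAC: PCI Parity Error",
    "2023-02-23 20:23:35 [26111.179143] EDAC PCI: Signaled System Error on 0000:0b:00.0",
    "2023-02-23 20:23:35 [26111.182950] EDAC PCI: Master Data Parity Error on 0000:0b:00.0",
    "2023-02-23 20:23:35 [26111.187277] EDAC PCI: Detected Parity Error on 0000:0b:00.0",
]

counts = tally_edac_pci_errors(sample)
for (addr, error), n in sorted(counts.items()):
    print(f"{addr}  {error}: {n}")
```

Each address can then be looked up with lspci to see which physical device it corresponds to.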