Open ajfloeder opened 1 year ago
Observed again this time with using xfs and a copy_in
workflow.
Waiting for a new debug BIOS to be built.
When hitting the issue with the the debug BIOS. Partial log.
pcieport 0000:90:05.1: DPC: containment event, status:0x1f01 source:0x0000
pcieport 0000:90:05.1: DPC: unmasked uncorrectable error detected
pcieport 0000:90:05.1: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
pcieport 0000:90:05.1: device [1022:1483] error status/mask=00004000/04000000
pcieport 0000:90:05.1: [14] CmpltTO (First)
Waiting for a new debug BIOS to be built.
New debug BIOS was built and installed prior to the latest occurrence yesterday. The debug BIOS disables platform handling of errors and passes them to the OS for handling.
@behlendorf a couple of questions:
What are the steps required to reproduce this error?
We've now been able to reproduce this by loading up the system with single compute node lustre (or xfs) workflows and letting flux schedule them to the available compute nodes. For example, #DW jobdw capacity=1TiB type=xfs name=test1
. We've also seen this without the copy_in
directive being specified.
How many different compute nodes has this been seen on?
Both of our BP nodes:
console.hetchy28-20230224.gz:2023-02-23 15:48:26 [96572.596028] pcieport 0000:90:05.1: PCIe Bus Error: severity=Uncorrected (Fata
console.hetchy29-20230223.gz:2023-02-22 15:42:46 [ 9837.313218] pcieport 0000:90:05.1: PCIe Bus Error: severity=Uncorrected (Fata
If more than 1, have the compute nodes been moved at all?
No.
The issue was seen again 3/27
Snip from the log at the point of failure where downstream port containment (DPC) engages:
2023-03-24 14:12:01 [805811.477503] XFS (dm-1): Ending clean mount 2023-03-24 14:12:02 [805812.342919] igb 0000:94:00.0 ens2: PCIe link lost 2023-03-24 14:12:02 [805812.342945] pcieport 0000:90:05.1: DPC: containment event, status:0x1f01 sou 2023-03-24 14:12:02 ^[[23;80Hurce:0x0000 2023-03-24 14:12:02 [805812.353069] pcieport 0000:90:05.1: DPC: unmasked uncorrectable error detecte 2023-03-24 14:12:02 ^[[23;80Hed^H 2023-03-24 14:12:02 [805812.358049] pcieport 0000:90:05.1: PCIe Bus Error: severity=Uncorrected (Fat 2023-03-24 14:12:02 ^[[23;80Htal), type=Transaction Layer, (Requester ID) 2023-03-24 14:12:02 [805812.366555] pcieport 0000:90:05.1: device [1022:1483] error status/mask=00 2023-03-24 14:12:02 ^[[23;80H0004000/04000000 2023-03-24 14:12:02 [805812.371508] pcieport 0000:90:05.1: [14] CmpltTO (First)
Sporadically, the compute node can lock up with nothing in the console logs for the node. It is unresponsive and requires a power cycle.
Need to look into hardware error handling setting in the BIOS to allow error to propagate to the OS rather than be trapped by the platform.