NearNodeFlash / NearNodeFlash.github.io

Compute node receives PCIe timeout #31

Open ajfloeder opened 1 year ago

ajfloeder commented 1 year ago

Sporadically, the compute node can lock up with nothing in the console logs for the node. It is unresponsive and requires a power cycle.

Need to look into the hardware error handling settings in the BIOS to allow errors to propagate to the OS rather than be trapped by the platform.
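
As a reference point, here is a minimal sketch (not taken from this issue) of how one might check from the OS side whether PCIe errors are reaching the kernel, assuming a kernel new enough to expose the aer_dev_fatal sysfs counters; the pcie_ports=native boot parameter check is likewise just an assumption about how native control might be requested here.

```python
#!/usr/bin/env python3
# Minimal sketch: report whether native PCIe port control was requested on the
# kernel command line, and dump any ports whose AER fatal-error counters are
# nonzero. Assumes a kernel that exposes aer_dev_fatal on PCIe ports.
from pathlib import Path

cmdline = Path("/proc/cmdline").read_text()
print("pcie_ports=native on kernel cmdline:", "pcie_ports=native" in cmdline)

for dev in sorted(Path("/sys/bus/pci/devices").iterdir()):
    fatal = dev / "aer_dev_fatal"
    if not fatal.exists():
        continue
    counts = fatal.read_text()
    # Each line is "<error name> <count>"; show ports that recorded anything.
    if any(line.split()[-1] != "0" for line in counts.splitlines() if line.strip()):
        print(f"--- {dev.name} ---")
        print(counts, end="")
```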

behlendorf commented 1 year ago

Observed again, this time using xfs and a copy_in workflow.

ajfloeder commented 1 year ago

Waiting for a new debug BIOS to be built.

behlendorf commented 1 year ago

Hit the issue again with the debug BIOS installed. Partial log:

pcieport 0000:90:05.1: DPC: containment event, status:0x1f01 source:0x0000 
pcieport 0000:90:05.1: DPC: unmasked uncorrectable error detected
pcieport 0000:90:05.1: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
pcieport 0000:90:05.1:   device [1022:1483] error status/mask=00004000/04000000
pcieport 0000:90:05.1:    [14] CmpltTO                (First)
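
The "[14] CmpltTO (First)" line already names the failing bit, but as a sanity check, a small decoder for the status value printed above (error status/mask=00004000/04000000) confirms it. Bit positions follow the PCIe Uncorrectable Error Status register layout; only the common bits are listed.

```python
# Decode the AER Uncorrectable Error Status value from the dmesg line above.
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}

def decode(status: int) -> list[str]:
    return [name for bit, name in UNCORRECTABLE_BITS.items() if status & (1 << bit)]

print(decode(0x00004000))  # -> ['Completion Timeout'], matching "[14] CmpltTO"
```
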
ajfloeder commented 1 year ago

> Waiting for a new debug BIOS to be built.

New debug BIOS was built and installed prior to the latest occurrence yesterday. The debug BIOS disables platform handling of errors and passes them to the OS for handling.

ajfloeder commented 1 year ago

@behlendorf, a couple of questions:

  1. What are the steps required to reproduce this error?
  2. How many different compute nodes has this been seen on?
  3. If more than 1, have the compute nodes been moved at all?

behlendorf commented 1 year ago

> What are the steps required to reproduce this error?

We've now been able to reproduce this by loading up the system with single-compute-node Lustre (or xfs) workflows and letting Flux schedule them to the available compute nodes. For example, #DW jobdw capacity=1TiB type=xfs name=test1. We've also seen this without the copy_in directive being specified. (A rough sketch of this kind of reproducer follows these answers.)

> How many different compute nodes has this been seen on?

Both of our BP nodes:

console.hetchy28-20230224.gz:2023-02-23 15:48:26 [96572.596028] pcieport 0000:90:05.1: PCIe Bus Error: severity=Uncorrected (Fata
console.hetchy29-20230223.gz:2023-02-22 15:42:46 [ 9837.313218] pcieport 0000:90:05.1: PCIe Bus Error: severity=Uncorrected (Fata

> If more than 1, have the compute nodes been moved at all?

No.
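
For completeness, here is a rough reproducer sketch in Python for the flood-the-scheduler approach described above. The flux submit invocation and its --setattr=dw jobspec attribute are assumptions (the flux-coral2 convention), not something taken from this issue; adapt it to however workflows are actually submitted on this system.

```python
#!/usr/bin/env python3
# Hypothetical reproducer sketch: queue many small single-node xfs jobdw
# workflows and let Flux spread them across the available compute nodes.
# The "flux submit --setattr=dw=..." form is an assumption, not from this issue.
import subprocess

JOBS = 50          # number of workflows to queue up
RUNTIME = "120"    # seconds each job sleeps while its fileset is mounted

for n in range(JOBS):
    directive = f"#DW jobdw capacity=1TiB type=xfs name=test{n}"
    subprocess.run(
        ["flux", "submit", "-N1", f"--setattr=dw={directive}", "sleep", RUNTIME],
        check=True,
    )
```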

ajfloeder commented 1 year ago

The issue was seen again on 3/27.

Snip from the log at the point of failure where downstream port containment (DPC) engages:

2023-03-24 14:12:01 [805811.477503] XFS (dm-1): Ending clean mount
2023-03-24 14:12:02 [805812.342919] igb 0000:94:00.0 ens2: PCIe link lost
2023-03-24 14:12:02 [805812.342945] pcieport 0000:90:05.1: DPC: containment event, status:0x1f01 source:0x0000
2023-03-24 14:12:02 [805812.353069] pcieport 0000:90:05.1: DPC: unmasked uncorrectable error detected
2023-03-24 14:12:02 [805812.358049] pcieport 0000:90:05.1: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
2023-03-24 14:12:02 [805812.366555] pcieport 0000:90:05.1: device [1022:1483] error status/mask=00004000/04000000
2023-03-24 14:12:02 [805812.371508] pcieport 0000:90:05.1:    [14] CmpltTO                (First)