AER support for mTCA-EVR-300 device

hinxx commented 3 years ago

In the uTCA crates that we use, sporadic fatal errors may cause the root PCIe port to issue a secondary bus reset. After the recovery process finishes, the EVR card remains in an unconfigured state: all PCI/PCIe registers in the EVR are at their defaults, mostly 0. To recover, the CPU needs to be restarted.

It was observed that, if AER handlers/callbacks are added to the kernel driver, restoring the PCI/PCIe register space returns the card to an operational state without any user intervention or CPU reboot. More details here: https://www.spinics.net/lists/linux-pci/msg102935.html

It is worth noting that in this case the EVR did not fall victim to the error being reported. It is hard to tell whether it was the cause of the error, though. TBH, there are other, more prominent actors in the crate that could be the originators of the issue that the root port sees. IOW, nothing bad is happening to the EVR at that point in time; it is only that the bus reset causes its endpoints to go through reset, and they need to be re-configured afterwards.

Would upstream be willing to accept AER patches to the kernel driver?

mdavidsaver commented 3 years ago

It seems reasonable to try to handle PCI errors. I guess that this will mainly involve refactoring mrf_probe() to separate hardware from software initialization?

Do you have any idea how to test new fault handling code? I've found something, which links to a circa 2010 utility which doesn't seem to have much in the way of documentation.

hinxx commented 3 years ago

In my tests so far I have:

added a struct pci_error_handlers with oversimplified callbacks,
added a call to pci_save_state() at the end of mrf_probe() so that the stored state can be used for recovery in the error handler(s),
made the .resume() error callback call pci_restore_state() and pci_save_state() to restore the card state after a bus reset.
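
The three changes listed above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual patch: the names mrf_error_detected, mrf_resume, mrf_err_handlers and simulate_bus_reset are hypothetical, and the user-space branch contains simplified stand-ins for the kernel types (a single made-up bar0 register instead of the full config space) so that the save/restore flow can be exercised outside the kernel.

```c
/* Sketch only: AER recovery callbacks along the lines listed above.
 * mrf_error_detected, mrf_resume and mrf_err_handlers are hypothetical names. */
#ifdef __KERNEL__
#include <linux/pci.h>
#else
/* User-space stand-ins (NOT the real kernel definitions) so the flow can run. */
#include <assert.h>
#include <stdint.h>
typedef int pci_ers_result_t;
typedef int pci_channel_state_t;
enum { PCI_ERS_RESULT_NEED_RESET = 3, PCI_ERS_RESULT_DISCONNECT = 4 };
enum { pci_channel_io_frozen = 2, pci_channel_io_perm_failure = 3 };
struct pci_dev { uint32_t bar0, saved_bar0; int state_saved; };
struct pci_error_handlers {
    pci_ers_result_t (*error_detected)(struct pci_dev *dev,
                                       pci_channel_state_t state);
    void (*resume)(struct pci_dev *dev);
};
static int pci_save_state(struct pci_dev *dev)
{
    dev->saved_bar0 = dev->bar0;
    dev->state_saved = 1;
    return 0;
}
static void pci_restore_state(struct pci_dev *dev)
{
    if (!dev->state_saved)
        return;
    dev->bar0 = dev->saved_bar0;
    dev->state_saved = 0;   /* the saved copy is consumed by a restore */
}
#endif

/* Tell the PCI core that this device wants a reset unless it is gone for good. */
static pci_ers_result_t mrf_error_detected(struct pci_dev *dev,
                                           pci_channel_state_t state)
{
    (void)dev;
    if (state == pci_channel_io_perm_failure)
        return PCI_ERS_RESULT_DISCONNECT;
    return PCI_ERS_RESULT_NEED_RESET;
}

/* After the bus reset: put back the state saved at the end of mrf_probe(),
 * then save it again so that a later reset can be recovered from as well. */
static void mrf_resume(struct pci_dev *dev)
{
    pci_restore_state(dev);
    pci_save_state(dev);
}

/* In the driver this table would be referenced from the .err_handler field
 * of struct pci_driver, and pci_save_state() called at the end of mrf_probe(). */
static const struct pci_error_handlers mrf_err_handlers = {
    .error_detected = mrf_error_detected,
    .resume         = mrf_resume,
};

#ifndef __KERNEL__
/* Mimic a secondary bus reset: notify the driver, wipe the register as the
 * reset would, then run the resume callback. Returns 1 if state came back. */
static int simulate_bus_reset(struct pci_dev *dev)
{
    pci_ers_result_t r =
        mrf_err_handlers.error_detected(dev, pci_channel_io_frozen);
    assert(r == PCI_ERS_RESULT_NEED_RESET);
    (void)r;
    dev->bar0 = 0;                  /* registers fall back to defaults */
    mrf_err_handlers.resume(dev);
    return dev->bar0 != 0;
}
#endif
```

The second pci_save_state() call in the resume callback matters because a restore consumes the saved copy; without it, only the first bus reset would be recoverable. The same two calls could presumably live in a .slot_reset() callback instead, which is where the kernel's PCI error recovery documentation expects a driver to re-initialize hardware after a reset.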

The above allows the driver/card to participate in the recovery process that the root port initiates upon seeing an error from any device on the downstream switch (which connects the EVR, among other AMCs, to one of its ports). Judging from the kernel code comments, a driver that does not support AER ruins the recovery for all the others on the bus.

It does nothing so far to fix any issues the EVR card itself might be having, should it be the one reporting the error.

Do you have any idea how to test new fault handling code?

The aer-inject kernel module and userspace tool that you found still work in 2020. I used them to inject an error that causes the root port to issue a bus reset for all my AMCs behind the MCH PCIe switch, and they do recover. I can also wait until my uTCA setup generates an error by itself, but it is way easier to use aer-inject.

I'll try to prepare a PR with what I've got.

hinxx commented 3 years ago

It might be that the PCIe core is not properly invoking error handlers as per https://lore.kernel.org/linux-pci/20201215185618.GC22809@redsun51.ssa.fujisawa.hgst.com/.

hinxx commented 3 years ago

I've been testing the patchset at https://lore.kernel.org/linux-pci/20210111163708.GA1458209@dhcp-10-100-145-180.wdc.com/T/#m0b23a618bd1b76c0babd34987788f5e2dfbbdd3d and it results in the successful recovery of the PCI buses in my mTCA crate. That patchset fixes the AER handling such that the .slot_reset() callbacks on devices are made after the PCI bus reset.

The MRF kernel driver still needs the following:

a struct pci_error_handlers with callbacks,
a call to pci_save_state() at the end of mrf_probe() so that the stored state can be used for recovery in the error handler(s).

A PR for the MRF kernel driver follows.

mdavidsaver commented 3 years ago

To take a step back from error handling: the PCIe error your posts show is CmpltTO. It's too bad the kernel error messages don't show the TLP header of the failing operation. It would be interesting to know which endpoint/address is actually involved (and in which direction).

From some quick reading, most references to CmpltTO errors involve DMA engines. This isn't surprising given how easy it is to get PCIe DMA wrong. I also see that you make reference to "Research Centre Juelich", whose PCI vendor ID, if memory serves, is used by various SIS fast digitizer cards, which are likely to be involved in DMA operations.

jerzyjamroz commented 6 months ago

It was confirmed by @hinxx that the PR closed this issue.

epics-modules / mrfioc2

AER support for mTCA-EVR-300 device #44