epics-modules / mrfioc2

EPICS driver for Micro Research Finland event timing system devices
http://epics-modules.github.io/mrfioc2/
Other
8 stars 30 forks source link

minimal AER handling functionality #48

Closed hinxx closed 3 years ago

hinxx commented 3 years ago

As per #44, this is a PR that introduces struct pci_error_handlers and its rudimentary callbacks that result in AER process to successfully restore the PCI register space of a mTCA MRF EVR after a bus reset was issued due to detection of a PCIe error.

Without handling the AER the PCI device makes the recovery process fail, even if all the other device on path to recovery can handle AER.

I do not have sufficient knowledge of the internals of the MRF kernel driver and its interaction with the card/firmware hence the mrf_error_detected() and mrf_slot_reset() might need to do a bit more than just disabling/enabling the PCI device.

This code was tested with the https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git on a mTCA form factor of EVR (mTCA-EVR-300).

mdavidsaver commented 3 years ago

This change looks straightforward enough. It looks like #ifdef CONFIG_PCIEAER won't be necessary at least with the x86 defconfig. I think this can be merged once the compile error wrt. pci_err() vs dev_err() is sorted out. If you're motivated to replace all usage of dev_err(), then great. Please do a second PR though.

Also, can you document the incantations you used to run the aer-inject utility? I'd like to make sure I can exercise these new code paths.

@jpietari Can you say if any additional re-initialization is needed after one of these PCIe error events?

hinxx commented 3 years ago

I seems to compile for all now.

I used this script inject-mrf.txt. Attached is also a sample output log.txt.

Please do a second PR though.

You want me to create another PR, still?

mdavidsaver commented 3 years ago

You want me to create another PR, still?

Only if you are/were eg. planning a larger/unrelated change like switching all usage of dev_err() to pci_err().

hinxx commented 3 years ago

I see. I had no intent to change that.

hinxx commented 3 years ago

Thank you for your input!

First, please let me know when/if you see a real (non-injected) error being handled.

I did see several of those on the system I have. I'll keep an eye on the logs to spot one, record it and share with you.

pci_enable_pcie_error_reporting(). Does this need to be called? (for PCIe only?)

It is PCie only, AFAICT, yes. If enabled it allows the AMC to report the errors to the root complex. Without the call to the pci_enable_pcie_error_reporting() AMC driver will not report error if one would be detected on its part. It is complementary to the AER driver recovery capability, IMHO.

I just tried adding the call to the MRF driver; upon injecting the error, the log shows:

mrf-pci 0000:09:00.0: aer_inject: Device doesn't support AER

For the record I'm working with an EVR:

Manufacturer(25)         : Micro-Research Finland Oy                                             
Product Name(21)         : mTCA.4 Event Receiver                                                 
Product Number(12)       : mTCA-EVR-300                                                          
Part Version(04)         : v1.1                                                                  
Product Serial Number(07): P376060                                                               

As I understand it, the Xilinx firmware PCIe IP core would need to have the AER feature enabled in order to have EVR report the errors.

What this PR does is allowing the EVR to recover from the PCI bus reset that wipes the device register space (as you can tell the crux of the PR is restoring/saving of the PCI registers). One could argue that this is not about AER. In a way that is true, as the error recovery seems to be possible for non-PCIe devices as well, as per https://www.kernel.org/doc/Documentation/PCI/pci-error-recovery.rst.

in e1000e/netdev.c the error_detected callback is disabling MSI

I can surely add that pci_disable_msi() and pci_enable_msi() in the recovery path as well. Let me test that with the EVR at hand before amending the PR. In regard to PM, I never looked at that part myself. Something to keep in mind during the testing!

If this happens lspci -v will show Control: ... BusMaster-.

I see. Thanks for pointing out. I've looked at the lspci -v output before/after the AER recovery and they both show:

[dev@bd-cpu18 ~]$ sudo lspci -s 09:00.0 -vv
09:00.0 Signal processing controller: Xilinx Corporation Device 7011
    Subsystem: Device 1a3e:132c
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+

The PCI_COMMAND register is set as part of the restore process. OTOH, calling the pci_set_master() would not hurt and it would be consistent with other drivers you point out.

I wasn't able to make the QEMU pcie_aer_inject_error command work

I never used QEMU myself. Everything I've done for this PR was on the real uTCA system with the CCT AM G64/471 CPU AMC that has the PCIe root complex capable of handling the AER (the older CCT CPU from 2012 does not have this ability, though). I'll try to play around with the QEMU if time permits.

After we settle on the changes I hope to see the driver running on a larger pool of uTCA systems here at ESS, before merging this into master branch. I think our controls group also has some PCIe EVM/EVG cards that could be tested.

A question from my side. This driver supports the EVR/EVG(?) that come in non-PCIe form, right? If so, would there be a need to #ifdef the code I'm introducing?

hinxx commented 3 years ago

I've tested the latest PR code on the EVR hardware by starting the IOC and expecting to see events count up. The events do count up after the recovery. It goes without saying that the IOC had to be restarted after the bus was reset and UIO device unregistered.

It appears that before calling pci_disable_msi() the IRQ line needs to be freed. For UIO device this done in the uio_unregister_device() hence the need to remove and re-add the UIO device. See comment at https://elixir.bootlin.com/linux/latest/source/drivers/uio/uio.c#L970.

The call to pci_aer_clear_nonfatal_status() has been scrapped since the AER capability / registers are not present on EVR.

mdavidsaver commented 3 years ago

It goes without saying that the IOC had to be restarted after the bus was reset and UIO device unregistered.

This isn't a good thing to leave unsaid. I also wonder what happens to the IO memory mappings while the slot is being reset?

The mrfioc2 epics driver shadows some registers, and will (eventually) behave in undesirable ways if register writes or reads don't go through (this is what uncorrectable means wrt. PCI errors). If this happens, faulting the userspace process seems like the strictly correct (though inconvenient) way to go. Though I'm not clear on how this should be done. I've given some thought to this in the context of hot swap, but didn't reach any conclusions. Mainly, I couldn't find an existing driver supporting hot swap, and exposing MMIO to userspace.

mdavidsaver commented 3 years ago

AER capability / registers are not present on EVR.

Right. I wasn't paying enough attention to your inject-mrf script which is running aer_inject on (I think) the upstream port of a PCIe switch.

mdavidsaver commented 3 years ago

I've been wrestling with the fact that I just don't know enough to fully review this change. I'm left with lingering doubts about what happens to actively mmap()'d ranges following an unrecoverable error*. I think I would need to take a deep dive into kernel internals to figure this out to my satisfaction, which I don't see happening in the foreseeable future.

Rather than continue this unproductive/indefinite delay, I'm going to go ahead and merge this change on the theory that it is unlikely to led to regressions.