Open pietrushnic opened 2 years ago
@miczyg1 please let me know what you think.
@staticfloat can elaborate more, but basically what we're planning to do here is:
The primary reason for using custom firmware here is that we don't really trust the system to be in any reasonable state after we do a CI run. We've regularly seen the PCIe root complex lock up due to bad transactions being generated by bad gateware on the FPGA device. Similarly, we don't really want to trust the integrity of the NVMe drive between reboots, since a stray PCIe transaction could easily have corrupted it. By using the SPI flash as the source of truth for the system (immutable, hopefully, if we can turn on the lockdown bits before the kexec), all of that can hopefully be avoided.
The other reason for custom firmware is that such debugging is of course highly sensitive to the exact configuration of the PCIe topology and we'd like this to be reproducible across setups, and we don't really trust the vendor BIOS to have any degree of consistency here between versions. Also, we may need to turn on or off custom BIOS options, and that's a lot easier to do if we control the setup.
That said, if I had a magic wand, there's a couple of nice-to-haves that I'm not planning to look into, but that I think would make the system much better.
Is it possible to just toggle the PCIe power domain to force a reset of the FPGAs on the PCIe cards? It's easy to get into a situation where the FPGA stops decoding PCIe entirely, so if that's your reconfiguration mechanism, there's not much you can do short of a power cycle. If we have full control over the firmware, it'd be interesting to see if we can reset the PCIe domain independently of the rest of the system to have faster cycles on this.
Can we use one of the system watchdogs to automatically reset the system if it gets really wedged? As I mentioned, we do regularly see the PCIe root complex get really wedged, so I'm not sure a software watchdog will do the trick properly. Can we expose the PCH watchdog (i.e. program it in firmware and have a driver in the OS that pets it every 30s or whatever)? (Maybe coreboot already supports this out of the box - haven't looked into it, just writing down ideas).
What about measured boot? Currently, this is just for our open source experiments, so we don't really care about access control, but for people developing proprietary hardware, they'll probably want to gate CI enrollment on some sort of attestation.
Is there additional lockdown or hardening that could be done to protect the system against a rogue PCIe device? E.g. are there any NVMe drives that either have a firmware lockdown mode or have volatile firmware, so we can RoT them from the firmware. (Haven't looked into this either).
In general, I don't think I need the firmware to do very much here, except be exceptionally robust and have very quick cycle times. That's by itself an accomplishment of course ;). Hope this is useful. We'll keep pressing ahead with our setup, but happy to be guinea pigs if you want to come up with a maintained version of this :).
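For the watchdog idea above, the OS side could be as small as the following sketch. It assumes Linux, a watchdog driver that exposes the standard `/dev/watchdog` character device (on Intel PCH platforms this is typically `iTCO_wdt`, but whether it works on this board and firmware is untested here), and a hardware timeout longer than the 30 s petting interval. The function names `pet_watchdog` and `disarm_watchdog` are made up for illustration.

```python
import time


def pet_watchdog(dev, interval_s=30.0, beats=None, sleep=time.sleep):
    """Send a keepalive to the watchdog every `interval_s` seconds.

    dev   -- writable binary file object for the watchdog device node
             (assumption: /dev/watchdog opened unbuffered by the caller)
    beats -- number of keepalives to send; None means loop forever
    sleep -- injectable sleep function, so the loop can be exercised
             without real hardware or real delays
    Returns the number of keepalives sent.
    """
    sent = 0
    while beats is None or sent < beats:
        # Any write to the device counts as a keepalive in the Linux
        # watchdog API; the byte value itself does not matter here.
        dev.write(b"\0")
        dev.flush()
        sent += 1
        sleep(interval_s)
    return sent


def disarm_watchdog(dev):
    """Write the 'magic close' character before closing the device.

    Drivers that support magic close stop the timer when the device is
    closed after a 'V' was written. Drivers running with nowayout keep
    the timer armed regardless.
    """
    dev.write(b"V")
    dev.flush()
```

On a real machine this would be driven by something like `with open("/dev/watchdog", "wb", buffering=0) as wd: pet_watchdog(wd)`, run as root. If the petting process or the whole system wedges, the writes stop and the hardware timer resets the machine, which is the behavior asked for above.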
1 - This one is hard because, as far as I know, the PCIe Slots are always powered, and I doubt that any consumer level board has any sort of controller that can remove power from a specific PCIe Slot and then restore it to force a card to power cycle. This also assumes that your card ONLY gets power from the PCIe Slot and isn't like a GPU that has auxiliary 6-Pin/8-Pin PCIe Power connectors, because that makes things harder as you likely want to remove power from all inputs at the same time. Thus, you most likely need some actual PCIe Hotplug hardware, which tends to be some form of interposer.
On enterprise servers that are intended for PCIe Hotplug, you have things like carrier modules, where you first gracefully disable the card Driver, then use a button on the carrier module to power off the card so that it is safe to physically remove the module, then externally change the card inside the module, then reinsert the module into the server:
https://docs.oracle.com/cd/E24355_01/html/E41214/z400307d1578432.html#scrolltoc https://docs.oracle.com/cd/E24355_01/html/E41214/z40000082142688.html#scrolltoc https://docs.oracle.com/cd/E24355_01/html/E41214/cgghjehj.html#scrolltoc https://docs.oracle.com/cd/E24355_01/html/E41214/z4000d201391756.html#scrolltoc
You most likely want to hack a solution similar to that.
External boxes that could fill the role of such carrier modules are used for eGPU purposes, but all the ones I know are limited to 4 PCIe lanes for the card and rely on ThunderBolt 3: https://egpu.io/best-egpu-buyers-guide/ Not sure if there is something similar with just standard PCIe. You could have some PCIe card adapter with an external OCuLink connector (4 lanes each) to connect such a box, and some button that removes all power.
Cheap interposers like that could include 16x-to-16x mining risers with separate power like this one: https://www.amazon.com/Express-Riser-Extender-Molex-Ribbon/dp/B00OTGJQ10 That allows you to have separate power cables for the power that normally comes from the PCIe Slot. You then need to pass these power cables through some controller that can be powered off or on by user input (and do the same for PCIe 6/8-Pin power, if any).
Either way, if you need an interposer you can't fit the card inside a regular computer case. That is literally the only way I can think of if you need a complete electrical power cycle. This could save you full computer reboots, assuming that you get what you want working by just power cycling the card.
2 - I have no idea if there is a Chipset or Super I/O watchdog, or whether it is actually enabled and can detect any freeze to force a power cycle.
3 - There is some form of support for Measured Boot on Dasharo for MSI: https://docs.dasharo.com/variants/msi_z690/test-matrix/ Not sure if it is enabled, working or what. But it was part of the plan.
4 - What you want is early/pre-boot IOMMU. I recall seeing several talks (including a few from 3mdeb as part of Firmware security) regarding malicious or broken PCIe Devices and the need to have the IOMMU running as soon as possible to prevent one device from DMAing other devices. Not sure whether this one is implemented on MSI either. But the concept exists and has been theorycrafted.
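As a companion to point 1: if the card is power cycled through some external interposer, the OS still has to drop the stale device and re-enumerate the bus afterwards. A minimal sketch of that software half, assuming Linux and its standard sysfs PCI files (`remove` under the device directory and `/sys/bus/pci/rescan`), could look like this. The function name `recycle_device`, the `power_cycle` callback, and the example address are hypothetical. This does not power cycle anything by itself.

```python
from pathlib import Path


def recycle_device(bdf, power_cycle=None, sysfs_root="/sys"):
    """Detach a PCI device, optionally power cycle it, then rescan the bus.

    bdf         -- PCI address such as "0000:01:00.0" (example value only)
    power_cycle -- optional callable that toggles card power through
                   external hardware (hypothetical; e.g. a relay on a riser)
    sysfs_root  -- overridable so the function can be tried on a fake tree
    Returns True if the device was present and removed before the rescan.
    """
    pci = Path(sysfs_root) / "bus" / "pci"
    remove = pci / "devices" / bdf / "remove"
    rescan = pci / "rescan"

    removed = False
    if remove.exists():
        # Writing "1" asks the kernel to unbind the driver and forget
        # the device; the hardware itself is not reset by this.
        remove.write_text("1")
        removed = True

    if power_cycle is not None:
        power_cycle()

    # Writing "1" makes the kernel re-enumerate all PCI buses, so a card
    # that came back after the power cycle is discovered again.
    rescan.write_text("1")
    return removed
```

On real hardware this needs root, and a card whose FPGA no longer decodes PCIe will only reappear after the rescan if the external power cycle actually brought it back.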
Ad.1. I basically agree with @zirblazer. There is no way to power cycle a PCIe slot without proper design or external interposers/hardware. Modern laptops tend to implement PCIe RTD3, which is able to power PCIe endpoint devices down and up for S0ix sleep, but that of course requires a hardware design supporting it.
Ad.2. There are watchdogs on the PCH and Super I/O. Using the PCH watchdog should definitely be possible.
Ad.3. Measured boot is enabled by default in our images as long as the platform has a TPM module.
Ad.4. Early VT-d/IOMMU protection is possible; Intel FSP has an option for that. There should be no problem with testing that. But the firmware should probably be aware of it (or one has to make the firmware aware of it).
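To make the "gate CI enrollment on attestation" idea concrete, here is a rough sketch of the check a CI coordinator could run on measured-boot data. It relies on the standard TPM extend rule for a SHA-256 PCR, new = SHA-256(old || digest), and assumes the PCR starts as all zeros. The function names and the idea of comparing against a single known-good PCR value are illustrative assumptions; a real deployment would also need a TPM quote signed by an attestation key, which is not shown.

```python
import hashlib
import hmac


def extend(pcr, digest):
    """Apply one TPM-style extend: new PCR = SHA-256(old PCR || digest)."""
    return hashlib.sha256(pcr + digest).digest()


def replay_pcr(event_digests):
    """Recompute a SHA-256 PCR value from an ordered list of event digests.

    Assumption: the PCR was reset to 32 zero bytes at boot.
    """
    pcr = bytes(32)
    for digest in event_digests:
        pcr = extend(pcr, digest)
    return pcr


def may_enroll(event_digests, expected_pcr):
    """Return True only if replaying the event log yields the expected PCR.

    `expected_pcr` is a known-good value recorded for a trusted firmware
    build (hypothetical policy input, not something Dasharo defines).
    """
    return hmac.compare_digest(replay_pcr(event_digests), expected_pcr)
```

Because each extend hashes over the previous value, both the content and the order of the measurements affect the result, so a machine that booted different or reordered firmware components would fail the check.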
The problem you're addressing (if any)
There are use cases in which Dasharo can help with testing peripheral devices (e.g. FPGAs connected over PCIe, accelerators) as part of a continuous integration process for bigger applications. Such hardware is often used in HPC, medical, communication and low latency trading.
Because an incorrect upload of gateware to the FPGA, or incorrect provisioning for the target application, may lead to malfunction of the whole system, Dasharo should support configuring such peripheral devices as early as possible. In other use cases we have faced, the FPGA has to be available as early as possible and cannot wait for full system initialization. Because of that, there should be a way to discover connected PCIe devices as soon as possible and initialize them to make them fully functional.
Describe the solution you'd like
Early initialization of peripheral devices and support for custom provisioning (bitstream/gateware upload and other operations).
Where is the value to a user, and who might that user be?
Faster full system initialization, as well as early checks that device provisioning is correct.
Describe alternatives you've considered
Maybe there is a way for TrenchBoot DRTM live relaunch to avoid a power cycle. This should improve the trustworthiness of the whole process.
Additional context
The above use case was mentioned by @Keno in this Twitter thread (nitter link).