I noticed recently that with a very modern Dell Vostro laptop we have at work, chaos crashes completely on startup, which is kind of interesting. I debugged this a bit and concluded that it's actually the pci server that crashes when setting up the SMBus device.
I disabled this device in eab58d65e1f1f899494f6541ca87273403a2d54c, since "not detected" is much better than "crashing the machine". Someone with a strong love for PCI hardware (...) would be very welcome to dig in and do a proper fix for this. I can volunteer to test any fix you make on this hardware (I have only ever seen this happen on one single PC.)
Steps taken to find this
While debugging this, I first tried the patch below, which makes the scanning and setup ignore certain PCI functions (it only probes functions 0 to 3 of each device instead of 0 to 7).
diff --git a/servers/system/pci/pci.c b/servers/system/pci/pci.c
index 53c1de8..7fe5210 100644
--- a/servers/system/pci/pci.c
+++ b/servers/system/pci/pci.c
@@ -528,7 +528,7 @@ static pci_device_type *pci_scan_slot(pci_device_type *input_device)
     bool is_multi = FALSE;
     uint8_t header_type;
-    for (function = 0; function < 8; function++, input_device->device_function++)
+    for (function = 0; function < 4 /*8*/; function++, input_device->device_function++)
     {
         if (function != 0 && !is_multi)
         {
(We could do what MINIX3 did (it was written after chaos had its peak years) and borrow the PCI scanning code from NetBSD instead of trying to write our own. The MINIX3 implementation, which is based on the NetBSD code, can be found here: https://github.com/Stichting-MINIX-Research-Foundation/minix/blame/master/sys/dev/pci/pci_subr.c)
This is just a thought, but maybe it's wrong to assume that all PCI hosts support 8 functions per device, and that assumption is what causes the problem? There could be a flag we can read somehow that determines how many functions should be scanned per device. By not honoring that flag, we would be using the hardware incorrectly, which it doesn't like, so it crashes in our face. Just a thought, but maybe worth investigating.
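For what it's worth, PCI does have a flag along these lines: bit 7 of the header type register (config offset 0x0E) of function 0 marks a multi-function device, and a function that isn't there reads back a vendor ID of 0xFFFF. The context lines in the patch (is_multi, header_type) suggest pci_scan_slot already looks at this, so the following is only a sketch of the conventional scan loop for comparison, not the actual chaos code. The fake_function_t type, the two example slots and scan_slot are made up so that the example is self-contained; a real implementation would read the PCI configuration space instead.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_HEADER_MULTI 0x80   /* header type bit 7: multi-function device */
#define PCI_VENDOR_NONE  0xFFFF /* vendor ID read back when no function responds */

/* Hypothetical stand-in for the config space of one function. */
typedef struct
{
    uint16_t vendor_id;
    uint8_t header_type;
} fake_function_t;

#define ABSENT { PCI_VENDOR_NONE, 0xFF }

/* Made-up multi-function device with functions 0 and 3 present. */
static const fake_function_t example_multi_slot[8] = {
    { 0x8086, 0x80 }, ABSENT, ABSENT, { 0x8086, 0x00 },
    ABSENT, ABSENT, ABSENT, ABSENT
};

/* Made-up single-function device. Function 3 also "answers", but it
   must not be probed because function 0 does not set the multi bit. */
static const fake_function_t example_single_slot[8] = {
    { 0x8086, 0x00 }, ABSENT, ABSENT, { 0x8086, 0x00 },
    ABSENT, ABSENT, ABSENT, ABSENT
};

/* Returns a bitmask with bit N set for every function N that was found. */
static unsigned scan_slot(const fake_function_t *slot)
{
    unsigned found = 0;
    bool is_multi = false;

    for (int function = 0; function < 8; function++)
    {
        /* Functions 1..7 are only probed on multi-function devices. */
        if (function != 0 && !is_multi)
        {
            break;
        }

        /* Nothing responds at this function number. */
        if (slot[function].vendor_id == PCI_VENDOR_NONE)
        {
            continue;
        }

        /* Only function 0 decides whether the others get probed. */
        if (function == 0)
        {
            is_multi = (slot[0].header_type & PCI_HEADER_MULTI) != 0;
        }

        found |= 1u << function;
    }

    return found;
}
```

With the two example slots, scan_slot finds functions 0 and 3 on the multi-function device and only function 0 on the single-function one. If the chaos scan already behaves like this, the crash is probably not caused by the function count as such.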
Finding the failing device
I continued the investigation and, interestingly enough, it seems to be an SMBus device that doesn't like the way we probe its PCI slot:
The code above excludes this device/function from the scanning.
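The actual change is in the commit referenced at the top. Purely as an illustration of the idea, a skip check of this kind could look something like the sketch below. The pci_address_t type, the pci_should_skip function and the address in the list are all made up for the example; in particular, the address is a placeholder and not the real location of the SMBus device.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bus/slot/function triple identifying one PCI function. */
typedef struct
{
    uint8_t bus;
    uint8_t slot;
    uint8_t function;
} pci_address_t;

/* Placeholder entry, NOT the real address of the failing device. */
static const pci_address_t skip_list[] = {
    { 0x00, 0x10, 0x02 }
};

/* Returns true if the scan should leave this function alone. */
static bool pci_should_skip(uint8_t bus, uint8_t slot, uint8_t function)
{
    for (size_t i = 0; i < sizeof(skip_list) / sizeof(skip_list[0]); i++)
    {
        if (skip_list[i].bus == bus &&
            skip_list[i].slot == slot &&
            skip_list[i].function == function)
        {
            return true;
        }
    }

    return false;
}
```

The scan loop would call such a check before touching the function at all, so the device simply shows up as "not detected".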
Does this SMBus device need to be probed in some special way or what's the deal here?
More details about the PCI subsystem on this machine
For reference, here is the full output of lspci: