Dasharo / dasharo-issues

The Dasharo issue tracker
https://dasharo.com/

Reliable and automatic GPU HCL report #462

Open pietrushnic opened 1 year ago

pietrushnic commented 1 year ago

The problem you're addressing (if any)

There is no automatic way to obtain reliable information about GPU HCL. The information currently dumped by Dasharo HCL is not enough for that.

The topic of GPU identification is quite complex, as can be seen in the following Phoronix Test Suite function.

As a temporary measure during this effort, we will remove GPU compatibility information, since for now we can only rely on reports with physical evidence and maintain the lists manually.

Describe the solution you'd like

Ideally, we would reimplement the Phoronix identification code from the linked function or the others. Or even better, use PTS itself, but only the part we need for GPU identification.

There may be other options which we need to learn about.

Where is the value to a user, and who might that user be?

GPU compatibility is one of the most important and searched topics. It can be a driving factor in switching to Dasharo.

Describe alternatives you've considered

No response

Additional context

No response

zirblazer commented 1 year ago

I actually believe that making a useful, good Video Card HCL requires A LOT of manual work and can't really be automated to any meaningful degree, because there are just way too many stages where the GPU may not function the way you expect it to. You would end up with a bad quality list where things just "partially work", and that info is almost worthless.

For example, older Dasharo versions for the MSI Z690-A had the major bug of not initializing the IGP of certain Processors ( https://github.com/Dasharo/dasharo-issues/issues/274 ). It went completely unnoticed by some of the users who flashed Dasharo with those Processors: if Dasharo could automatically pick a boot device without them having to enter the Firmware setup menu to configure the boot order, they would believe that Dasharo booted straight to Linux, because they got working video output once the Linux Intel GPU Drivers took over. For other people it just got stuck on the UEFI Shell, and it appeared as if the Motherboard was bricked when in truth Dasharo was properly flashed and working, but there was no video. So for all practical purposes, for some people the IGP worked since it was fully functional in the OS, except during Firmware stages, where it did not. Thus it "partially works".

From a user perspective, more so for one that is doing the novelty thing of trying a third party Firmware, the most critical moment where I'm expecting the Video Card to work is precisely at Firmware POST time. Info about that seems to be hard to collect beyond cbmem reporting that the Video Card was detected as a PCI Device, that its Option ROM was loaded, and that a Monitor plugged into it was detected. So you likely need manual confirmation from the user that the card displays the Dasharo splash screen on the Monitor and that you can enter and interact with Firmware setup with no performance anomalies or visual glitches, which is pretty much the most important thing. You can't really automate this, and without this info, an HCL doesn't mean much unless you specify WHAT is working.

So far, there are three "stages" where a Video Card should be validated to work:
1 - At Firmware POST time, so you get the Dasharo splash screen and visual access to the Firmware Setup menu
2 - After the OS loads the GPU Drivers. This also varies depending on whether you are testing on Linux or Windows, of course
3 - For PCI Passthrough with Xen (Qubes) or QEMU-KVM-VFIO

Firmware POST time is the most important. As stated before, manual confirmation that it works is unavoidable. The only useful information that may be automatically gathered is from cbmem, which requires Dasharo to be flashed first. At the bare minimum, the two main requirements for the Firmware to get a working video output are:
a - The Video Card Option ROM has to have a UEFI GOP image, which nearly all cards from the last decade do (with at least one curious exception, the Radeon VII with its release VBIOS: https://www.techpowerup.com/forums/threads/amd-radeon-vii-has-no-uefi-support.252476/ ). Video Cards with just older VBIOS images are not expected to work during Firmware because it is UEFI only, but the OS may be able to initialize them anyway, so they may still "partially work".
b - If Secure Boot is enabled (newer Dasharo versions have it disabled by default), the Option ROM has to be signed, most likely with the Microsoft Third Party UEFI certificate. Early cards with UEFI GOP support are not signed, but this should not be a problem for any modern card. An unsigned Option ROM may potentially stop POST and freeze the system.
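Requirement (a) can be pre-checked from a VBIOS dump before any manual test. The following is only a minimal sketch, assuming the dump follows the standard PCI expansion ROM layout (0x55AA signature, pointer to the "PCIR" data structure at offset 0x18, code type 0x03 for EFI images). The function names are made up for illustration, real dumps from some vendors may deviate from this layout, and finding an EFI image does not prove that it contains a working GOP driver or that it is signed.

```python
import struct

# Code type values from the PCI data structure ("PCIR") of an expansion ROM image.
CODE_TYPE_NAMES = {
    0x00: "legacy x86 (VBIOS)",
    0x01: "Open Firmware",
    0x02: "HP PA-RISC",
    0x03: "EFI",
}


def rom_image_code_types(rom):
    """Walk the image chain of an Option ROM dump and return each image's code type."""
    code_types = []
    offset = 0
    while offset + 0x1A <= len(rom):
        # Every image starts with the 0x55 0xAA signature.
        if rom[offset:offset + 2] != b"\x55\xaa":
            break
        # Offset 0x18 holds a little-endian pointer to the PCIR structure,
        # relative to the start of the image.
        (pcir_ptr,) = struct.unpack_from("<H", rom, offset + 0x18)
        pcir = offset + pcir_ptr
        if pcir + 0x16 > len(rom) or rom[pcir:pcir + 4] != b"PCIR":
            break
        (image_len_units,) = struct.unpack_from("<H", rom, pcir + 0x10)
        code_types.append(rom[pcir + 0x14])
        indicator = rom[pcir + 0x15]
        # Bit 7 of the indicator marks the last image; a zero length would loop forever.
        if indicator & 0x80 or image_len_units == 0:
            break
        offset += image_len_units * 512  # image length is in 512-byte units
    return code_types


def has_efi_image(rom):
    """True if at least one image in the dump is marked as an EFI image."""
    return 0x03 in rom_image_code_types(rom)
```

Usage would be something like `has_efi_image(open("vbios.rom", "rb").read())`, where `vbios.rom` is a hypothetical dump file. A result of `False` would flag a card that is expected to stay dark during Firmware stages.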

I recall four different scenarios where there were issues with Video Cards during Firmware:

First, I had performance slowdowns during POST with a Radeon 5600XT installed. They were tracked down to a USB Flash Drive inserted in the back column with 4 USB Ports, which seems to be problematic, and were fixed by merely moving the flash drive to another USB Port. This one required cbmem logs with and without the card installed before miczyg pointed out the USB going bonkers issue, which was ironically NOT directly related to the Video Card, yet only happened when I had a Monitor plugged into it: if I recall correctly (this happened a year ago already), it didn't happen if the Monitor was on the IGP while the Radeon 5600XT was still installed. So an issue with USB manifested as a Video Card POST performance issue instead.

You also have the Radeon Polaris series (RX 470 / 480 / 570 / 580) issue with the early VT-d driver, implemented and enabled by default in Dasharo v1.1.0, which caused a white screen during POST, whereas the cards worked fine in v1.0.0 (and in v1.1.1, which has the driver disabled by default. But it will become a problem if you get to implement Thunderbolt...). This one was detected through cbmem logs reporting early VT-d driver DMA violations.

The third one was miczyg's Radeon 6600XT not working during POST, which he fixed for v1.1.0. I don't recall whether cbmem gave him the details, nor much of how the issue was figured out, but it was a miracle discovery about how the Option ROM had to be loaded.

The fourth one was a user with a GeForce 3060 Ti, which didn't work on v1.0.0 until the user decided to test with Secure Boot disabled (Dasharo came with Secure Boot enabled out of the box, then disabled since v1.1.0 precisely due to this). He disassembled the system, removed the card, noticed the system was actually working without it, disabled Secure Boot, put the card back in, and found that it worked that way. It may be possible that the user was using a modded VBIOS or something else that wasn't signed, and that is why having Secure Boot enabled stopped POST, but at that point you actually need to analyze the Option ROM of that specific Video Card. It may also be possible that the card's Option ROM wasn't signed (similar to the Radeon VII issue), but that seems unlikely with a modern card.

Moreover, there are scenarios where the Video Card itself can cause a failure during POST (either not initializing until the OS stage, or freezing the system), depending on the Monitor plugged in. One example is GeForce 900 and 1000 cards with older VBIOSes and certain DisplayPort Monitors: https://www.techpowerup.com/244981/nvidia-has-a-displayport-problem-which-only-a-bios-update-can-fix

This means that you have WAY TOO MANY SCENARIOS already. At this point, a good Video Card HCL should include not only the GPU, but also the Video Card maker and model, the Option ROM version that was tested, and which connectors had Monitors plugged in, because results may vary depending on that. You may consider this too exhaustive, but if you want to reproduce an issue or understand why it happens, you need it.

After the OS loads is the easy part, because that is where you can automate the most, and you can even get data with both proprietary Firmware and Dasharo for comparison. Results may still vary depending on a multitude of factors, including whether the Video Card is the only one installed or there are multiple Video Cards, which could cause MMIO issues, like with my dual Radeon 5600XT setup ( https://github.com/Dasharo/dasharo-issues/issues/245 ). Actually, MultiGPU results should be kept completely separate since they add another massive layer of complexity, more so if you are using different cards (mixing nVidia and AMD) instead of identical ones. For all practical purposes, you likely want Video Cards to be tested in the main PCIe 5.0 16x slot with no other Video Card installed (or better yet, with no other PCIe card at all). Compatibility results may vary between Linux and Windows (again, see my dual 5600XT issues: Linux worked, Windows BSODed).

Information like the size of the PCI BARs can also be useful to know whether things like ReBAR work. But that adds YET ANOTHER layer of complexity: results may differ between Below 4G MMIO, Above 4G MMIO, and Above 4G MMIO + ReBAR. Moreover, with AMD Video Cards, the amdgpu driver seems to want to enable ReBAR even when the Firmware doesn't enable it, so you may need to figure out which Kernel parameters you have to set to force Linux and the amdgpu Driver to honor the Firmware MMIO allocations instead of reallocating everything to fit ReBAR. Ironically, this behavior was hiding the issue that made Windows BSOD with both cards installed (some issues were reproducible in older Linux versions, before amdgpu began to behave like that).
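As a rough illustration of how BAR sizes and their placement could be collected automatically under Linux, here is a small sketch that parses the text of a device's sysfs `resource` file (for example `/sys/bus/pci/devices/0000:03:00.0/resource`, where the address is only an example). It assumes the usual format of one `start end flags` line per resource, with the first six lines being the BARs. The function name and the returned dictionary keys are invented for this example. Keep in mind that this shows the layout after the kernel (and amdgpu) had a chance to reallocate things, which is not necessarily what the Firmware assigned.

```python
FOUR_GIB = 1 << 32
IORESOURCE_MEM = 0x200  # memory resource flag bit from the Linux kernel's ioport.h


def parse_bars(resource_text):
    """Return size and placement of each populated BAR from sysfs 'resource' text."""
    bars = []
    # Only the first six lines are BARs 0-5; later lines are the ROM and others.
    for index, line in enumerate(resource_text.splitlines()[:6]):
        fields = line.split()
        if len(fields) != 3:
            continue
        start, end, flags = (int(value, 16) for value in fields)
        if start == 0 and end == 0:
            continue  # unused BAR, or the upper half of a 64-bit BAR
        bars.append({
            "bar": index,
            "start": start,
            "size": end - start + 1,
            "above_4g": start >= FOUR_GIB,
            "is_memory": bool(flags & IORESOURCE_MEM),
        })
    return bars
```

Comparing the reported sizes between runs with different Firmware settings would be one way to tell the Below 4G, Above 4G, and Above 4G + ReBAR cases apart in a report.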

And finally, PCI Passthrough. GPUs behave differently after a full reset compared to a warm reset, like when you close a Virtual Machine and restart it. This means that a potential HCL aiming for compatibility with Qubes needs even MORE testing. As a particular example in QEMU-KVM-VFIO, some AMD Video Cards had reset issues (they only worked properly the first time the VM was started, and could freeze the system or show other funny behavior when the VM was rebooted). In other cases, the cards don't work in a VM if ReBAR is enabled in Firmware (because the amdgpu Driver resizes just a single BAR, whereas the Firmware maxes out the two BARs that are marked as resizable).

There you have it. A Video Card HCL is a complete nightmare. Almost everything should merely "partially work", since it may or may not work depending on Firmware settings, OS configuration, and use case (native bare metal vs PCI Passthrough for virtualization, and I bet that Xen vs QEMU-KVM-VFIO also makes a difference, since the VFIO maintainer at some point added quirks to help reset some Video Cards and such).

Now, here is what I propose...

- The Video Card has to be installed directly (no PCIe Risers or anything funny) in the main, metal PCIe 5.0 16x Slot, and there shouldn't be any other Video Card except the Intel integrated one (also note that Dasharo currently doesn't support disabling the IGP, so F and non-F CPUs may behave differently, but you should theoretically be able to simulate F behavior on a non-F CPU if you were able to disable the IGP). This can be checked from lspci -t, since the GPU should be sitting in a tree spawning from the same PCIe Root Port in every case.
- Confirmation that the Dasharo splash screen and Firmware menus were working, and that there don't seem to be any performance anomalies or visual glitches (artifacts). This has to be done fully manually. It MAY also involve Monitor related bugs, so the Monitor plugged into the card matters, as do Firmware settings like Secure Boot, as mentioned previously.
- Maybe a confirmation that Linux and/or Windows works, by testing a game, an automatic testing suite, whatever. Since you currently support neither Above 4G MMIO nor ReBAR, you don't have to test a whole lot of combinations. amdgpu not honoring the Firmware MMIO allocation is the thing that annoys me the most, since things could be broken Firmware side but worked around OS side, but most people are going to use the default configuration anyway...
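The Root Port check does not strictly need the lspci -t tree drawing. As a sketch, assuming a Linux system where resolving `/sys/bus/pci/devices/<address>` with `os.path.realpath` yields the full path through every bridge above the device, the chain can be read from the path itself. The helper names and the example addresses are illustrative only.

```python
import re

# A PCI address in domain:bus:device.function form, e.g. 0000:01:00.0
BDF_RE = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")


def pci_path_chain(sysfs_realpath):
    """Return the PCI addresses found in a resolved sysfs path, topmost bridge first."""
    return [part for part in sysfs_realpath.split("/") if BDF_RE.match(part)]


def root_port_of(sysfs_realpath):
    """Return the first bridge above the device, or None if it sits on the root bus."""
    chain = pci_path_chain(sysfs_realpath)
    return chain[0] if len(chain) > 1 else None


# Example with a made-up path of the kind that
# os.path.realpath("/sys/bus/pci/devices/0000:01:00.0") would return:
example_path = "/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0"
print(root_port_of(example_path))
```

A report could then record the Root Port address next to the GPU, so that entries tested in a different slot can be told apart. An integrated GPU sitting directly on the root bus yields `None`.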

- Actual GPU chip (PCI Vendor ID / Device ID)
- Video Card model (Subsystem Vendor ID / Device ID)
- Option ROM version
- Firmware settings (Secure Boot Enabled/Disabled, Early VT-d Enabled/Disabled. Ideally both Enabled, which is the worst-case scenario)

Here is a small possible example:

VIDEO CARD
Radeon RX 5600XT (MSI GAMING X) - https://www.msi.com/Graphics-Card/Radeon-RX-5600-XT-GAMING-X/Specification
Manufacturer Part Number (If available) - ? (Is on the card, but I forgot where I wrote it)
Option ROM - https://www.techpowerup.com/vgabios/220715/msi-rx5600xt-6144-200405
PCI Vendor ID / Device ID - 1002 731F
PCI Subsystem Vendor ID / Subsystem Device ID - 1462 C810

FIRMWARE SETTINGS
Secure Boot - Enabled
Early VT-d Driver - Enabled

Works in Firmware, boots Ubuntu 22.04.1, Windows 11 with Drivers Adrenaline blah blah blah.
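An entry like the one above could be stored in a structured form so that "partially works" is always broken down by stage. This is just a possible sketch: the class names, field names and result values are invented here and are not part of any existing Dasharo HCL format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional, Tuple


class Stage(Enum):
    FIRMWARE_POST = "firmware_post"      # splash screen and setup menu visible
    OS_DRIVER = "os_driver"              # after the OS loads its GPU driver
    PCI_PASSTHROUGH = "pci_passthrough"  # Xen (Qubes) or QEMU-KVM-VFIO


class Result(Enum):
    WORKS = "works"
    FAILS = "fails"
    UNTESTED = "untested"


@dataclass
class GpuHclEntry:
    card_name: str
    vendor_id: int
    device_id: int
    subsystem_vendor_id: int
    subsystem_device_id: int
    option_rom_version: Optional[str] = None  # unknown until read from the dump
    secure_boot: bool = False
    early_vtd: bool = False
    results: Dict[Stage, Result] = field(default_factory=dict)

    def card_key(self) -> Tuple[int, int, int, int]:
        """IDs that identify the GPU chip and the specific card model."""
        return (self.vendor_id, self.device_id,
                self.subsystem_vendor_id, self.subsystem_device_id)

    def summary(self) -> str:
        """One result per stage; stages without a report count as untested."""
        return ", ".join(
            f"{stage.value}={self.results.get(stage, Result.UNTESTED).value}"
            for stage in Stage
        )


# The example entry from the text, with the IDs given there.
example = GpuHclEntry(
    card_name="Radeon RX 5600XT (MSI GAMING X)",
    vendor_id=0x1002, device_id=0x731F,
    subsystem_vendor_id=0x1462, subsystem_device_id=0xC810,
    secure_boot=True, early_vtd=True,
    results={Stage.FIRMWARE_POST: Result.WORKS, Stage.OS_DRIVER: Result.WORKS},
)
print(example.summary())
```

Keeping the passthrough stage as an explicit "untested" instead of leaving it out makes it visible what a given report does not cover.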

Also, it seems that some vendors use different Subsystem Device IDs for different card models whereas others do not. For example, according to VBIOS dumps from the TechPowerUp VGA BIOS Collection, Gigabyte has at least two Radeon 6800XT cards from two different lines, AORUS Master and Gaming OC, and they use different Subsystem Device IDs:

Gigabyte RX 6800 XT 16 GB BIOS (AORUS Master) Device Id: 1002 73BF / Subsystem Id: 1458 232A

Gigabyte RX 6800 XT 16 GB BIOS (Gaming OC) Device Id: 1002 73BF / Subsystem Id: 1458 2328

MSI instead seems to have a different header in the VBIOS.

MSI RX 6800 XT 16 GB BIOS (Gaming X Trio) - https://www.techpowerup.com/vgabios/231128/msi-rx6800xt-16384-210112 VBIOS Version: 020.001.000.049.000000 113-V395TRIO-1OC Device Id: 1002 73BF / Subsystem Id: 1462 3951

MSI RX 6800 XT 16 GB BIOS (Gaming X Trio) - https://www.techpowerup.com/vgabios/230851/msi-rx6800xt-16384-201216 VBIOS Version: 020.001.000.049.000000 113-V395TRIO-1OC Device Id: 1002 73BF / Subsystem Id: 1462 3951

MSI RX 6800 XT 16 GB BIOS (Gaming X Trio) - https://www.techpowerup.com/vgabios/230852/msi-rx6800xt-16384-201124 VBIOS Version: 020.001.000.045.000000 GAMINGX Device Id: 1002 73BF / Subsystem Id: 1462 3951

The first two seem identical except for the checksum. The third one seems to belong to another card series based on its header...

WiktorG351 commented 2 months ago

I've explored the possibility of using the Phoronix Test Suite to validate GPU compatibility. Initially, I added the entire PTS to DTS and tried running some GPU tests, but several issues arose:

However, I did find that the phoronix-test-suite diagnostics command can provide some useful GPU information, such as device ID, video card details, and monitor configuration.

Given the complexity highlighted in the discussion, particularly regarding manual validation and the diverse scenarios where GPUs might "partially work," automating this process seems very challenging, although PTS itself seems like a promising resource.

macpijan commented 2 months ago

> There isn't enough memory for the tests.

Can you elaborate here? Do we know how much memory is needed? How much was in the test unit?

Any idea which distro is typically used for running PTS? Clearly integrating it into DTS is not the best idea based on your input.

pietrushnic commented 1 month ago

The problem here is that we would like to extract that information from HCL reports. The suggestion seems to be adding PTS to DTS to gather this information, but the key issue is how to extract it from HCLs.

pietrushnic commented 1 month ago

We should check how those other frameworks extract that information, like PTS or hw-probe, and then use that to extract information.

For now I have no other choice than to trust the lspci output and look for the VGA compatible controller.
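A minimal sketch of that lspci-based approach, assuming the HCL report contains `lspci -nn` output (one device per line, with the class code and the vendor:device IDs in square brackets). The function name and the sample text are illustrative, not taken from a real report. The filter accepts every PCI class 03xx device, because some GPUs show up as "3D controller" or "Display controller" rather than "VGA compatible controller".

```python
import re

# One line of `lspci -nn` output, for example:
# 01:00.0 VGA compatible controller [0300]: <name> [1002:731f] (rev ca)
LINE_RE = re.compile(
    r"^(?P<slot>\S+)\s+(?P<class_name>.+?)\s+\[(?P<class_code>[0-9a-f]{4})\]:\s+"
    r"(?P<name>.+)\s+\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
    r"(?:\s+\(rev (?P<rev>[0-9a-f]{2})\))?\s*$"
)


def find_display_devices(lspci_nn_text):
    """Return a dict per display-class (03xx) device found in `lspci -nn` text."""
    devices = []
    for line in lspci_nn_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group("class_code").startswith("03"):
            devices.append(match.groupdict())
    return devices


# Illustrative sample only; the device names are placeholders.
SAMPLE = """\
00:1f.3 Audio device [0403]: Example Vendor Example Audio [8086:0001] (rev 11)
01:00.0 VGA compatible controller [0300]: Example Vendor [Brand] Example GPU [Model A/Model B] [1002:731f] (rev ca)
"""

for dev in find_display_devices(SAMPLE):
    print(dev["slot"], dev["vendor"], dev["device"], dev["name"])
```

This only yields the GPU chip IDs. The Subsystem IDs discussed earlier in the thread are not part of the plain `lspci -nn` lines, so they would have to come from the verbose output, if the report includes it.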