TrenchBoot / trenchboot-issues

This repository is to centralize issues and development progress tracking for the TrenchBoot project.

AMD hardware selection #24

Closed by BeataZdunczyk 3 months ago

BeataZdunczyk commented 7 months ago

Affected component(s) or functionality (if applicable)

Phase 4 described here: Phase 4 - AMD support for Qubes OS AEM with TrenchBoot. Tracking here: https://github.com/TrenchBoot/trenchboot-issues/milestone/4

Brief summary

To successfully complete phase 4 of the AEM QubesOS project, it is crucial to identify suitable AMD processor-based platforms for development and testing. Specifically, one of the selected platforms should support the execution of the https://github.com/TrenchBoot/trenchboot-issues/issues/20 task.

krystian-hebel commented 7 months ago

For testing we will be using Supermicro M11SDV-4C-LN4F platform. It has UEFI capable of CSM booting, and plugging different TPM flavor shouldn't be too big of a problem, so all of the mentioned use cases can be covered on this platform.

It has BMC with iKVM that works reliably in setup menu, GRUB and Qubes OS. It doesn't even require workaround for tablet that we needed for PiKVM. The only small issue is that when platform is started without external monitor connected, the highest possible resolution is 1024x768. With monitor connected, 1600x1200 worked without any hiccups. We hope that this can be dealt with by using dummy VGA EDID adapter.

After setting SOL to use 0x3f8 instead of default 0x2f8 serial can be easily read. It was a nice surprise, because on another Supermicro board, X11, the output was heavily modified by BMC even on physical port.

We won't be able to directly mount Qubes OS installer ISO due to the size limitation (4.7 GB), so in the worst case we will have to plug in USB stick. On the flip side, this would be identical to how users install it.

Unfortunately, this platform doesn't cover multi-node CPU requirement. For that we will probably use ASUS KGPE D-16, but we have to do some tests before we can be sure.

SergiiDmytruk commented 6 months ago

Using BMC for remote control

ipmi backend of openQA mostly just invokes ipmitool, meaning that we could use it in generalhw manually. With the tool installed, openQA can power on/off the machine, but that's about it (no serial or video).
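The power control mentioned above comes down to an `ipmitool chassis power` call. A minimal sketch, assuming the BMC is reachable over the `lanplus` interface; the helper names, host and credentials are placeholders and are not taken from the generalhw scripts:

```python
import subprocess

# Power actions accepted by this sketch (a subset of what ipmitool offers).
POWER_ACTIONS = ("on", "off", "cycle", "reset", "status")


def ipmi_power_cmd(host, user, password, action):
    """Build the ipmitool argv for a chassis power action."""
    if action not in POWER_ACTIONS:
        raise ValueError(f"unsupported power action: {action}")
    # Assumption: lanplus (IPMI v2.0) works for this BMC.
    return ["ipmitool", "-I", "lanplus", "-H", host,
            "-U", user, "-P", password, "chassis", "power", action]


def ipmi_power(host, user, password, action):
    """Run the command and return its output; raises if ipmitool fails."""
    result = subprocess.run(ipmi_power_cmd(host, user, password, action),
                            check=True, capture_output=True, text=True)
    return result.stdout
```

Keeping command construction separate from execution makes it possible to check the arguments without a BMC at hand.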

That backend seems to support only reading from SoL and creates no VNC consoles. That could be addressed by modifying it, if there were a chance that it would work. https://github.com/thefloweringash/chicken-aten-ikvm?tab=readme-ov-file#protocol provides an overview of how the protocol differs from normal VNC. https://github.com/kelleyk/noVNC/tree/ast2100-support didn't work for me (also tried the bmc-support and bmc-support-old branches). Either way, it would have to be reimplemented in os-autoinst to be useful because of different authorization and encoding methods.

I did try to poke it on ports 5900 and 443 (the latter via WebSockets, as used by the JS scripts), but apparently the board is one of those that send nothing on connection, whereas normally the server sends RFB 003.008 or similar right away. Comments in the noVNC fork mention such boards, which is why I tried WebSocket, but that might be bound to a session cookie. I looked at the JS scripts served by the BMC (they are either readable or just unformatted, no obfuscation) and they don't seem to do anything unusual, so it could be the cookie or the request origin. Not sure how to fake those for a test. Even if it worked, that would be one more thing to do (log in via curl and extract the cookie) along with reimplementing parsing in Perl. At least some of their encodings seem to be quite complicated.
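For reference, the check for "does the server greet first" can be sketched as follows. A standard RFB server sends a 12-byte version banner such as `RFB 003.008\n` right after the TCP connection is established; the function name and timeout are illustrative, and this says nothing about how the BMC's own protocol behaves:

```python
import socket


def probe_rfb(host, port, timeout=3.0):
    """Return the RFB version (e.g. "003.008") if the server greets first, else None."""
    banner = b""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            # The standard ProtocolVersion message is exactly 12 bytes long.
            while len(banner) < 12:
                chunk = sock.recv(12 - len(banner))
                if not chunk:
                    break
                banner += chunk
    except OSError:
        # Covers refused connections and timeouts (the silent-server case).
        return None
    if len(banner) == 12 and banner.startswith(b"RFB ") and banner.endswith(b"\n"):
        return banner[4:11].decode("ascii")
    return None
```

A `None` result on a port that accepts connections matches the behaviour described above: the server waits for something from the client instead of announcing itself.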

So this is probably not happening, unless there is an easier way to get it working.

Using PiKVM for remote control

I saw firmware mentioning network boot on startup, so could use that to boot into installer.

1024x768 resolution is exactly what openQA tests need.

Unclear whether https://github.com/QubesOS/qubes-issues/issues/8322#issuecomment-1904423204 is a blocker for PiKVM setup. AEM can probably be tested without sys-usb, I think dom0 can use USB as well.

SergiiDmytruk commented 6 months ago

Hm, PXE boot might be relying solely on DHCP. There is "Booting from PXE/LAN" message but nothing happens afterwards. Enabling network stack in settings had no visible effect.

Using boot menu to get into EFI shell does work. So if iPXE or GRUB EFI binary will be reachable and network will work in them (ping in EFI shell doesn't produce any output), network boot should be possible.

Regarding SoL, turns out power management is done via ipmitool but console implementation uses ipmiconsole binary from freeipmi. Either way didn't work for me until I recalled

After setting SOL to use 0x3f8 instead of default 0x2f8 serial can be easily read.

and switched "Legacy console redirection" setting in BIOS to SoL (port configuration is in "Super IO configuration", not sure if they relate, I went to serial settings first). Now both tools work manually with ipmitool having some difficulties. Still doesn't work in openQA and it might be because implementation is meant to be used with a fork of freeipmi.

krystian-hebel commented 6 months ago

Using boot menu to get into EFI shell does work. So if iPXE or GRUB EFI binary will be reachable and network will work in them (ping in EFI shell doesn't produce any output), network boot should be possible.

There is a way to mount floppy image, and it worked on X11 at least through SUM utility (sum -i <bmc_ip> -u ADMIN -p <bmc_password> -c MountFloppyImage --file <file_path>) and GUI, I don't recall if I tried plain IPMI. In any case, that was enough to start iPXE.

SergiiDmytruk commented 6 months ago

Couldn't make it work via Web-UI, it says image is being uploaded without any signs of that. ipmitool, ipmiutils and freeipmi don't seem to have this functionality.

SergiiDmytruk commented 6 months ago

PiKVM captures video, but sometimes with some (probably insignificant) artifacts, and not every time. Resolution can change on video reset and either leave the screen blank or show just colorful noise. Observed resolutions:

Firmware is available on SoL and it might be possible to get into Firmware via SoL (F11 for boot menu didn't seem to work, but Del did).

PiKVM's drive is recognized by firmware. I've noticed no issues with PiKVM's mouse or keyboard in Qubes OS after changing value of usbcore.authorized_default kernel parameter from 0 to 1 (set to zero on first boot).
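Changing a parameter such as `usbcore.authorized_default` means rewriting one token of the kernel command line. A minimal sketch of that rewrite, assuming a plain space-separated command line without quoted values; where the command line is stored and how it is applied are left out on purpose:

```python
def set_kernel_param(cmdline, name, value):
    """Return cmdline with name=value set, replacing an existing setting in place."""
    new_arg = f"{name}={value}"
    out = []
    replaced = False
    # Assumption: no quoted values containing spaces, so split() is enough.
    for arg in cmdline.split():
        if arg == name or arg.startswith(name + "="):
            if not replaced:
                out.append(new_arg)
                replaced = True
            # Any further duplicates of the parameter are dropped.
        else:
            out.append(arg)
    if not replaced:
        out.append(new_arg)
    return " ".join(out)
```

For example, `set_kernel_param("quiet usbcore.authorized_default=0", "usbcore.authorized_default", 1)` yields `quiet usbcore.authorized_default=1`.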

Was able to boot wic-image of DTS connected via PiKVM's OTG and it had network working.

BMC's VNC works noticeably better than PiKVM, too bad it's some weird implementation.

Overall, seems like we can make Supermicro M11SDV-4C-LN4F work with PiKVM:

SergiiDmytruk commented 6 months ago

Power on/off via ipmitool works fine in generalhw scripts.

Video issues sometimes require rebooting PiKVM to fix. If it will remain as bad, can make PiKVM reboot a part of powering on process.

Booting with iPXE didn't work right away, in particular Ctrl-B prompt doesn't work. However, you can pass ipxe.efi commands as parameters when using EFI Shell (GRUB might not be able to do it), which starts Qubes OS installer. This led to hard-coding of fs0: and position in boot menu, will see how well it works.

SoL without RTE requires freeipmi fork which is nowhere to be found. Thought maybe changes were merged upstream, but doesn't seem like it. The issue is that it wants a terminal session, for now using script tool to provide terminal and that seems to work.
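The `script` workaround can be sketched like this: the console command is wrapped in util-linux `script`, which allocates a pseudo-terminal for it and discards the typescript. The helper names, host and credentials are placeholders; the actual invocation in the generalhw scripts may differ:

```python
import shlex


def sol_cmd(host, user, password):
    """ipmiconsole argv (freeipmi) for a Serial-over-LAN session."""
    return ["ipmiconsole", "-h", host, "-u", user, "-p", password]


def with_pty(argv):
    """Wrap argv in `script` so the child process sees a terminal."""
    # -q: suppress start/done messages
    # -f: flush output after each write
    # -c: run the given command instead of an interactive shell
    # /dev/null: throw away the typescript file
    return ["script", "-q", "-f", "-c", shlex.join(argv), "/dev/null"]
```

The resulting list can be passed to `subprocess.Popen` by a caller that has no terminal of its own.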

Latest test run started the installer. There are still issues:

krystian-hebel commented 6 months ago

However, you can pass ipxe.efi commands as parameters when using EFI Shell (GRUB might not be able to do it), which starts Qubes OS installer.

Does this start the installer in UEFI mode? It may cause problems later in GRUB if that's the case. On the other hand, we will need this in next phase so testing if it at least installs this way wouldn't be a bad idea.

SergiiDmytruk commented 6 months ago

Looks like Ctrl-B issue is a known one. Esc+B works instead in such cases.

ustreamer patch used with OptiPlex didn't work here, probably a different video encoder gets used, so RGB<->BGR conversion should be moved to different function/file.

Does this start the installer in UEFI mode?

Ah, sure... systemd complained in dmesg about some EFI variable missing. I tried using ipxe.lkrn too (it also handles parameters), but it didn't work probably due to EFI GRUB. Thanks. I'll keep the code for testing this after things work in general.

SergiiDmytruk commented 5 months ago

Tried passing efi=noruntime parameter to the installer in case it would be enough to block EFI stuff, but it didn't work (/sys/firmware/efi was still there). Installed GRUB on an image uploaded to PiKVM, so it could be started in legacy mode to load ipxe.lkrn and that works.

As for RGB<->BGR conversion, I forgot that I forced the CPU encoder for OptiPlex. Doing the same here worked as well, but it might be worth trying to swap bytes for the M2M encoder, as the CPU already has a hard time keeping up. Will see how well it works.
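The byte swap itself is simple: in a packed 24-bit frame, the first and third byte of every pixel trade places. The sketch below only illustrates the operation on a raw buffer; it is not the ustreamer patch, which lives in C inside the encoder path:

```python
def swap_rb(frame):
    """Swap bytes 0 and 2 of every 3-byte pixel (RGB24 <-> BGR24)."""
    if len(frame) % 3:
        raise ValueError("frame length must be a multiple of 3")
    out = bytearray(frame)
    # Extended slices address every third byte, so no per-pixel loop is needed.
    out[0::3] = frame[2::3]
    out[2::3] = frame[0::3]
    return bytes(out)
```

Applying the function twice returns the original buffer, so the same routine converts in both directions.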

For better reliability can parse EFI boot menu to know where PiKVM's drive is at and because GRUB has hard-coded commands for iPXE, the rest of the boot process should not need anything from openQA server and thus must be reliable.
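Parsing the boot menu could look roughly like the sketch below: given the menu entries as text (for example, captured from the serial console), find the wanted entry and compute the key presses that select it. The entry labels in the usage note and the key names (`down`, `up`, `ret`) are assumptions for illustration:

```python
def keys_to_entry(menu_lines, wanted, selected=0):
    """Return key presses that move the highlight to the first entry containing `wanted`."""
    entries = [line.strip() for line in menu_lines if line.strip()]
    for index, entry in enumerate(entries):
        if wanted.lower() in entry.lower():
            delta = index - selected
            key = "down" if delta > 0 else "up"
            # Move the highlight, then confirm the selection.
            return [key] * abs(delta) + ["ret"]
    raise LookupError(f"no boot entry matching {wanted!r}")
```

With a hypothetical menu of `["UEFI: Built-in EFI Shell", "UEFI: PiKVM CD-ROM Drive", "Enter Setup"]`, asking for `"pikvm"` gives `["down", "ret"]`, regardless of where the firmware decides to put the drive on a given boot.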

SergiiDmytruk commented 5 months ago

For better reliability can parse EFI boot menu to know where PiKVM's drive is at

That seems to work well.

because GRUB has hard-coded commands for iPXE

Now boot image is updated by openQA to get rid of hard-coded IP addresses for iPXE, which also works fine.
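Generating the iPXE part per run, instead of hard-coding addresses, can be sketched as a small template function. The server address and the chained path are hypothetical; `dhcp` and `chain` are standard iPXE commands, but how the script ends up inside the boot image is not shown here:

```python
def ipxe_script(server, path="/ipxe/qubes.ipxe"):
    """Render an iPXE script that configures the network and chains to the server."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    lines = [
        "#!ipxe",
        "dhcp",                             # obtain an address for the interface
        f"chain http://{server}{path}",     # hypothetical URL layout
    ]
    return "\n".join(lines) + "\n"
```

The caller would pass the address of the openQA worker for the current run and write the result into the image before it is handed to PiKVM.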

As for RGB<->BGR conversion, I forgot that I forced the CPU encoder for OptiPlex. Doing the same here worked as well, but it might be worth trying to swap bytes for the M2M encoder, as the CPU already has a hard time keeping up. Will see how well it works.

This does work, but the colors are still off and that's what capturing video in RGB format was supposed to fix. Captured video is also displaced to the right a bit (unlike on BMC), but needles seem to not care about that, it's the colors which break them. This could be related to video converter used to make the capture possible in which case the fix might need to have a form of some image post-processing to adjust colors if that can be done or teaching os-autoinst to care less about them.

SergiiDmytruk commented 5 months ago

Actually, RGB colors did change and even became closer to what openQA needles expect, but still not close enough. Tried to find a way to adjust colors, but not sure if there is a pattern. It's unknown how the UYVY -> RGB24 conversion is done and to what degree it is reversible.
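To show why reversibility is in doubt, here is one common way such a conversion can be done: the widely used integer approximation of BT.601 limited-range YCbCr to RGB, applied to packed UYVY input. This is only an assumption for illustration; the converter in this setup may use different coefficients or ranges:

```python
def _clip(value):
    """Clamp an intermediate result to the 0..255 byte range."""
    return max(0, min(255, value))


def uyvy_to_rgb24(data):
    """Convert packed UYVY (U Y0 V Y1 per two pixels) to RGB24.

    Assumption: BT.601 limited range, integer approximation.
    """
    if len(data) % 4:
        raise ValueError("UYVY data length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(data), 4):
        u, y0, v, y1 = data[i:i + 4]
        d, e = u - 128, v - 128          # chroma is shared by both pixels
        for y in (y0, y1):
            c = 298 * (y - 16)
            out.append(_clip((c + 409 * e + 128) >> 8))            # R
            out.append(_clip((c - 100 * d - 208 * e + 128) >> 8))  # G
            out.append(_clip((c + 516 * d + 128) >> 8))            # B
    return bytes(out)
```

Under these assumptions the mapping loses information in two places: the clamp maps several out-of-range inputs to the same RGB value, and the two pixels of each pair share one chroma sample. Either is enough to prevent recovering the original colors exactly.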

Also tried to modify os-autoinst to not care about colors that much. Grayscale isn't enough because brightness also changes significantly. Trying to apply a dynamic threshold to force a black&white image didn't produce matches because one image is much bigger than the other and the processed images end up not agreeing with each other.

Because of the converter, it might be impossible to get correct colors, so creating needles seems like the only feasible option by now. It also doesn't go that smoothly. Just reusing areas from corresponding upstream needles doesn't always work because they are somewhat offset by different amount.

SergiiDmytruk commented 5 months ago

Created most of the needles necessary for automatic installation. Maybe one or two more are missing for installation (one could also be problematic because button to press is almost entirely pushed off the screen) and then some for AEM.

krystian-hebel commented 5 months ago

I thought you were talking about few pixels to the right as sometimes happened on other PiKVMs, but this is much worse than that:

image

Any idea if this can be fixed in any way? By using different EDID maybe?

SergiiDmytruk commented 5 months ago

This is much worse, but I think it happens only on this particular firstboot screen and only when you return to it (looks fine at first). Don't know if EDID can affect this. The permanent right offset is a separate thing and is visible only on PiKVM.

SergiiDmytruk commented 4 months ago

I think openQA can install Qubes OS by now (segfault on PiKVM stopped installation in the middle and I just continued it, so the whole process wasn't continuous yet, but it either works or mostly works). Now updating AEM test for Supermicro: differences between TXT vs. SKINIT and machine-specific clearing of the TPM, also AEM-specific needles will be needed later.

SergiiDmytruk commented 4 months ago

AEM part also partially works. TPM is cleared. AEM gets installed, but I didn't get to needles because Xen reboots without printing any output. Not that it's fully functional, but it should print some logs when built from https://github.com/TrenchBoot/xen/pull/10 on top of 4.17.2-7, which does boot fine without AEM.

SergiiDmytruk commented 4 months ago

Added missing needles after making AEM work (Dom0 failed to load because Xen put it over TPM event log, so changed how the log is allocated). I also used wrong packages initially, which is why there was no output at all.

There is still an issue with reliability here: the hardware works great on some days and drives me crazy on others. PiKVM can need multiple reboots to start grabbing video, and then it can work but break the VNC connection. Supermicro also misbehaves and takes a while to boot (it reboots multiple times during firmware initialization before actually booting), but this might happen only after a failed SKINIT boot. When it boots, some parts of the serial output are missing, making tests fail. In other words, it works, but might not always be usable and can take many retries before it finishes successfully.

Video adapter and/or PiKVM might be at fault and it's hard to work around that. Breaking VNC connection could theoretically be worked around in os-autoinst to make it reconnect, but my quick attempt that looks like it should work for some reason doesn't.

SergiiDmytruk commented 4 months ago

Added rebooting of PiKVM to the flash script, but it doesn't seem to make things more stable.

SergiiDmytruk commented 4 months ago

openQA seems to work now for both Qubes OS setup and installing and testing AEM on Supermicro. But using it is a pain, something needs to be done about hardware to make this practically usable. As it currently stands, one might have to restart tasks 20-30 times or more to see them finish successfully (and aem-first-boot might need to be skipped depending on when it has failed before).

Scripts are in https://github.com/TrenchBoot/openqa-tests-qubesos/tree/3mdeb-lab/generalhw/supermicro