NixOS / nixpkgs

regression: tianocore/edk2 (ovmf uefi) update breaks qemu/kvm vm's #164064

Closed lodi closed 9 months ago

lodi commented 2 years ago

Describe the bug

Commit 9222b68380eb9195c7c87ca460fcb626d4e600ce broke my virt-manager VMs. Every VM starts but then immediately hangs at a black screen with 100% CPU usage on one core. htop shows that thread's time split roughly 30/70 between kernel and virtualization. I tried leaving the VMs running for a few hours, but that made no difference. I also tried clearing the nvram folder (/var/lib/libvirt/qemu/nvram/*), but that didn't help either.

9222b68380eb9195c7c87ca460fcb626d4e600ce updated the OVMF UEFI firmware my VMs were using, so I bisected the edk2 source and found the offending upstream commit. It has something to do with TPM, and indeed I have software TPM enabled in my system config:

  <!-- virt-manager config -->
  <os>
    <type arch="x86_64" machine="pc-q35-5.0">hvm</type>
    <loader readonly="yes" type="rom">/run/libvirt/nix-ovmf/OVMF_CODE.fd</loader>
    <nvram>/var/lib/libvirt/qemu/nvram/ubuntu_VARS.fd</nvram>
  </os>

{
  # NixOS system config
  virtualisation.libvirtd = {
    enable = true;
    onBoot = "ignore";
    onShutdown = "shutdown";

    qemu = {
      ovmf = {
        enable = true;
        package = pkgs.OVMFFull; # (1)
      };
      runAsRoot = false;
      swtpm.enable = true; # (2)
    };
  };
}

I tried removing the line marked (2), but that didn't change anything (the particular Ubuntu VM I'm trying to boot isn't even using TPM). I also tried removing the line marked (1) to force the system to use the pkgs.OVMF default, but that started throwing "qemu: could not load PC BIOS '/run/libvirt/nix-ovmf/OVMF_CODE.fd'" errors even though that file exists. I'm not sure what else I can try short of recreating the VMs from scratch.

Expected behavior

The VM boots to the TianoCore UEFI logo, then proceeds to boot as normal.

Additional context

Ryzen 3950x Asrock X570 Taichi

Notify maintainers

@alyssais

Metadata

Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

[user@system:~]$ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 5.15.28, NixOS, 22.05 (Quokka)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.7.0`
 - channels(root): `"nixos"`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`

putchar commented 2 years ago

Hello

Intel CPU here; I have the same issue.

[ λ gitlab.com/putchar/dotnix ]
19:14 $ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 5.16.13, NixOS, 22.05 (Quokka)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.6.1`
 - channels(putchar): `""`
 - nixpkgs: `/etc/nixpkgs/channels/nixpkgs`

pshirshov commented 2 years ago

Similar issue, AMD CPU and VFIO for GPU. Works without VFIO.

pshirshov commented 2 years ago

I've tried to revert OVMF back but it doesn't help:

  virtualisation = {

    libvirtd = {
      qemu.ovmf.package = pkgs.OVMFFull.override {
        secureBoot = true;
        csmSupport = true;
        httpSupport = true;
        tpmSupport = true;
        edk2 = pkgs.edk2.overrideAttrs (oldAttrs: rec {
          version = "202108";

          src = pkgs.fetchFromGitHub {
            owner = "tianocore";
            repo = "edk2";
            rev = "edk2-stable202108";
            fetchSubmodules = true;
            sha256 = "1ps244f7y43afxxw6z95xscy24f9mpp8g0mfn90rd4229f193ba2";
          };

          patches = [
            # Pull upstream fix for gcc-11 build.
            (pkgs.fetchpatch {
              name = "gcc-11-vla.patch";
              url = "https://github.com/google/brotli/commit/0a3944c8c99b8d10cc4325f721b7c273d2b41f7b.patch";
              sha256 = "10c6ibnxh4f8lrh9i498nywgva32jxy2c1zzvr9mcqgblf9d19pj";
              # Apply submodule patch to subdirectory: "a/" -> "BaseTools/Source/C/BrotliCompress/brotli/"
              stripLen = 1;
              extraPrefix = "BaseTools/Source/C/BrotliCompress/brotli/";
            })
          ];
        });

      };

      qemu.swtpm.enable = true;
    };
  };

putchar commented 2 years ago

Hello. Using a flake, I can make it work with either the nixos-stable input or the pinned nixpkgs commit below:

{
  inputs = {
    nixos-stable.url = "github:NixOS/nixpkgs/nixos-21.11";
    nixos-unstable.url = "github:NixOS/nixpkgs/nixos-unstable";

    ## OVMFFull working commit
    #edk2-202108.url = "github:NixOS/nixpkgs/cab0dd3777d0e98ff37fd804fa6f8797331c85fd";
  };

  outputs = inputs @ {self, ...}: {
    customNixOSModules = import ./NixOS/modules;

    nixosConfigurations = {
      desktop = inputs.nixos-unstable.lib.nixosSystem {
        system = "x86_64-linux";
        # Pass the flake inputs to modules so they can pick packages from a specific input
        specialArgs = {inherit inputs;};
        modules = [
          ./NixOS/hosts/desktop/configuration.nix
          self.customNixOSModules
        ];
      };
    };
  };
}

And inside my virt-manager module:

{
  config,
  pkgs,
  lib,
  inputs,
  ...
}:
with lib; let
  cfg = config.customNixOSModules.virt;
in {
  options.customNixOSModules.virt = {
    enable = mkOption {
      type = types.bool;
      default = false;
      description = ''
        whether to enable virt globally or not
      '';
    };
  };

  config = mkIf cfg.enable {
    virtualisation = {
      libvirtd = {
        enable = true;
        qemu = {
          ovmf.enable = true;
          swtpm.enable = true;
          #ovmf.package = inputs.edk2-202108.legacyPackages.x86_64-linux.OVMFFull;
          ovmf.package = inputs.nixos-stable.legacyPackages.x86_64-linux.OVMFFull;
        };
      };
    };
  };
}

putchar commented 2 years ago

Hello everyone, I just updated all of my virt config to the latest revision of nixos-unstable, and it seems to be working again.

Could you check whether your config works on your side?

putchar commented 2 years ago

I can confirm my machines work with the following declarative configuration:

    virtualisation = {
      libvirtd = {
        enable = true;
        qemu = {
          ovmf.enable = true;
          ## as of now, with nixpkgs-unstable,
          ## when i use OVMFFull it doesn't work anymore
          #ovmf.package = pkgs.OVMFFull;
          swtpm.enable = true;
        };
      };
    };

When I use the OVMFFull package, it is broken.

lodi commented 2 years ago

I tried latest nixpkgs-unstable with and without OVMFFull, with old vms and brand new vms, and in all cases I still have the original problem. I can only get it to work by downgrading edk2.

pshirshov commented 2 years ago

Yeah, the same.

putchar commented 2 years ago

@pshirshov @lodi I am on the NixOS Discord server as well as on Matrix (any Nix/NixOS channel, really), under the same nick. Perhaps we could chat there, see what is not working for you, and update this issue with our findings.

lodi commented 2 years ago

Ok so it looks like OVMF default (not OVMFFull... that one hangs 100% of the time) + new vms + minimal config + <loader readonly="yes" type="pflash">/run/libvirt/nix-ovmf/OVMF_CODE.fd</loader> does indeed work. (Previously I had it set to type="rom" to work around some libvirt snapshot issue).

Here's my verbatim libvirt section with changes made to bring it as close to @putchar's config as possible:

{
  # virtualisation.virtualbox.host.enable = true;

  virtualisation.libvirtd = {
    enable = true;
    onBoot = "ignore";
    onShutdown = "shutdown";

    qemu = {
      ovmf = {
        enable = true;
        # package = pkgs.OVMFFull;
      };
      # runAsRoot = false;
      swtpm.enable = true;
      verbatimConfig = ''
        namespaces = []

        cgroup_device_acl = [
          "/dev/null", "/dev/full", "/dev/zero",
          "/dev/random", "/dev/urandom",
          "/dev/ptmx", "/dev/kvm", "/dev/kqemu",
          "/dev/rtc", "/dev/hpet", "/dev/net/tun",

          "/dev/input/by-id/uinput-kinesis",
          "/dev/input/by-id/uinput-m2k",
          "/dev/input/by-id/uinput-mm710"
        ]
      '';
    };
  };

  # users.users.qemu-libvirtd.extraGroups = [ "input" "audio" "jackaudio" ];
  users.users.lodi.extraGroups = [ #"qemu-libvirtd"
    ... ];
}

pshirshov commented 2 years ago

So, I've re-checked the state of this issue. OVMFFull is still broken for me (TPM+VFIO); default OVMF now works.

UPD: nope, I was wrong. VMs do boot, but TPM is broken (swtpm.enable set to true).

mlen commented 2 years ago

I can confirm that the full package is still broken, but the default one works fine

pshirshov commented 2 years ago

To be precise, it works fine unless you need TPM.

pshirshov commented 2 years ago

Nope, that doesn't fix the problem for me. VMs with VFIO devices just busy-hang immediately. It seems that something is wrong with VFIO in the full builds, not TPM.

pshirshov commented 2 years ago

So, below is a summary of my issues:

1) OVMFFull on nixpkgs/cab0dd3777d0e98ff37fd804fa6f8797331c85fd: win11 VMs with Nvidia GPU under VFIO and, obviously, with TPM, work.
2) OVMFFull after nixpkgs/cab0dd3777d0e98ff37fd804fa6f8797331c85fd: win11 VMs with Nvidia GPU under VFIO hang immediately on start, before the GPU is initialized. If I add virtual video (QXL), my VFIO GPU gets initialized and the VM hangs after displaying the TianoCore logo. If I use ramfb, the VM hangs without displaying anything.
3) The same applies to https://github.com/NixOS/nixpkgs/pull/181498
4) TPM indeed works well (win11 boots) for https://github.com/NixOS/nixpkgs/pull/181498
5) TPM works well (win11 boots) on nixpkgs/nixos-unstable. This is expected, I believe win11 does not use TPM 1.0.
6) The VFIO hang manifests regardless of the presence or absence of shmem devices (for Looking Glass).
7) The VFIO hang manifests regardless of the presence or absence of TPM devices.

So the actual problem is not related to TPM. VFIO is somehow broken in the new edk2, but for some reason it only manifests in full OVMF builds.

pshirshov commented 1 year ago

Still broken on master (edk2 202205)

pshirshov commented 1 year ago

https://bugzilla.tianocore.org/show_bug.cgi?id=4137

pshirshov commented 1 year ago

The solution is to build OVMF without CSM:

      qemu.ovmf.packages = [
        (pkgs.OVMF.override {
          secureBoot = true;
          csmSupport = false;
          httpSupport = true;
          tpmSupport = true;
        }).fd
      ];

Probably csmSupport should be set to false by default and, preferably, should be an option.
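
To illustrate the "should be an option" idea, here is a minimal sketch of a local NixOS module that exposes CSM as a boolean switch. The option path local.ovmf.csmSupport is hypothetical and exists only for this example; the override arguments are the ones used in the snippet above, and the sketch assumes the nixpkgs revision in use still accepts csmSupport.

```nix
{ config, lib, pkgs, ... }:
let
  # Hypothetical option namespace, used only for this sketch.
  cfg = config.local.ovmf;
in {
  options.local.ovmf.csmSupport = lib.mkOption {
    type = lib.types.bool;
    default = false; # CSM off unless explicitly requested
    description = "Whether to build OVMF with the Compatibility Support Module.";
  };

  config.virtualisation.libvirtd.qemu.ovmf.packages = [
    (pkgs.OVMF.override {
      secureBoot = true;
      csmSupport = cfg.csmSupport; # assumed to still be an override argument
      httpSupport = true;
      tpmSupport = true;
    }).fd
  ];
}
```

With this in place, setting local.ovmf.csmSupport = true; in a host configuration would opt back in to the CSM build; this is a local workaround sketch, not something nixpkgs provides.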

I'm not going to submit any patches because the contributor experience is really disappointing.

lodi commented 1 year ago

@pshirshov, thanks for the investigation. Disabling csmSupport worked for me too.

@putchar, can you recommend what I should do next here? OVMFFull comes from:

nixpkgs/pkgs/top-level/all-packages.nix
{ 
  OVMF = callPackage ../applications/virtualization/OVMF { };
  OVMFFull = callPackage ../applications/virtualization/OVMF {
    secureBoot = true;
    csmSupport = true;
    httpSupport = true;
    tpmSupport = true;
  };
}

Reading through the bugzilla link above, it looks like the OVMF devs themselves recommend disabling CSM (compatibility support module, i.e. bios emulation). This would be a breaking change but I find myself agreeing with their reasoning--why would you want SeaBIOS handling part of your OVMF stack when you can just switch the vm to SeaBIOS? Alternatively I can just introduce an OVMFSecure attribute with only secureBoot and tpmSupport enabled; this will handle the common case for people wanting to run win11 vm's.
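
As a rough sketch of what the OVMFSecure idea could look like locally, an overlay can define the attribute without touching all-packages.nix. The name OVMFSecure is only the proposed name from this comment, not an attribute that exists in nixpkgs here, and the sketch assumes secureBoot and tpmSupport are accepted as override arguments, as in the callPackage call above.

```nix
{ pkgs, ... }:
{
  nixpkgs.overlays = [
    (final: prev: {
      # Proposed attribute: Secure Boot and TPM only, no CSM.
      OVMFSecure = prev.OVMF.override {
        secureBoot = true;
        tpmSupport = true;
      };
    })
  ];

  # Point libvirt at the firmware output of the new attribute.
  virtualisation.libvirtd.qemu.ovmf.packages = [ pkgs.OVMFSecure.fd ];
}
```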

pshirshov commented 1 year ago

As I said before, the contributor experience is awful. IMO it's more convenient to fix things locally/adhoc than to try to push any patches to the upstream which is very unwelcoming and discouraging any contributions.

lodi commented 10 months ago

I tried this again with pkgs.OVMFFull.fd on a new system, with a new Nvidia GPU, and I still have exactly the same problem as before; it only works if I override the package and disable csmSupport. I'm going to try submitting a pull request to add an "OVMFSecure" entry to make it easier to configure a Windows 11 VM out of the box.

lodi commented 9 months ago

Ok, turns out CSM also interferes with SecureBoot among other features, so it was decided to just remove CSM support from OVMFFull in commit acde5fd.

Now I'm just using this for a Win11 guest with vfio passthrough of an nvidia gpu:

qemu = {
  ovmf = {
    enable = true;
    packages = [ pkgs.OVMFFull.fd ];
  };
  swtpm.enable = true;
}

I think we can close this issue since you now have to explicitly enable CSM support in an override to create the conditions for the hang.
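
For reference, "explicitly enable CSM support in an override" would look roughly like the sketch below. It assumes the csmSupport argument is still accepted by the OVMF package in the nixpkgs revision in use, and it is shown only to make clear what configuration recreates the hang, not as something to deploy.

```nix
{ pkgs, ... }:
{
  virtualisation.libvirtd.qemu.ovmf.packages = [
    # Opting back in to CSM recreates the conditions for the hang described in this issue.
    (pkgs.OVMFFull.override { csmSupport = true; }).fd
  ];
}
```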

lodi commented 9 months ago

Fixed in https://github.com/NixOS/nixpkgs/commit/acde5fd0270d8d9d4f5ffe6b7ddafeba9ac652f9.

onny commented 9 months ago

What frontend are you using? Does it work with Gnome-Boxes too?

lodi commented 9 months ago

I'm using virt-manager. Just learning about Boxes today... looks nice but I've never used it.

onny commented 9 months ago

It's possible to choose between UEFI and BIOS in Boxes when setting up a new VM. Somehow UEFI doesn't work for me yet; it always boots in BIOS mode.

mannp commented 9 months ago

Fixed in acde5fd.

Intel CPU graphics broke for me with virt-manager... it was working fine up until now.

      qemu = {
        ovmf = {
          enable = true;
          #packages = [ pkgs.OVMFFull.fd ];
          packages = [ pkgs.OVMF.fd ];
        };
        swtpm.enable = true;
        runAsRoot = false;
      };

lodi commented 9 months ago

@mannp It broke for you with pkgs.OVMF.fd? Basic OVMF wasn't changed; just OVMFFull...

mannp commented 9 months ago

Strange, I lost opengl and 3d after this update. I had commented out full as it was previously broken for me.

Will try some options again.