NixOS / nixpkgs


How to use custom filesystems inside a virtual machine with bootloader? #228572

Open spietras opened 1 year ago

spietras commented 1 year ago

Issue description

I'm using a custom filesystem layout (tmpfs on root + ZFS for persistence, but that doesn't really matter). When installing my configuration on a new machine, I use an installer script that creates the partitions and filesystems. But before that, I want to test my configuration inside a QEMU virtual machine on my development computer. To do that, I followed the tests in #178531 with slight modifications: I defined my filesystems, set useDefaultFilesystems to false, and added commands to postDeviceCommands that mimic what the installer script does to set up partitions and filesystems, except for the boot partition.

This approach worked for quite some time. The last commit it works on is 13ea5dc163f5abde5ed5954b75179eee7c420a8e. If you use the configuration I provided below, the system boots correctly and you can run lsblk -o NAME,FSTYPE,PARTLABEL,LABEL,SIZE,MOUNTPOINTS to get something similar to:

NAME    FSTYPE  PARTLABEL           LABEL   SIZE    MOUNTPOINTS
vda                                         1G                 
└─vda1  ext4    root                root    1022M   /          
vdb                                         120M               
├─vdb1          BIOSBootPartition           1007K              
└─vdb2  vfat    EFISystem           BOOT    119M    /boot      

I don't really know the internals of what's happening, but it seems that in this situation there are two disks, with the boot partition on the second one. This is not the same as my target configuration, where I have only one disk and the boot partition sits on the same disk as the other partitions, but it allows you to test almost everything.

Now, with the newest commit, things no longer work. I get an error saying:

error: EFI variables can be used only with a partition table of type: hybrid, efi or legacy+gpt.

I investigated a little, and the change that produces the error was introduced in commit 76c7b656bfa9b20a4172f7901285560db4c2c695 by @RaitoBezarius. It seems that the whole virtual machine image-building workflow got an overhaul and my approach no longer works.

I feel that this is not a bug in the internal logic, but that my approach was wrong. Can someone point me in the right direction: what should I do to be able to use my custom filesystems inside a virtual machine with a bootloader?

Steps to reproduce

This is a minimal reproducible example. Create flake.nix:

{
  inputs.nixpkgs = {
    type = "github";
    owner = "NixOS";
    repo = "nixpkgs";
    rev = "5e9303c061a896530037496222950feb7284d54d";
    # it works with the commit below
    #rev = "13ea5dc163f5abde5ed5954b75179eee7c420a8e";
  };

  outputs = inputs: {
    nixosConfigurations = {
      foo = let
        system = "x86_64-linux";
        pkgs = import inputs.nixpkgs {inherit system;};
      in
        inputs.nixpkgs.lib.nixosSystem {
          system = system;
          modules = [
            {
              boot.loader.systemd-boot.enable = true;
              system.stateVersion = "23.05";
              users.users.root.password = "";

              virtualisation.vmVariantWithBootLoader = {
                boot.initrd.postDeviceCommands = ''
                  ${pkgs.parted}/bin/parted --script /dev/vda -- mklabel gpt mkpart root 0% 100%
                  ${pkgs.e2fsprogs}/bin/mkfs.ext4 -L root /dev/disk/by-partlabel/root
                '';

                virtualisation = {
                  fileSystems."/" = {
                    device = "/dev/disk/by-label/root";
                    fsType = "ext4";
                    neededForBoot = true;
                  };

                  useDefaultFilesystems = false;
                  useEFIBoot = true;
                };
              };
            }
          ];
        };
    };
  };
}

and run:

nix --extra-experimental-features 'nix-command flakes' run .#nixosConfigurations.foo.config.system.build.vmWithBootLoader
alyssais commented 1 year ago

#228346 might be relevant.

RaitoBezarius commented 1 year ago

Hey there, original author of the patch that broke your use case here; apologies for that. We tested it across nixpkgs, but no one had a use case like yours, so I am discovering it now.

My intuition is the following: make-disk-image is responsible for that assert error. It has no way to know that you want a GPT partition table, because you create it imperatively, as a string of commands, in the initrd, and useDefaultFilesystems passes none as the partition table.

Obviously, the semantic question now is: should useDefaultFilesystems = false; still take care of the default partition table?

If the answer is yes, then we probably want a useDefaultPartitionTable option in the future, and we should make it clear that partition table setup is handled by the QEMU test infrastructure; you only have to deal with partitions.

If the answer is no, I am not really certain what the best way is. Clearly, the user would then need to provide the system image and fill in all the blanks for the exact use case they want.

It might be much more complicated, because some of those blanks rely on knowledge of things like the closure information. But it would provide maximum flexibility for such use cases.
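
For context, the assert comes from nixos/lib/make-disk-image.nix, which the VM-with-bootloader variant uses to build the system image. A rough sketch of what is happening (argument names from memory; see make-disk-image.nix for the real interface):

import "${nixpkgs}/nixos/lib/make-disk-image.nix" {
  inherit pkgs lib config;
  # With useDefaultFilesystems = false, the VM variant effectively ends up
  # here with no partition table, while useEFIBoot requests EFI variables.
  partitionTableType = "none";
  # The assert demands "efi", "hybrid" or "legacy+gpt" in that case.
}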

RaitoBezarius commented 1 year ago

https://github.com/NixOS/nixpkgs/pull/228734 is the implementation of the "the answer is yes" branch of my explanation.
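
With that PR applied, your reproduction should only need one extra line (a sketch; it assumes the new option sits next to useDefaultFilesystems):

virtualisation.vmVariantWithBootLoader.virtualisation = {
  # let the infrastructure keep creating the default (GPT) partition table
  useDefaultPartitionTable = true;
  # while you keep managing the filesystems yourself
  useDefaultFilesystems = false;
  useEFIBoot = true;
};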

RaitoBezarius commented 1 year ago

Tested with

{
  inputs.nixpkgs = {
    type = "github";
    owner = "NixOS";
    repo = "nixpkgs";
    rev = "84966c085e2b8fe55748959f4f2fc5957f937d28";
    # it works with the commit below
    #rev = "13ea5dc163f5abde5ed5954b75179eee7c420a8e";
  };

  outputs = inputs: {
    nixosConfigurations = {
      foo = let
        system = "x86_64-linux";
        pkgs = import inputs.nixpkgs {inherit system;};
      in
        inputs.nixpkgs.lib.nixosSystem {
          system = system;
          modules = [
            {
              boot.loader.systemd-boot.enable = true;
              system.stateVersion = "23.05";
              users.users.root.password = "";

              virtualisation.vmVariantWithBootLoader = {
                boot.initrd.postDeviceCommands = ''
                  ${pkgs.parted}/bin/parted --script /dev/vda -- mklabel gpt mkpart root 0% 100%
                  ${pkgs.e2fsprogs}/bin/mkfs.ext4 -L root /dev/disk/by-partlabel/root
                '';

                virtualisation = {
                  fileSystems."/" = {
                    device = "/dev/disk/by-label/root";
                    fsType = "ext4";
                    neededForBoot = true;
                  };

                  useDefaultFilesystems = false;
                  useEFIBoot = true;
                };
              };
            }
          ];
        };
    };
  };
}
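
and run the same way as in your reproduction:

nix --extra-experimental-features 'nix-command flakes' run .#nixosConfigurations.foo.config.system.build.vmWithBootLoader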
spietras commented 1 year ago

This is probably going too far, but if I were to take a shot at redesigning the whole virtual machine side of a NixOS configuration from an ease-of-use point of view, something like this would give users a lot of flexibility:

{
  outputs = inputs: {
    nixosConfigurations = {
      foo = inputs.nixpkgs.lib.nixosSystem {
        system = "x86_64-linux";

        # Target system configuration
        systemModules = [
          (
            # Pass only system configuration as an input
            {config, ...}: {
              # Concerning virtualisation possibilities inside the target system
              virtualisation = {
                docker.enable = true;
              };
            }
          )
        ];

        # Virtual machine variant of the system for outside usage
        vmModules = [
          # You can override the system configuration here
          (
            # Pass both system and vm configurations as inputs
            {
              systemConfig,
              vmConfig,
              ...
            }: {
              networking.hostName = "${systemConfig.networking.hostName}-vm";
            }
          )
          # And you can define virtual machine configuration inside 'vm' attribute
          {
            vm = {
              memorySize = 2048;

              # Each disk image can be created with qemu-img
              # Then parted can be used to create partitions on the image
              # Then we can mount the image inside a temporary virtual machine and run a script to set up the filesystems
              # And finally we can attach the images as virtio drives in qemu for the target virtual machine
              diskImages = [
                # Position in the list is the drive index, so this is /dev/vda (next one would be /dev/vdb)
                {
                  # Options to use in qemu-img create
                  size = "1G";
                  format = "qcow2";
                  options = {
                    preallocation = "full";
                  };

                  # This is the path on the host system
                  # It is persisted between runs
                  # But if the configuration changes, it will be replaced with a new image
                  file = "disk.qcow2";

                  # As in mklabel in parted
                  partitionTable = "gpt";

                  # Each one translates to a mkpart in parted
                  # They will be executed in order so each one gets its own predictable number
                  partitions = [
                    # We know this is /dev/vda1 (and also /dev/disk/by-partlabel/boot since we're using GPT)
                    {
                      name = "boot";
                      filesystem = "fat32";
                      start = "1MB";
                      end = "512MB";
                      flags = ["boot" "esp"];
                    }
                    # And this is /dev/vda2 (and also /dev/disk/by-partlabel/root since we're using GPT)
                    {
                      name = "root";
                      filesystem = "ext4";
                      start = "512MB";
                      end = "100%";
                    }
                  ];
                }
              ];

              # Additionally, make it possible to attach any host drives
              hostDrives = [
                # CD-ROM drive from the host
                {
                  # Translates to -drive file=/dev/cdrom,media=cdrom in qemu
                  file = "/dev/cdrom";
                  media = "cdrom";
                }
              ];

              # Run a script inside a temporary virtual machine to set up the filesystems
              # I guess we can use vmTools.runInLinuxVM for this and attach the disk images as virtio drives (same as the target virtual machine)
              # We could use a more static configuration instead of a script, but it's hard to cover all cases (e.g. ZFS)
              # Using a script we can just run whatever commands we want
              filesystemsSetup = ''
                mkfs.fat -F 32 -n boot /dev/vda1 # or /dev/disk/by-partlabel/boot
                mkfs.ext4 -L root /dev/vda2 # or /dev/disk/by-partlabel/root
              '';

              # I don't know much about bootloaders
              # But I guess this is enough info to install one
              bootDevice = "/dev/vda";
              bootPartition = "/dev/vda1";
            };
          }
        ];
      };
    };
  };
}

And then simply run the virtual machine with:

nix run .#nixosConfigurations.foo.vm

This is not strictly necessary, but the reasons I moved the virtual machine configuration outside of the usual modules are:

  1. Virtual machine configuration is not part of the system configuration; it's on another layer.
  2. It makes it easier to expose the virtual machine as #nixosConfigurations.foo.vm instead of #nixosConfigurations.foo.config.system.build.vm. (And I think that should go deeper than just vm: why is system.build inside config at all? The word config implies you find static options there, the inputs to some process, while what is in system.build is obviously an output of that process.)

This would require a lot of changes, and I have probably overlooked many issues with the approach. But for sure, there needs to be a way to give users more flexibility with partitioning the disk(s).

For now, I just dropped the bootloader and started using vmVariant instead of vmVariantWithBootLoader. The system boots directly, and I can set up the partitions and filesystems in the initrd. I can't test the bootloader this way, but it's not much of a loss for me.
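
For reference, that is essentially the reproduction from above with the module moved under vmVariant and the EFI bits dropped (lightly abridged sketch):

virtualisation.vmVariant = {
  boot.initrd.postDeviceCommands = ''
    ${pkgs.parted}/bin/parted --script /dev/vda -- mklabel gpt mkpart root 0% 100%
    ${pkgs.e2fsprogs}/bin/mkfs.ext4 -L root /dev/disk/by-partlabel/root
  '';

  virtualisation = {
    fileSystems."/" = {
      device = "/dev/disk/by-label/root";
      fsType = "ext4";
      neededForBoot = true;
    };

    useDefaultFilesystems = false;
  };
};

which is then run via .#nixosConfigurations.foo.config.system.build.vm instead of vmWithBootLoader.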

RaitoBezarius commented 1 year ago

There's a lot to unpack in your message, apologies if I don't answer everything.

Firstly, I don't use flakes, and they are experimental, so your top-level API is really specific to nixosSystem, I believe; I think you'd have to report that part separately to get it this way.

Virtual machine configuration is not part of the system configuration; it's on another layer.

This is complicated, do we want to have makeItAVM :: NixOSConfig -> NixOSConfig or makeItAVM :: NixOSConfig -> VMConfig, etc.

Modelling this properly is still an open problem IMHO.

diskImages, hostDrives

These are list-driven APIs, and that's unfortunately a bad idea IME.

NixOS modules can perform spooky action at a distance: you cannot predict the order of your disks, and therefore you cannot have reliable tests (or anything else).

I am in the process of killing and deprecating such APIs; we should rather use attrset-driven APIs.

diskImages."root" = ... partitions."vda" = ...

Look at how https://github.com/nix-community/disko works for example.
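
To give a flavour, a disko layout references disks and partitions by name rather than by position (a sketch from memory; check the disko repository for the exact schema):

disko.devices.disk.main = {
  device = "/dev/vda";
  type = "disk";
  content = {
    type = "gpt";
    partitions = {
      ESP = {
        size = "512M";
        type = "EF00"; # EFI system partition
        content = {
          type = "filesystem";
          format = "vfat";
          mountpoint = "/boot";
        };
      };
      root = {
        size = "100%";
        content = {
          type = "filesystem";
          format = "ext4";
          mountpoint = "/";
        };
      };
    };
  };
};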

filesystemsSetup

Ideally, we should avoid string-driven APIs, because they do not carry any structured information that internal libraries could use to do smart things.

I'd prefer much more structured things: why not plug disko into filesystems and let it drive the partitioning/mounting correctly, for example?

For now, I just dropped the bootloader and started using vmVariant instead of vmVariantWithBootLoader. The system boots directly, and I can set up the partitions and filesystems in the initrd. I can't test the bootloader this way, but it's not much of a loss for me.

Did you try my PR? If that's enough for you, and you are not interested in it further, let's close this issue because it's not actionable anymore IMHO.

spietras commented 1 year ago

This is complicated, do we want to have makeItAVM :: NixOSConfig -> NixOSConfig or makeItAVM :: NixOSConfig -> VMConfig, etc.

Modelling this properly is still an open problem IMHO.

I agree. But the way it works now just feels kinda messy to me.

NixOS modules can perform spooky action at a distance: you cannot predict the order of your disks, and therefore you cannot have reliable tests (or anything else).

I am in the process of killing and deprecating such APIs; we should rather use attrset-driven APIs.

diskImages."root" = ... partitions."vda" = ...

I would be happy with whatever works. I just tried to design a data model that enforces the order of the items. When using virtio drives in QEMU, we can't specify what a device will be called; it's based on the order, so with ordered items we always know which device each one maps to (e.g. the first item will be /dev/vda). I guess that with attrsets there would be no way to enforce that a given disk ends up at /dev/vda; it would be random or based on some sorting of the keys.

I'd prefer much more structured things: why not plug disko into filesystems and let it drive the partitioning/mounting correctly, for example?

I have never used disko, but it seems to be able to deal with a lot of different configurations, so if it's somehow possible to use it to set up partitions and filesystems for a VM image, that would be awesome. However, I'm afraid it assumes the user knows beforehand what the devices are called (viewing them from inside a NixOS installer), and with virtual machines there are no pre-existing devices; they need to be created based on our configuration. And even if declarative management can deal with a lot of situations, there will always be some exceptions. So I think there should always be a way for users to do things manually, their own way. "Simple things should be simple, complex things should be possible."

Did you try my PR? If that's enough for you, and you are not interested in it further, let's close this issue because it's not actionable anymore IMHO.

I tried it, but it only boots me into the UEFI Interactive Shell with useDefaultPartitionTable = true. One way or another, I can't really use the default partition layout, because my target layout is different. And with useDefaultPartitionTable = false it's the same story as before.

I guess we can close this for now. I'm sure someone will pick it up in the future because it's very useful to be able to reproduce your custom system as close to 1:1 as possible in the virtual machine. But it seems that a lot needs to be changed, discussed and agreed upon to make it possible.

RaitoBezarius commented 1 year ago

This is complicated, do we want to have makeItAVM :: NixOSConfig -> NixOSConfig or makeItAVM :: NixOSConfig -> VMConfig, etc. Modelling this properly is still an open problem IMHO.

I agree. But the way it works now just feels kinda messy to me.

Unfortunately, untangling all of this requires time.

NixOS modules can perform spooky action at a distance: you cannot predict the order of your disks, and therefore you cannot have reliable tests (or anything else). I am in the process of killing and deprecating such APIs; we should rather use attrset-driven APIs. diskImages."root" = ... partitions."vda" = ...

I would be happy with whatever works. I just tried to design a data model that enforces the order of the items. When using virtio drives in QEMU, we can't specify what a device will be called; it's based on the order, so with ordered items we always know which device each one maps to (e.g. the first item will be /dev/vda). I guess that with attrsets there would be no way to enforce that a given disk ends up at /dev/vda; it would be random or based on some sorting of the keys.

It's not desirable to depend on the ordering of your disks: when you have multiple layers of abstraction, something can insert a disk before or after yours, and all of your tests depending on vda, vdb, vdc, vdd, vde need to be shifted by one disk or more.

Unless you have a compelling case, I have never found any use for ordering disks; you want to give them names and reference them through their names.

I'd prefer much more structured things: why not plug disko into filesystems and let it drive the partitioning/mounting correctly, for example?

I have never used disko, but it seems to be able to deal with a lot of different configurations, so if it's somehow possible to use it to set up partitions and filesystems for a VM image, that would be awesome. However, I'm afraid it assumes the user knows beforehand what the devices are called (viewing them from inside a NixOS installer), and with virtual machines there are no pre-existing devices; they need to be created based on our configuration. And even if declarative management can deal with a lot of situations, there will always be some exceptions. So I think there should always be a way for users to do things manually, their own way. "Simple things should be simple, complex things should be possible."

Did you try my PR? If that's enough for you, and you are not interested in it further, let's close this issue because it's not actionable anymore IMHO.

I tried it, but it only boots me into the UEFI Interactive Shell with useDefaultPartitionTable = true. One way or another, I can't really use the default partition layout, because my target layout is different. And with useDefaultPartitionTable = false it's the same story as before.

Can you give more detail about your use case? Your example code mentions a GPT partition table and UEFI boot. Are you trying to test a legacy protective MBR partition table, or something like that?

I guess we can close this for now. I'm sure someone will pick it up in the future because it's very useful to be able to reproduce your custom system as close to 1:1 as possible in the virtual machine. But it seems that a lot needs to be changed, discussed and agreed upon to make it possible.

I would appreciate any help anyway. :)

spietras commented 1 year ago

It's not desirable to depend on the ordering of your disks: when you have multiple layers of abstraction, something can insert a disk before or after yours, and all of your tests depending on vda, vdb, vdc, vdd, vde need to be shifted by one disk or more.

Unless you have a compelling case, I have never found any use for ordering disks; you want to give them names and reference them through their names.

I agree; it would be best to be able to name the disks explicitly. But if we are using QEMU for the virtual machine and creating virtual disks, then I think we have no way to enforce the device name; it's not our choice, it's QEMU's design. But I'm not 100% sure about that. If it is possible to bind the name, then I'm 100% in favour of relying only on names.

Can you give more detail about your use case? Your example code mentions a GPT partition table and UEFI boot. Are you trying to test a legacy protective MBR partition table, or something like that?

It's not that different from the default setup. I just want one disk with possibly more partitions. The simplest example would be a boot partition (for /boot), a root partition (for /), a home partition (for /home) and a swap partition (for swap). The reason I want more partitions inside the virtual machine is that I use more partitions in my physical machine setup, and I want to match the two configurations as closely as possible.
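
Spelled out in the style of my reproduction above, that layout would be roughly (an untested sketch):

boot.initrd.postDeviceCommands = ''
  ${pkgs.parted}/bin/parted --script /dev/vda -- \
    mklabel gpt \
    mkpart boot fat32 1MiB 512MiB set 1 esp on \
    mkpart root ext4 512MiB 60% \
    mkpart home ext4 60% 90% \
    mkpart swap linux-swap 90% 100%
  ${pkgs.dosfstools}/bin/mkfs.fat -F 32 -n BOOT /dev/disk/by-partlabel/boot
  ${pkgs.e2fsprogs}/bin/mkfs.ext4 -L root /dev/disk/by-partlabel/root
  ${pkgs.e2fsprogs}/bin/mkfs.ext4 -L home /dev/disk/by-partlabel/home
  ${pkgs.util-linux}/bin/mkswap -L swap /dev/disk/by-partlabel/swap
'';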

RaitoBezarius commented 1 year ago

It's not desirable to depend on the ordering of your disks: when you have multiple layers of abstraction, something can insert a disk before or after yours, and all of your tests depending on vda, vdb, vdc, vdd, vde need to be shifted by one disk or more. Unless you have a compelling case, I have never found any use for ordering disks; you want to give them names and reference them through their names.

I agree; it would be best to be able to name the disks explicitly. But if we are using QEMU for the virtual machine and creating virtual disks, then I think we have no way to enforce the device name; it's not our choice, it's QEMU's design. But I'm not 100% sure about that. If it is possible to bind the name, then I'm 100% in favour of relying only on names.

It's not a QEMU limitation: you never use disks before udev has kicked in, so you can always use udev to rename them. It's our choice not to have those APIs.
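
For example (an untested sketch): QEMU can stamp a serial onto a virtio drive, and the standard udev persistent-storage rules then expose it under a stable path, independent of probe order:

virtualisation.qemu.options = [
  # extra disk with an explicit serial; the file name is hypothetical
  "-drive file=extra.qcow2,if=virtio,serial=extradisk"
];
# the guest then sees /dev/disk/by-id/virtio-extradisk,
# whether the kernel enumerates the disk as vda or vdb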

Can you give more detail about your use case? Your example code mentions a GPT partition table and UEFI boot. Are you trying to test a legacy protective MBR partition table, or something like that?

It's not that different from the default setup. I just want one disk with possibly more partitions. The simplest example would be a boot partition (for /boot), a root partition (for /), a home partition (for /home) and a swap partition (for swap). The reason I want more partitions inside the virtual machine is that I use more partitions in my physical machine setup, and I want to match the two configurations as closely as possible.

You are mentioning partitions, but I still don't see the need for a partition table at the moment.

Note also that the "test VM with bootloader" is not devised to test your filesystem mapping; we cannot do this, because we do not have declarative information about your partitioning. Some projects existed in the past to achieve this (nixpart, for example).

Right now, the best bet is disko; if it becomes part of NixOS, it would be possible to invent a mode where we partition a layout, install NixOS into it, and test your VM completely.

In the meantime, this use case (testing your filesystem mapping) is unsupported, and writing custom code yourself to make it work is probably not a good idea: you are not testing the filesystem mapping, you are testing that you wrote the correct code to put your filesystems in the right situation, and nothing guarantees that when you reinstall the machine you will type the same commands.

Thus, disko seems the way forward for this use case.