Add kiwi plugin to support self-contained build processes

schaefi commented 4 years ago

Problem description

Customers who build appliances for their private applications don't want to share this information in a public service. This means those customers will not use the open build service to manage or maintain the image build process. We have seen several times that customers setup a build environment in their own network and infrastructure and base that on container technologies.

However container technologies are a questionable environment to build OS images. A builder that builds OS images requires root access as well as access to kernel filesystems and subsystems like lvm and more. There are many low level operations in the process of an image build and all those are shared with the host in a container. Container technologies try hard to hide this from the container environment which leads to all kinds of problems like no privileges to create device nodes, incompatible device mapper library between host and container root file system, etc etc. There are workarounds to many of those issues but not to all and after all it's pretty clear that container environments are designed to run user-space processes isolated from the host and not requiring access to any low level system components like kernel, device nodes, loops, filesystems, device-mapper and so on.

Solution Idea

The self-contained build requirement for many customers is something we want to provide a better solution for in kiwi. We plan to provide:

A kiwi plugin that provides a new command e.g boxed-build
The plugin is based on fast booting VMs which provides a real virtual system around the build process
The plugin requires kvm to run the VMs
The plugin will make use of virtio pci passthrough to allow to share storage from the host to the guest. in a fast way. This should also be the only part that is shared between host and guest.
The VMs to run the boxed build should be as light-weight as possible and will be automatically build with kiwi in obs. obs serves as the delivery agent and the boxed-build command downloads/updates the VM on request

A possible self-contained build command could look like this:

kiwi-ng boxed-build --vm fedora --storage /path/to/shared/data \
    system build --description fedora-desc --target-dir myimage

description and target-dir will be prefixed with the selected storage path
parameters to the kiwi call will be passed as cmdline arguments to the kvm call
kvm should only need the kernel, we plan to boot without initrd and without bootloader for performance reasons
the actual kiwi call in the VM will be done by a systemd one-shot unit that reads the cmdline and starts kiwi.
with the storage pass through we expect the results to be available on the host at the same time the data is produced inside of the VM

All of this open for a discussion as usual :)

schaefi commented 4 years ago

I have done a very short smoke test with the available kernels on suse that has the required drivers (virtio) built in. I was successful with the following approach:

<package name="kernel-kvmsmall"/>

kvm -kernel vmlinuz-5.3.18-lp152.9-kvmsmall -append "root=/dev/vda1 console=ttyS0" -drive file=LimeJeOS-Leap-15.2.x86_64-1.15.2.raw,if=virtio -serial stdio

bootup time is super fast ;)

davidcassany commented 4 years ago

I really like the card and share most of the explanation done here.

Solution Idea

The self-contained build requirement for many customers is something we want to provide a better solution for in kiwi. We plan to provide:
* A kiwi plugin that provides a new command e.g boxed-build
* The plugin is based on fast booting VMs which provides a real virtual system around the build process
* The plugin requires kvm to run the VMs

I'd also be curious to test around the use of nspawn for launching containers with full systemd support. So in other words, while I definitely agree that the only real solution nowadays is based on VMs, we should still keep the door opened for other sandboxing technologies (docker, podman, nspawn, kvm, katacontainers, rootless chroot, etc.)

* The plugin will make use of virtio pci passthrough to allow to share storage from the host to the guest. in a fast way. This should also be the only part that is shared between host and guest.

Yes I believe the target should be for 9p virtio as a stable solution and keep an eye to virtio-fs (not sure if fully available on Factory yet) which is supposed to be a great improvement in terms of performance.

* The VMs to run the boxed build should be as light-weight as possible and will be automatically build with kiwi in obs. obs serves as the delivery agent and the boxed-build command downloads/updates the VM on request

Sure for that I'd suggest to also keep an eye to the microvm machine profile. Available on Factory but I could not manage to properly use it (I did not invest much time on it).

A possible self-contained build command could look like this:
kiwi-ng boxed-build --vm fedora --storage /path/to/shared/data \
    system build --description fedora-desc --target-dir myimage

Looks good to me, I am still wondering it is valuable to embed somehow the build environment within the config XML somehow. But probably it does not belong there.

* description and target-dir will be prefixed with the selected storage path

I thought it different, the target-dir is shared to the build env and the result written there. This can be easily done with a system unit that automounts the shared location, I did it here

* kvm should only need the kernel, we plan to boot without initrd and without bootloader for performance reasons

This would be awesome, I was not aware of kernel-kvmsmall, is it including the initrd?

* with the storage pass through we expect the results to be available on the host at the same time the data is produced inside of the VM

Yes this is the ideal solution, however expect significant performance penalty on 9p virtio pass-through mounts.

I have done a very short smoke test with the available kernels on suse that has the required drivers (virtio) built in. I was successful with the following approach:
<package name="kernel-kvmsmall"/>
kvm -kernel vmlinuz-5.3.18-lp152.9-kvmsmall -append "root=/dev/vda1 console=ttyS0" -drive file=LimeJeOS-Leap-15.2.x86_64-1.15.2.raw,if=virtio -serial stdio
bootup time is super fast ;)

In the range of 4-6s? or even less? I am asking because booting regular dracut PXE image with many systemd serivices disabled turned to be quite fast, 4~6s, so that I stopped trying to run specific esotheric boots with super constrained qemu VMs configs with optimizations that I am not sure to what extend are valuable in comparison of having a well known regular OS system.

I'll try to share my PoC asap so we can at least test some bottle necks like filesystem sharing.

Also JFYI I managed to boot plain squashfs (no filesystem embedded!) images with qemu-kvm, this is possible. In my brain, one of the desired scenarios for the builder VM is a compressed filesystem (small download) that makes use of local dynamically created disk for overlaying to allow persistence (like we do for live systems). This is nice because makes the VM immutable and the size of it configurable at runtime with very little cost (qemu-img cretae -f raw overlay.img <size> is not really demanding in terms of computation and, eventually, only called for the first build).

Sorry probably I mostly ranted rather than providing structured feedback :pray:

schaefi commented 4 years ago

In the range of 4-6s? or even less?

In this range with potential to be faster. Here are my numbers:

localhost:~ # systemd-analyze
Startup finished in 519ms (kernel) + 4.209s (userspace) = 4.729s
multi-user.target reached after 4.164s in userspace

The VM build I used is here:

https://build.opensuse.org/package/show/Virtualization:Appliances:SelfContained/suse

The script to startup the VM is as follows. Note I did not take into account the one time action to download the image. I think the image size can be made smaller but the xz compressed result is also ok for a one time download imho.

#!/bin/bash

image=SUSE-Box.x86_64-1.42.1.raw

# kernel extraction... maybe this can be done simpler/different because requires root permissions
if [ ! -e "vmlinuz" ];then
    root=$(kpartx -a -s -v -r "${image}" | cut -f3 -d" ")
    mount /dev/mapper/"${root}" /mnt
    cp /mnt/boot/vmlinuz .
    loop=$(echo $root | cut -f1-2 -dp)
    umount /mnt
    kpartx -d /dev/"${loop}"
    losetup -d /dev/"${loop}"
fi

# startup, can be done as normal user if setup correctly on the host
qemu-kvm -m 4096 \
    -nographic \
    -kernel vmlinuz \
    -append "root=/dev/vda1 console=ttyS0 rd.plymouth=0 plymouth.enable=0" \
    -drive file="${image}",if=virtio,driver=raw

Actually I think that are quite good numbers and uses only standard components from the distro so far. Let's compare this with your PXE based approach. I think one of the first decisions should be on that bootup/image concept.

Thanks

schaefi commented 4 years ago

Also JFYI I managed to boot plain squashfs

In this case I think you need an initrd with support for squashfs. My current approach was not using an initrd and no bootloader since that's the time consuming parts on bootup. For the immutable aspect of the box file I suggest to use kvm's snapshot feature:

-snapshot
           Write to temporary files instead of disk image files. In this case,
           the raw disk image you use is not written back

When doing so this also increases the startup:

localhost:~ # systemd-analyze 
Startup finished in 520ms (kernel) + 3.281s (userspace) = 3.801s 
multi-user.target reached after 3.238s in userspace

schaefi commented 4 years ago

I'd like to keep that box VM as simple as ever possible. Right now it's just an ext2 fs on an msdos table with one root partition. I think we don't need a sophisticated rootfs setup via squashfs/btrfs or alike. Given we find a good way to also deliver the kernel file I think the plugin code will be relatively small and straight forward. My biggest concern is on the storage pass through and the performance gap with it.

schaefi commented 4 years ago

I did a few more tests and setup changes to let you easily test my image and boot times. So the build is done at:

https://download.opensuse.org/repositories/Virtualization:/Appliances:/SelfContained/images/

The image download size is currently ~190M

Here is my crappy load script:

#!/bin/bash

box=SUSE-Box.x86_64-1.42.1-Build*.install.tar
image=SUSE-Box.x86_64-1.42.1.xz

wget \
   --user-agent=Mozilla \
   --content-disposition \
   -E -r -c -nd --no-parent -e robots=off \
   -A ${box} \
https://download.opensuse.org/repositories/Virtualization:/Appliances:/SelfContained/images/

mkdir -p box
for archive in ${box}; do
    tar -C box -xf ${archive}
    break
done

xz -d box/"${image}"

All of the above can be done in python much nicer. Code needed to do this nicely also exists in kiwi partly for the solver classes. I think we also need to add an "up-to-date" check such that we only upload a new vm image if there are content changes. This is easy as we deliver the .packages file

Once the load is done the actual call is simple:

#!/bin/bash

image=box/SUSE-Box.x86_64-1.42.1
kernel=box/SUSE-Box.x86_64-1.42.1.kernel

qemu-kvm -m 4096 \
    -nographic \
    -kernel "${kernel}" \
    -append "root=/dev/vda1 console=ttyS0 rd.plymouth=0 plymouth.enable=0 kiwi=\"--version\"" \
    -drive file="${image}",if=virtio,driver=raw \
    -snapshot

The above call gives me startup times between 3-5 seconds which I think is acceptable. The origin of the image is not changed due to the snapshot parameter. You could have multiple calls of this type. I have tested 5 instances at the same time on the same machine and the startup times were always between 3-5s. So that was promising. I think some of the systemd services could also be switched off but I haven't looked deeply into it

I have also added the kiwi call. From the above append line you see we expect "kiwi-ng --version" to be called.

systemctl status kiwi

● kiwi.service - Start kiwi build process
     Loaded: loaded (/usr/lib/systemd/system/kiwi.service; enabled; vendor prese
t: disabled)
     Active: inactive (dead) since Tue 2020-03-31 20:47:17 UTC; 59s ago
    Process: 245 ExecStart=/bin/run_kiwi (code=exited, status=0/SUCCESS)
   Main PID: 245 (code=exited, status=0/SUCCESS)

Mar 31 20:47:16 localhost systemd[1]: Started Start kiwi build process.
Mar 31 20:47:17 localhost run_kiwi[262]: KIWI (next generation) version 9.20.4
Mar 31 20:47:17 localhost systemd[1]: kiwi.service: Succeeded.

So far the contribution from my side. Let me know if this makes sense

Thanks

schaefi commented 4 years ago

I thought it different, the target-dir is shared to the build env and the result written there. This can be easily done with a system unit that automounts the shared location, I did it here

I looked at this part:

[Mount]
What=shared_volume
Where=/target_dir
Type=9p
Options=trans=virtio,version=9p2000.L

This is interesting and would eliminate the user to select a target. On The other hand we make that path /target_dir completely static and immutable inside of the binary image blob. Not sure if we want that... ?

schaefi commented 4 years ago

Did further tests with networking enabled

systemd-analyze 
Startup finished in 531ms (kernel) + 20.714s (userspace) = 21.246s 
multi-user.target reached after 20.676s in userspace

So this is the time it spent working on the dhcp request. Which brings me to the next part of the boxed build. We require the network to be up and running.

I'm not aware of any method to setup the guest network that would not require a host setup too. Personally I think bridged networking has a low setup effort on the host and can be pre-configured in the VMs easily. The requirement here would be:

outgoing interface on the host is a bridge
there is some dhcp on that interface such that the guest gets an ip

I know there are many other possibilities to the network question. I'm open to discuss other opportunities because I'm also not the expert in that field.

schaefi commented 4 years ago

I thought it different, the target-dir is shared to the build env and the result written there. This can be easily done with a system unit that automounts the shared location, I did it here

I looked at this part:
[Mount]
What=shared_volume
Where=/target_dir
Type=9p
Options=trans=virtio,version=9p2000.L
This is interesting and would eliminate the user to select a target. On The other hand we make that path /target_dir completely static and immutable inside of the binary image blob. Not sure if we want that... ?

Reading more about that I now better understand it. Yes this is great, let's do it that way

schaefi commented 4 years ago

9p works nicely

schaefi commented 4 years ago

Test on 9p with high I/O load showed many problems. I suggest to use the shared folder only to present the results. I'll do some further tests. kvm's snapshot feature should be helpful here

schaefi commented 4 years ago

We kicked off implementation of the project as plugin here:

https://github.com/OSInside/kiwi-boxed-plugin/pulls

All further conversations, issues, implementation regarding the boxbuild plugin will happen there. Tests on the concept were successful. If you want to take a look checkout the poc scripting in https://github.com/OSInside/kiwi-boxed-plugin/tree/master/poc

All boxes will be delivered as a service by us in the first version here:

https://build.opensuse.org/package/show/Virtualization:Appliances:SelfContained

Thanks

OSInside / kiwi