One idea:
The following pieces would be bundled with the northstar runtime on the trusted host filesystem (so after the first container start they should sit in the host page cache, and further container starts need no disk I/O on the host filesystem):
The NPK container filesystem is still mounted as a dm-verity device by the north runtime as today, but it is then not mounted into the host filesystem; instead it is provided as a virtual disk to the KVM guest.
The firecracker "common base image" just has a simple initrd ramdisk which contains a tiny init launcher, which starts the actual application from the attached virtual disk (without any more protection layers like minijail - we're within KVM anyway). Before launching the application the init needs to do a few things like setup mounts (/dev, /etc, /tmp, /proc, ...), setup the network interface, setup environment variables. To perform the application launch correctly, it would be necessary to also copy the manifest YAML into the container filesystem image by the build system.
We could even support read-only resource containers: if we provide every resource container image (its dm-verity block device) as an additional virtual block device to the VM, the init system can mount them all inside the VM at the expected mount locations according to the manifest (as long as the device numbering works out). A rough sketch of such an init follows below.
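To make the "common initrd + tiny init" idea concrete, here is a minimal, hedged sketch of such a guest init in Rust (using the nix, serde and serde_yaml crates). The device names /dev/vda and /dev/vdb, the /app mount point, the squashfs filesystem type and the exact manifest fields are assumptions for illustration only; network setup and error handling niceties are left out:

```rust
use nix::mount::{mount, MsFlags};
use std::collections::HashMap;
use std::os::unix::process::CommandExt;
use std::process::Command;

// Assumed shape of the relevant launch settings in the NPK manifest.
#[derive(serde::Deserialize)]
struct Manifest {
    init: String,
    #[serde(default)]
    args: Vec<String>,
    #[serde(default)]
    env: HashMap<String, String>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Pseudo filesystems the application will expect (mount points are part of the initrd).
    mount(Some("proc"), "/proc", Some("proc"), MsFlags::empty(), None::<&str>)?;
    mount(Some("devtmpfs"), "/dev", Some("devtmpfs"), MsFlags::empty(), None::<&str>)?;
    mount(Some("tmpfs"), "/tmp", Some("tmpfs"), MsFlags::empty(), None::<&str>)?;

    // The NPK filesystem image attached by the runtime as a virtio disk.
    mount(Some("/dev/vda"), "/app", Some("squashfs"), MsFlags::MS_RDONLY, None::<&str>)?;

    // A read-only resource container attached as a further virtio disk (hypothetical mount point).
    mount(Some("/dev/vdb"), "/resources", Some("squashfs"), MsFlags::MS_RDONLY, None::<&str>)?;

    // Launch settings from the manifest that the build system copied into the image.
    let manifest: Manifest =
        serde_yaml::from_str(&std::fs::read_to_string("/app/manifest.yaml")?)?;

    // exec() replaces this init with the actual application.
    Err(Command::new(&manifest.init)
        .args(&manifest.args)
        .envs(&manifest.env)
        .exec()
        .into())
}
```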
One rather orthogonal issue is how to structure the virtual network for firecracker containers.
@bzld: I don't think I understand this point:

> The NPK container filesystem is still mounted as a dm-verity device by the north runtime as today, but it is then not mounted into the host filesystem; instead it is provided as a virtual disk to the KVM guest.
Would the goal be to have the container's directory structure appear in the guest as a mounted fs? This should be possible; I don't know the semantics of how to specify additional disks to the guest, but I believe it is in the design.
> To perform the application launch correctly, the build system would also need to copy the manifest YAML into the container filesystem image.
I may be missing something, but what is the point of this?
> One rather orthogonal issue is how to structure the virtual network for firecracker containers.
We need to figure out how to handle networking. The current hack is to create the required tap device (KVM uses tap devices) with an IP address of 172.16.<something>.<northstar_index>.
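For illustration, a sketch of that host-side hack in Rust, shelling out to ip(8). The device naming and the exact 172.16.x.y address scheme are assumptions, not the actual implementation:

```rust
use std::process::Command;

// Create a tap device for one VM and give it an address derived from the
// container's northstar index (hypothetical naming/addressing scheme).
fn setup_tap(northstar_index: u8) -> std::io::Result<()> {
    let tap = format!("tap-north{northstar_index}");
    let addr = format!("172.16.0.{northstar_index}/24");
    let steps: [Vec<&str>; 3] = [
        vec!["tuntap", "add", "dev", tap.as_str(), "mode", "tap"],
        vec!["addr", "add", addr.as_str(), "dev", tap.as_str()],
        vec!["link", "set", tap.as_str(), "up"],
    ];
    for args in &steps {
        let status = Command::new("ip").args(args).status()?;
        if !status.success() {
            return Err(std::io::Error::new(
                std::io::ErrorKind::Other,
                format!("ip {} failed", args.join(" ")),
            ));
        }
    }
    Ok(())
}
```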
There is a related issue https://github.com/esrlabs/northstar/issues/5 regarding network namespaces that we also need to address.
It is trivial to specify additional disks; they show up in the guest as /dev/vdb, /dev/vdc, etc. They must be files with a recognisable filesystem on them. And yes, changes that the guest makes to the filesystem are then visible on the host. And no, this is not a mechanism for data sharing. :-)
> Would the goal be to have the container's directory structure appear in the guest as a mounted fs? This should be possible; I don't know the semantics of how to specify additional disks to the guest, but I believe it is in the design.
Yes, or to be more precise: have the NPK filesystem image appear as a virtio block device, which the guest kernel then mounts. So we still do the loopback mount and the dm-verity mount on the host, but instead of mounting the dm-verity block device into the host filesystem tree, we provide the block device as a virtual disk to the guest VM. (The motivation for this "common initrd + NPK virtual disk" split is to spare NPK creators from having to include a firecracker-guest-kernel-compatible init system in the container filesystem.)
> It is trivial to specify additional disks; they show up in the guest as /dev/vdb, /dev/vdc, etc. They must be files with a recognisable filesystem on them.
I'd hope they don't need to be regular "files" on the host, but can be a dm-verity block device as well? Then that would fit our existing logic nicely.
> And yes, changes that the guest makes to the filesystem are then visible on the host. And no, this is not a mechanism for data sharing.
Yes, that doesn't give us sharing, but it might be the simplest way to offer (non-shared) data persistence to guest containers - each container would have just a single file on the host for its data. Although we would probably need to set a fixed maximum size upfront? Or can we use a sparse-file trick here?
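A sketch of the sparse-file idea: the backing file gets a fixed maximum size up front but occupies almost no space on the host until the guest actually writes blocks. The path and size here are made up, and the disk still needs a mkfs (by the host once, or by the guest) before first use:

```rust
use std::fs::OpenOptions;

// Create a sparse backing file for a container's private data disk.
fn create_data_disk(path: &str, max_size: u64) -> std::io::Result<()> {
    let file = OpenOptions::new().write(true).create(true).open(path)?;
    // On ext4/xfs/btrfs this only records the length; it does not allocate blocks.
    file.set_len(max_size)?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    // e.g. one 512 MiB data disk per container (hypothetical path).
    create_data_disk("/data/northstar/hello.img", 512 * 1024 * 1024)
}
```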
> To perform the application launch correctly, the build system would also need to copy the manifest YAML into the container filesystem image.

> I may be missing something, but what is the point of this?
In our NPK manifest we can have various settings for the application launch: environment variables and command arguments. If we want to be able to launch NPKs unmodified within firecracker, the guest init system needs to get them from somewhere within the VM to prepare the exec(). Or are there other/simpler options? (Kernel command line? Magic virtual metadata filesystems?)
> So we still do the loopback mount and the dm-verity mount on the host, but instead of mounting the dm-verity block device into the host filesystem tree, we provide the block device as a virtual disk to the guest VM.

Ah, I think I get it. So the 'standard' initrd that does the init stuff would then mount the container FS. The goal would be that the app developer would not need to care whether they were running in a normal container or in a VM.
Let me see if it's possible to pass the image to the VM where it can be mounted.
> Although we would probably need to set a fixed maximum size upfront? Or can we use a sparse-file trick here?

Yes, sparse files work fine.
> If we want to be able to launch NPKs unmodified within firecracker, the guest init system needs to get them from somewhere within the VM to prepare the exec(). Or are there other/simpler options? (Kernel command line? Magic virtual metadata filesystems?)

There is a feature called MMDS (microVM Metadata Service) that I have not explored; it is a mechanism to pass unstructured data between host and guest. It is put/get based, multiplexed over the virtio socket from host to guest. That might work.
Otherwise the only other way is to pass environment variables via key=value on the kernel command line. It works, but obviously limits how much data one can pass. Luckily, it is trivial to add args to the command line when instantiating the VM.
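As an illustration of the kernel-command-line option: the host could append entries like "env.RUST_LOG=debug" to the boot args, and the guest init would turn them back into environment variables before exec'ing the application. The "env." prefix is an invented convention for this sketch, not an existing northstar or firecracker feature:

```rust
use std::fs;

// Recover KEY=VALUE pairs that the host encoded as "env.KEY=VALUE" boot args.
fn env_from_cmdline() -> std::io::Result<Vec<(String, String)>> {
    let cmdline = fs::read_to_string("/proc/cmdline")?;
    Ok(cmdline
        .split_whitespace()
        .filter_map(|tok| tok.strip_prefix("env."))
        .filter_map(|kv| {
            let (key, value) = kv.split_once('=')?;
            Some((key.to_string(), value.to_string()))
        })
        .collect())
}
```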
One more responsibility of that 'standard' init system might be to communicate with the host over vsock (vsock details: https://github.com/firecracker-microvm/firecracker/blob/master/docs/vsock.md).
Notes from meeting on 28 October (also posted to slack)
Path for integrating Firecracker (FC) into North -- aka NorthCracker(tm)

1.) Filesystem image handling: verify if FC can pass through a dm-verity block device to the VM. Goal is to make the container FS image available to the VM (see the config sketch below).
2.) Vsock: develop a protocol for passing configuration information from host to guest over vsock:
   - container manifest
   - network setup information
   - any additional setup information (environment variables, logging info, ???)
3.) Network setup: provide static network configuration in the container manifest. This implies that the entity that is integrating/deploying containers to a target must be responsible for ensuring addresses are valid, there are no collisions, and containers/VMs are reachable only as intended:
   - IPv4 address of the container/VM
   - IPv4 netmask
   - IPv4 gateway
   - IPv4 address of host?
   - bridge name
   - device type (tap or veth) -- device name?
4.) Config handler: develop a North component for sending configuration information to the VM:
   - use vsock from step 2
   - receive status and notification events from the app in the VM (failure, crash, etc. TBD) and do something useful
5.) Logging: develop logging infrastructure to receive logs from the VM over vsock. The VM must not use serial output for logging; i.e., no stdout/stderr.
6.) VM init: package a generic kernel and initrd and make the images available for deployment with NorthCracker.
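As a hedged sketch of item 1.), a Firecracker config file could be generated per container whose "drives" entry hands the dm-verity mapped NPK device to the VM as a read-only virtio disk (whether FC accepts a device-mapper path there is exactly what has to be verified). The field names follow Firecracker's documented JSON config format and should be checked against the firecracker version in use; all paths, IDs and sizes below are made up for illustration:

```rust
use std::fs;

const FC_CONFIG_TEMPLATE: &str = r#"{
  "boot-source": {
    "kernel_image_path": "/northstar/firecracker/vmlinux",
    "initrd_path": "/northstar/firecracker/initrd.img",
    "boot_args": "reboot=k panic=1 pci=off"
  },
  "drives": [
    {
      "drive_id": "npk",
      "path_on_host": "/dev/mapper/north_CONTAINER_verity",
      "is_root_device": false,
      "is_read_only": true
    }
  ],
  "machine-config": { "vcpu_count": 1, "mem_size_mib": 128 }
}"#;

// Write a per-container config file that firecracker can be started with
// (e.g. via its --config-file option).
fn write_fc_config(container: &str) -> std::io::Result<()> {
    let config = FC_CONFIG_TEMPLATE.replace("CONTAINER", container);
    fs::write(format!("/run/northstar/{container}-fc.json"), config)
}
```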
Sub tasks:
- Provide kernel and initrd for firecracker-container initialisation #173
- Provide static network configuration in container manifest #172
- Create protocol for passing configuration information from host to guest #171
- Support using northstar npk filesystem for firecracker #170
- create and teardown default network bridge at startup/exit #181
- Need target shim/stub container for firecracker #204
Possible control flow: the manifest has a tag that indicates a VM instead of an nstar container.
Setup:
- Create firecracker config file from manifest (primarily drive configuration)
- Create tap device from manifest network config params using the default bridge (bridge created at boot time)
- Launch firecracker with the config file
- Launch vsock servers (socat processes; a host-side sketch follows below):
  - manifest server
  - log server
  - VM application status server
In VM:
- pull manifest from vsock using the port specified on the kernel command line
- configure network from manifest
- configure environment variables
- launch app as per manifest in background, wait for exit
- push return code to host status server over vsock
Teardown:
- After firecracker exit, retrieve status from status server
- tear down tap device
- exit with the code from the status server
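Purely as an illustration of what the host-side "manifest server" from the flow above could look like, here is a sketch in Rust instead of socat. It assumes the vsock device is backed by a Unix socket such as /run/northstar/hello/v.sock and that a guest-initiated connection to vsock port 9000 surfaces on the host as a connection to "<uds_path>_9000", as described in the firecracker vsock documentation linked earlier (worth verifying against the firecracker version in use); paths and port number are invented:

```rust
use std::io::Write;
use std::os::unix::net::UnixListener;

// Serve the manifest once to the first guest connection on <uds_path>_<port>.
fn serve_manifest(uds_path: &str, port: u32, manifest_yaml: &[u8]) -> std::io::Result<()> {
    let listener = UnixListener::bind(format!("{uds_path}_{port}"))?;
    let (mut stream, _addr) = listener.accept()?;
    stream.write_all(manifest_yaml)?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    let manifest = std::fs::read("/run/northstar/hello/manifest.yaml")?;
    serve_manifest("/run/northstar/hello/v.sock", 9000, &manifest)
}
```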
I think we need to consider if the goal of transparency is actually reasonable.
What I mean is that the VM environment will be different than the environment found on the host, and the container writer must be aware of the differences.
I do not think it makes sense to make it totally transparent to the container writer if the container is running on the host or in a VM. I think the container writer must know and understand the differences.
As just a small example of the differences:
- startup time will be longer
- latencies to devices will be longer
- the library environment will most likely be different (MUSL in the VM vs. most likely Bionic on the host)
- available devices will be different
I think the best thing would be to make system integrators aware of the consequences, tradeoffs, and differences involved in running a container in different environments. Maybe it will even require some changes, but if running north containers in firecracker is to be a convenient option, it should be relatively straightforward and primarily a question of what degree of isolation is really required. The fact that additional isolation incurs e.g. runtime costs is very understandable.
I think the goal would be to allow containers to run within a firecracker VM with no changes to their filesystem image, and with minimal constraints. One thing that cannot work, due to the nature of firecracker (which only supports block-device disks), is bind-mounts from the host filesystem, or writable bind-mounts shared between containers.