Open cgwalters opened 4 years ago
To be clear, the result of this could be some documentation; or it could be code. I think if we do nothing though, people are going to do manifestly bad things.
Yet another reason I'd like these binaries in /run is that eventually I'd like to have the binaries that come with the host be signed, and to do something like enforce that any privileged code executed from persistent storage comes from signed binaries.
See also https://issues.redhat.com/browse/SDN-695 for some older thoughts specifically on the deploying-CNI-plugins angle. We definitely need something simpler than what we're doing now.
oh... this is an FCOS bug and that link probably isn't public. Well, the suggestion was to have some sort of config somewhere with something like:

```yaml
cniPlugins:
  - name: cni-default
    sourceImage: quay.io/openshift/origin-container-networking-plugins:4.3
    rhel7Plugins:
      - /usr/src/plugins/rhel7/bin/*
    rhel8Plugins:
      - /usr/src/plugins/rhel8/bin/*
    plugins:
      - /usr/src/plugins/bin/*
  - name: openshift-sdn
    sourceImage: quay.io/openshift/origin-sdn:4.3
    plugins:
      - /opt/cni/bin/openshift-sdn
  - name: multus
    ...
```
and then something would know how to pull the binaries out of those images and ensure they got installed correctly.
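The "something that would know how to pull the binaries out" could be fairly small; a hypothetical sketch using podman (image name and paths taken from the config sketch above, destination directory illustrative):

```sh
#!/bin/sh
# Hypothetical extraction step: create a (never-started) container from the
# plugin image, copy its binaries onto the host, then discard the container.
set -eu
image=quay.io/openshift/origin-container-networking-plugins:4.3
ctr=$(podman create "$image")
mkdir -p /run/cni/bin
# Trailing /. copies the directory's contents rather than the directory itself.
podman cp "$ctr":/usr/src/plugins/bin/. /run/cni/bin/
podman rm "$ctr"
```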
re /run/bin, there is trickiness with how CNI works (which is terrible), but we may need to have the multus binary be in its own directory without any other binaries in it (to avoid confusing cri-o about when the CNI plugin is ready), and we need all of the other CNI plugins to be in a directory without any non-CNI-plugin binaries (to avoid privilege escalation via multus). So anyway, we may need /run/multus/bin and /run/cni/bin, or /run/bin/multus/ and /run/bin/cni/.

Yeah, /run/multus/bin is fine too.
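Assuming that split layout, the corresponding cri-o side would point only at the multus directory (a sketch of a crio.conf fragment; the key lives in the `[crio.network]` table):

```toml
# Sketch: cri-o only discovers multus; multus in turn delegates to the
# plugins in /run/cni/bin, which contains nothing but CNI plugins.
[crio.network]
plugin_dirs = ["/run/multus/bin"]
```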
One thing we could do to generalize this is to have first-class support in ostree (and rpm-ostree) for transient package installs; this is strongly related to live updates except here we'd want the package install to skip the "persistence" step.
On the libostree side it'd be a bit like ostree admin unlock, except we'd still keep /usr as a read-only bind mount. On the rpm-ostree side we'd need to more carefully keep track of the transient vs dynamic state; it would likely involve two "origin" files, one in /run.
This would allow e.g. CNI to use what appears to be /usr/bin/cni or whatever, except it'd actually be on a tmpfs and go away on reboot.
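Mechanically, the transient-writable /usr could look something like the following sketch (not what ostree actually ships; requires root, and the overlay's upper directory lives on /run, a tmpfs, so everything written to /usr disappears on reboot):

```sh
#!/bin/sh
# Sketch: make /usr transiently writable via an overlayfs whose upper layer
# is on /run, roughly in the spirit of `ostree admin unlock`.
set -eu
mkdir -p /run/usr-overlay/upper /run/usr-overlay/work
mount -t overlay overlay \
  -o lowerdir=/usr,upperdir=/run/usr-overlay/upper,workdir=/run/usr-overlay/work \
  /usr
# Anything installed now, e.g. via `rpm -Uvh foo.rpm`, lands in the upper
# layer and is gone after reboot.
```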
mrunalp suggested using podman mount for these cases, which would be really nice, except we'd need to figure out SELinux labeling. Maybe we could force podman to mount everything as bin_t or so and just assume they're all binaries.
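That idea might look roughly like this (hypothetical image name, paths, and binary; chcon here is a stand-in for whatever labeling podman would do natively):

```sh
#!/bin/sh
# Sketch: mount the image's rootfs on the host and relabel its binaries as
# bin_t so the host can execute them directly from the mount.
set -eu
image=quay.io/example/host-binaries:latest   # hypothetical image
mnt=$(podman image mount "$image")
chcon -R -t bin_t "$mnt/usr/local/bin"       # assumes SELinux is enabled
"$mnt/usr/local/bin/some-tool" --version     # hypothetical binary
podman image unmount "$image"
```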
Re: the config file, that file has to be dynamic. Who generates it?
Yeah, I don't think it could actually be a single config file. It would be more like, the CNO takes every object of a certain CRD type and combines them together to generate the list of plugins to install.
In particular, one of the use cases was making it easier for third-party network plugins to install their CNI binaries without needing to know which directories we've configured cri-o to use. In that case, no OpenShift component would know in advance which plugins need to be installed, so there couldn't be a single config file generated by an OpenShift component.
(The original idea partly quoted above was that everyone would just add elements to an array in the network.config.openshift.io object, but that would be hard to coordinate well, and we don't want the admin to be able to accidentally break things by removing the wrong elements anyway.)
Just to make sure I understand the idea correctly:

- A privileged container mounts the host's /run as /host/run, and then copies some content there, e.g. /host/run/some-product/bin/some-product
- Something on the host then runs /run/some-product/bin/some-product and expects to locally find a suitable set of libraries, etc.

I may be wrong, but I think that things would have increased chances of working right (e.g. the RHEL7 vs RHEL8 openssl binary example) if you also added the libraries, i.e. if you also had a /run/some-product/lib that would be added to the LD_LIBRARY_PATH before running the binary.
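As a sketch of that suggestion (all paths hypothetical), a small wrapper could prepend the product's bundled library directory before launching its binary:

```sh
# Sketch: run a binary from <root>/bin with <root>/lib taking precedence
# in the dynamic linker's search path. Paths are illustrative.
run_with_bundled_libs() {
  root="$1"; bin="$2"; shift 2
  env LD_LIBRARY_PATH="${root}/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" \
    "${root}/bin/${bin}" "$@"
}

# e.g.: run_with_bundled_libs /run/some-product some-product --help
```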
Going further, and staying in a "container" spirit, I believe that you would practically need to do a chroot to /run/some-product and run the binary from there. Of course, that means you need to expose the relevant files in that chroot. But that means you would not need to copy to some other host location, and it would solve the issue of removing the files when you remove the container.
So in the end, I think that what confuses me is:

- Either you want the content to be transient and removed if the container is removed, in which case it looks like running within the container environment itself, but possibly without dropping capabilities, etc., might be the safest route wrt. existing container build and testing practice (i.e. hiding details of the host OS, like which particular version of the openssl libraries is there);
- Or you want the content to be persistent, which means the container is used to install something on the host, in which case you specifically don't want the files to be removed if the container is removed. However, I don't know how you would remove such files, except by putting them in a transient location (which could be /run) and cleaning up only on reboot.
So maybe, to help me understand better: how do you see things running?
> Going further, and staying in a "container" spirit, I believe that you would practically need to do a chroot to /run/some-product and run the binary from there.
No. This model is intended for binaries that need to execute in the host's mount namespace or otherwise interact with things that run on the host (for example, kubelet). No one should use chroot() in 2020; use a real container runtime if your binaries can be containerized.
Basically, this is not 'in a "container" spirit'; this is abusing containers as a distribution mechanism for host binaries.
> I may be wrong, but I think that things would have increased chances of working right (e.g. the RHEL7 vs RHEL8 openssl binary example)
You're right about this; basically, what one needs to do here to support multiple operating systems is to have e.g. rhel7/ and rhel8/ subdirectories in the container, inspect the host's /etc/os-release, and copy out the right binaries. That's what we ended up doing for the SDN. It definitely gets more complicated if you have library dependencies not on the host, but the hope is that one avoids that. But using LD_LIBRARY_PATH would work too.
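The inspect-and-select step can be sketched like this (the rhel7/ and rhel8/ layout and the /host mount point are illustrative; the function only decides which directory to copy from):

```sh
# Sketch: pick the rhel7/ or rhel8/ binary directory in the container by
# reading VERSION_ID from the host's /etc/os-release.
select_host_plugins() {
  host="${1:-/host}"
  version_id=$(. "${host}/etc/os-release" && echo "${VERSION_ID}")
  case "${version_id%%.*}" in
    7) echo /usr/src/plugins/rhel7/bin ;;
    8) echo /usr/src/plugins/rhel8/bin ;;
    *) echo "unsupported host release: ${version_id}" >&2; return 1 ;;
  esac
}

# e.g.: cp "$(select_host_plugins)"/* /host/run/cni/bin/
```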
That said, see https://github.com/coreos/fedora-coreos-tracker/issues/354#issuecomment-591417441 for the "use RPMs" path.
This also relates to https://github.com/coreos/fedora-coreos-tracker/issues/401 which is more about persistent extensions - things one needs when the host boots, before kubelet, etc.
We emphasize containers, but there are real needs to execute code on the host. Package layering is one approach; it has some advantages and major disadvantages.
In OpenShift today, we have a pattern of privileged containers that lay down and execute some binaries on the host. This avoids the reboots inherent in layering (also doesn't require RPMs).
Recently, however, this pattern broke: running binaries built for e.g. RHEL7 on a RHEL8 host can fail if they link to e.g. openssl. The general best practice here is that the binaries need to target the same userspace as the host. With e.g. statically linked Go/Rust-type code one can avoid most issues, but not all (and you really want to dynamically link openssl).
See this Dockerfile which pairs with this PR.
Further, I think we should move these binaries into e.g. /run/bin or so; if there's a higher-level process that pulls the containers to a node on reboot (e.g. a systemd unit created via Ignition, or in the OpenShift case the kubelet), then having the binaries in /run helps ensure that if e.g. the container is removed, at least the binaries will go away on reboot.

That said... we may consider even shipping something like a /host/usr/bin/coreos-host-overlay install <name> <rootfs> tool that assumes the host rootfs is mounted at /host and handles the case where e.g. a container delivering host binaries is upgraded (or removed) before reboot.

(This problem set quickly generalizes, of course, to something like "transient RPMs".)