Open cgwalters opened 4 years ago
To be clear, the result of this could be some documentation; or it could be code. I think if we do nothing though, people are going to do manifestly bad things.
Yet another reason I'd like these binaries in /run is that eventually I'd like to have the binaries that come with the host be signed, and to do something like enforce that any privileged code executed from persistent storage comes from signed binaries.
See also https://issues.redhat.com/browse/SDN-695 for some older thoughts specifically on the deploying-CNI-plugins angle. We definitely need something simpler than what we're doing now.
oh... this is an FCOS bug and that link probably isn't public. Well, the suggestion was to have some sort of config somewhere with something like:

```yaml
cniPlugins:
  - name: cni-default
    sourceImage: quay.io/openshift/origin-container-networking-plugins:4.3
    rhel7Plugins:
      - /usr/src/plugins/rhel7/bin/*
    rhel8Plugins:
      - /usr/src/plugins/rhel8/bin/*
    plugins:
      - /usr/src/plugins/bin/*
  - name: openshift-sdn
    sourceImage: quay.io/openshift/origin-sdn:4.3
    plugins:
      - /opt/cni/bin/openshift-sdn
  - name: multus
    ...
```
and then something would know how to pull the binaries out of those images and ensure they got installed correctly.
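The "something that would know how to pull the binaries out" could be fairly small; a hypothetical sketch using podman (image name and paths taken from the config sketch above, destination directory illustrative):

```sh
#!/bin/sh
# Hypothetical extraction step: create a (never-started) container from the
# plugin image, copy its binaries onto the host, then discard the container.
set -eu
image=quay.io/openshift/origin-container-networking-plugins:4.3
ctr=$(podman create "$image")
mkdir -p /run/cni/bin
# Trailing /. copies the directory's contents rather than the directory itself.
podman cp "$ctr":/usr/src/plugins/bin/. /run/cni/bin/
podman rm "$ctr"
```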
re /run/bin, there is trickiness with how CNI works (which is terrible), but we may need to have the multus binary be in its own directory without any other binaries in it (to avoid confusing cri-o about when the CNI plugin is ready), and we need all of the other CNI plugins to be in a directory without any non-CNI-plugin binaries (to avoid privilege escalation via multus). So anyway, we may need /run/multus/bin and /run/cni/bin, or /run/bin/multus/ and /run/bin/cni/.

Yeah, /run/multus/bin is fine too.
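Assuming that split layout, the corresponding cri-o side would point only at the multus directory (a sketch of a crio.conf fragment; the key lives in the `[crio.network]` table):

```toml
# Sketch: cri-o only discovers multus; multus in turn delegates to the
# plugins in /run/cni/bin, which contains nothing but CNI plugins.
[crio.network]
plugin_dirs = ["/run/multus/bin"]
```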
One thing we could do to generalize this is to have first-class support in ostree (and rpm-ostree) for transient package installs; this is strongly related to live updates except here we'd want the package install to skip the "persistence" step.
On the libostree side it'd be a bit like ostree admin unlock, except we'd still keep /usr as a read-only bind mount. On the rpm-ostree side we'd need to more carefully keep track of the transient vs dynamic state; it would likely involve two "origin" files, one in /run.
This would allow e.g. CNI to use what appears to be /usr/bin/cni or whatever, except it'd actually be on a tmpfs and go away on reboot.
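Mechanically, the transient-writable /usr could look something like the following sketch (not what ostree actually ships; requires root, and the overlay's upper directory lives on /run, a tmpfs, so everything written to /usr disappears on reboot):

```sh
#!/bin/sh
# Sketch: make /usr transiently writable via an overlayfs whose upper layer
# is on /run, roughly in the spirit of `ostree admin unlock`.
set -eu
mkdir -p /run/usr-overlay/upper /run/usr-overlay/work
mount -t overlay overlay \
  -o lowerdir=/usr,upperdir=/run/usr-overlay/upper,workdir=/run/usr-overlay/work \
  /usr
# Anything installed now, e.g. via `rpm -Uvh foo.rpm`, lands in the upper
# layer and is gone after reboot.
```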
mrunalp suggested using podman mount for these cases, which would be really nice, except we'd need to figure out SELinux labeling. Maybe we could force podman to mount everything as bin_t or so and just assume they're all binaries.
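That idea might look roughly like this (hypothetical image name, paths, and binary; chcon here is a stand-in for whatever labeling podman would do natively):

```sh
#!/bin/sh
# Sketch: mount the image's rootfs on the host and relabel its binaries as
# bin_t so the host can execute them directly from the mount.
set -eu
image=quay.io/example/host-binaries:latest   # hypothetical image
mnt=$(podman image mount "$image")
chcon -R -t bin_t "$mnt/usr/local/bin"       # assumes SELinux is enabled
"$mnt/usr/local/bin/some-tool" --version     # hypothetical binary
podman image unmount "$image"
```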
Re: the config file, that file has to be dynamic. Who generates it?
Yeah, I don't think it could actually be a single config file. It would be more like, the CNO takes every object of a certain CRD type and combines them together to generate the list of plugins to install.
In particular, one of the use cases was making it easier for third-party network plugins to install their CNI binaries without needing to know which directories we've configured cri-o to use. In that case, no OpenShift component would know in advance which plugins need to be installed, so there couldn't be a single config file generated by an OpenShift component.
(The original idea partly quoted above was that everyone would just add elements to an array in the network.config.openshift.io object, but that would be hard to coordinate well, and we don't want the admin to be able to accidentally break things by removing the wrong elements anyway.)
Just to make sure I understand the idea correctly:

- A privileged container mounts the host's /run as /host/run, and then copies some content there, e.g. /host/run/some-product/bin/some-product
- Something on the host then runs /run/some-product/bin/some-product and expects to locally find a suitable set of libraries, etc.

I may be wrong, but I think that things would have increased chances of working right (e.g. the RHEL7 vs RHEL8 openssl binary example) if you also added the libraries, i.e. if you also had a /run/some-product/lib that would be added to the LD_LIBRARY_PATH before running the binary.
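As a sketch of that suggestion (all paths hypothetical), a small wrapper could prepend the product's bundled library directory before launching its binary:

```sh
# Sketch: run a binary from <root>/bin with <root>/lib taking precedence
# in the dynamic linker's search path. Paths are illustrative.
run_with_bundled_libs() {
  root="$1"; bin="$2"; shift 2
  env LD_LIBRARY_PATH="${root}/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" \
    "${root}/bin/${bin}" "$@"
}

# e.g.: run_with_bundled_libs /run/some-product some-product --help
```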
Going further, and staying in a "container" spirit, I believe that you would practically need to do a chroot to /run/some-product and run the binary from there. Of course, that means you need to expose the relevant files in that chroot. But that means you would not need to copy to some other host location, and it would solve the issue of removing the files when you remove the container.
So in the end, I think that what confuses me is:

- Either you want the content to be transient and removed if the container is removed, in which case it looks like running within the container environment itself, but possibly without dropping capabilities, etc., might be the safest route wrt. existing container build and testing practice (i.e. hiding details of the host OS, like which particular version of the openssl libraries is there);
- Or you want the content to be persistent, which means the container is used to install something on the host, in which case you specifically don't want the files to be removed if the container is removed. However, I don't know how you would remove such files, except by putting them in a transient location (which could be /run) and cleaning up only on reboot.
So maybe, to help me understand better: how do you see things running?
> Going further, and staying in a "container" spirit, I believe that you would practically need to do a chroot to /run/some-product and run the binary from there.
No. This model is intended for binaries that need to execute in the host's mount namespace or otherwise interact with things that run on the host (for example, kubelet). No one should use chroot() in 2020; use a real container runtime if your binaries can be containerized.
Basically, this is not 'in a "container" spirit'; this is abusing containers as a distribution mechanism for host binaries.
> I may be wrong, but I think that things would have increased chances of working right (e.g. the RHEL7 vs RHEL8 openssl binary example)
You're right about this; basically, what one needs to do here to support multiple operating systems is to have e.g. rhel7/ and rhel8/ subdirectories in the container, inspect the host's /etc/os-release, and copy out the right binaries. That's what we ended up doing for the SDN. It definitely gets more complicated if you have library dependencies not on the host, but the hope is that one avoids that. But using LD_LIBRARY_PATH would work too.
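The inspect-and-select step can be sketched like this (the rhel7/ and rhel8/ layout and the /host mount point are illustrative; the function only decides which directory to copy from):

```sh
# Sketch: pick the rhel7/ or rhel8/ binary directory in the container by
# reading VERSION_ID from the host's /etc/os-release.
select_host_plugins() {
  host="${1:-/host}"
  version_id=$(. "${host}/etc/os-release" && echo "${VERSION_ID}")
  case "${version_id%%.*}" in
    7) echo /usr/src/plugins/rhel7/bin ;;
    8) echo /usr/src/plugins/rhel8/bin ;;
    *) echo "unsupported host release: ${version_id}" >&2; return 1 ;;
  esac
}

# e.g.: cp "$(select_host_plugins)"/* /host/run/cni/bin/
```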
That said, see https://github.com/coreos/fedora-coreos-tracker/issues/354#issuecomment-591417441 for the "use RPMs" path.
This also relates to https://github.com/coreos/fedora-coreos-tracker/issues/401 which is more about persistent extensions - things one needs when the host boots, before kubelet, etc.
We emphasize containers, but there are real needs to execute code on the host. Package layering is one approach; it has some advantages and major disadvantages.
In OpenShift today, we have a pattern of privileged containers that lay down and execute some binaries on the host. This avoids the reboots inherent in layering (also doesn't require RPMs).
Recently, however, this pattern broke: running binaries built for e.g. RHEL7 on a RHEL8 host can fail if they link to e.g. openssl. The general best practice here is that the binaries need to target the same userspace as the host. With e.g. statically linked Go/Rust-type code one can avoid most issues, but not all (and you really want to dynamically link openssl).
See this Dockerfile which pairs with this PR.
Further, I think we should move these binaries into e.g. /run/bin or so; if there's a higher-level process that pulls the containers to a node on reboot (e.g. a systemd unit created via Ignition, or in the OpenShift case the kubelet), then having the binaries in /run helps ensure that if e.g. the container is removed, at least the binaries will go away on reboot.

That said... we may consider even shipping something like a /host/usr/bin/coreos-host-overlay install <name> <rootfs> tool that assumes the host rootfs is mounted at /host and handles the case where e.g. a container delivering host binaries is upgraded (or removed) before reboot.

(This problem set quickly generalizes, of course, to something like "transient RPMs".)