containers / toolbox

Tool for interactive command line environments on Linux
https://containertoolbx.org/
Apache License 2.0
2.39k stars 208 forks source link

fedora-toolbox:39 has a dangling /etc/resolv.conf symlink on hosts without systemd-resolved(8) #1410

Closed ianw closed 6 months ago

ianw commented 7 months ago

The 39 (and 40) tag appears to have /etc/resolv.conf in the container as a dangling symlink to systemd/resolve/stub-resolv.conf

# in 39 toolbox
$ podman run --dns=none -it registry.fedoraproject.org/fedora-toolbox:39 /bin/bash -c "ls -l /etc/resolv.conf; rpm -qa | grep systemd-resolv"
lrwxrwxrwx. 1 root root 39 Nov  7 07:55 /etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf
systemd-resolved-254.5-2.fc39.x86_64

# not in 38 toolbox
$ podman run --dns=none -it registry.fedoraproject.org/fedora-toolbox:38 /bin/bash -c "ls -l /etc/resolv.conf; rpm -qa | grep systemd-resolv"
-rw-r--r--. 1 root root 21 Oct  6 06:48 /etc/resolv.conf

# still in 40 toolbox
$ podman run --dns=none -it registry.fedoraproject.org/fedora-toolbox:40 /bin/bash -c "ls -l /etc/resolv.conf; rpm -qa | grep systemd-resolv"
lrwxrwxrwx. 1 root root 39 Nov 20 07:34 /etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf
systemd-resolved-254.5-2.fc40.x86_64

This breaks name resolution on toolbox entry for people on systems like rhel9 that are not using systemd-resolved

ianw commented 7 months ago

I imagine this is also the cause of the problems in https://github.com/containers/toolbox/issues/1406 although it's not specified what system the toolbox is being run on there

debarshiray commented 7 months ago

This looks like a regression from the Fedora 39 Change that started building the fedora-toolbox images as part of the nightly composes, and ported them from a Dockerfile to fedora-kickstarts and pungi-fedora.

Prior to that, the fedora-toolbox images had a simple /etc/resolv.conf:

[rishi@topinka ~]$ podman run --rm --interactive --tty --env TERM=$TERM registry.fedoraproject.org/fedora-toolbox:38 /bin/bash
[root@cf4e0ff025dc /]# ls -l /etc/resolv.conf
-rw-r--r--. 1 root root 100 Dec  1 16:10 /etc/resolv.conf
[root@cf4e0ff025dc /]# cat /etc/resolv.conf
search redhat.com
nameserver 10.0.2.3
nameserver 192.168.0.1
nameserver fe80::7a98:e8ff:fe55:4180%4
[root@cf4e0ff025dc /]# 

... that got turned into a symbolic link to the host's /etc/resolv.conf (at /run/host/etc/resolv.conf) by the Toolbx container's entry point.

The logic in the entry point looks like this:

if _, err := os.Readlink("/etc/resolv.conf"); err != nil {
    if err := redirectPath("/etc/resolv.conf", "/run/host/etc/resolv.conf", false); err != nil {
        return err
    }
}

Unfortunately, os.Readlink doesn't return an error if the target of the symlink is absent, which is what leads to this situation. While we fix this, you can fix your own containers by making sure that the container's /etc/resolv.conf points to /run/host/etc/resolv.conf.

debarshiray commented 7 months ago

I added some tests to ensure that DNS works inside the containers: https://github.com/containers/toolbox/pull/1414

It's not as exhaustive as I would like it to be, because we aren't currently running all the tests on CentOS Stream 9 and Ubuntu 22.10. However, it's a start.

We should also probably check /etc/resolv.conf itself, but that would require this bug to be fixed first.

debarshiray commented 7 months ago

One way to fix this bug would be to always reset the container's /etc/resolv.conf to a known good value on container start. This could cause problems in cases where the user deliberately changed it inside the container, because the customization will get overwritten whenever the container is restarted.

However, we don't preserve such modifications elsewhere either, so this won't make it any worse.

I suppose we need to introduce some sort of a stamp file to mark a container as already initialized, and to not reset any configuration changes made by the user.

ianw commented 7 months ago

This looks like a regression from the Fedora 39 Change that started building the fedora-toolbox images as part of the nightly composes, and ported them from a Dockerfile to fedora-kickstarts and pungi-fedora.

Prior to that, the fedora-toolbox images had a simple /etc/resolv.conf:

One thing that we (we being mostly @wackrat :) is that this symlink is being created by systemd-resolved which appears to be a weak dependency of systemd ... so I think this was not being brought in until a recent change https://pagure.io/fedora-kickstarts/c/49306cb6eada8777eafc2fa7f93f16008c2e93a5?branch=main which starts to pull in weak dependencies.

I don't think that systemd-resolved is required in the container at all? Name resolution seems like a thing for the host. Perhaps this package should be put on a block list?

debarshiray commented 7 months ago

One thing that we (we being mostly @wackrat :) is that this symlink is being created by systemd-resolved which appears to be a weak dependency of systemd ... so I think this was not being brought in until a recent change https://pagure.io/fedora-kickstarts/c/49306cb6eada8777eafc2fa7f93f16008c2e93a5?branch=main which starts to pull in weak dependencies.

Yes, that's exactly it! I didn't notice that we are explicitly pulling in systemd through the fedora-container-toolbox.ks Kickstart.

It's a mistake that crept in during the move to build the images as part of the nightly composes. Earlier, systemd was only one of those packages that we re-installed, if they were already part of the fedora base image, because those had their documentation and translations stripped out.

Last time, systemd was part of the fedora base image was Fedora 36, and even though it pulled in systemd-resolved, the /etc/resolv.conf symbolic link was correct.

Anyway, do you want to submit a pull request to fedora-kickstarts to fix the fedora-toolbox image? Something like:

$ git diff
diff --git a/fedora-container-toolbox.ks b/fedora-container-toolbox.ks
index 89e8ee9d38b9..5e86e5f4cba7 100644
--- a/fedora-container-toolbox.ks
+++ b/fedora-container-toolbox.ks
@@ -82,7 +82,7 @@ shadow-utils
 -shared-mime-info
 -sssd-client
 sudo
-systemd
+-systemd-resolved
 tar # https://bugzilla.redhat.com/show_bug.cgi?id=1409920
 tcpdump
 time

I don't think that systemd-resolved is required in the container at all? Name resolution seems like a thing for the host. Perhaps this package should be put on a block list?

Yes, you are right. We point (or at least try to) the container's /etc/resolv.conf to the host's version of the file, anyway.

debarshiray commented 7 months ago

One way to fix this bug would be to always reset the container's /etc/resolv.conf to a known good value on container start. This could cause problems in cases where the user deliberately changed it inside the container, because the customization will get overwritten whenever the container is restarted.

However, we don't preserve such modifications elsewhere either, so this won't make it any worse.

Apart from fixing the fedora-toolbox image, I think we should also make the entry point more aggressive in resetting the container's /etc/resolv.conf to a known good value on container start. Ultimately, toolbox(1) can be used with so many unknown images out there in the wild that we can't possibly ensure that all the images are correct in every detail.

Pull requests welcome. :)

ianw commented 7 months ago

Anyway, do you want to submit a pull request to fedora-kickstarts to fix the fedora-toolbox image?

https://pagure.io/fedora-kickstarts/pull-request/1010

ianw commented 7 months ago

Apart from fixing the fedora-toolbox image, I think we should also make the entry point more aggressive in resetting the container's /etc/resolv.conf to a known good value on container start. Ultimately, toolbox(1) can be used with so many unknown images out there in the wild that we can't possibly ensure that all the images are correct in every detail.

I see that the container starts with --dns=none (https://github.com/containers/toolbox/blob/9ec3f36908244fd086e9916902956caeaf516f26/src/cmd/create.go#L399) meaning podman doesn't get in the way.

The current logic seems to boil down to

I think the assumption we broke here was that it in the default case, there was a symlink so toolbox assumed it was setup and left it alone?

I think perhaps asserting in the container that the resolv.conf is a plain file might be enough?

debarshiray commented 7 months ago

Anyway, do you want to submit a pull request to fedora-kickstarts to fix the fedora-toolbox image?

https://pagure.io/fedora-kickstarts/pull-request/1010

Thanks, @ianw !

The current logic seems to boil down to

if it's a file in the container; symlink it to the bind mount of the host /etc/resolv.conf if it's a symlink in the container, leave it alone; based on the assumption somebody configured this

Yes, correct.

I think the assumption we broke here was that it in the default case, there was a symlink so toolbox assumed it was setup and left it alone?

Yes, correct.

I think perhaps asserting in the container that the resolv.conf is a plain file might be enough?

Umm... what do you mean by asserting? You mean leaving the current logic as it is?

I was considering making the current logic more aggressive like this:

if resolvConfTarget, err := os.Readlink("/etc/resolv.conf"); err != nil || resolvConfTarget != "/run/host/etc/resolv.conf" {
    if err := redirectPath("/etc/resolv.conf", "/run/host/etc/resolv.conf", false); err != nil {
        return err
    }
}

This has the downside of overwriting some custom user-made modifications to the container's /etc/resolv.conf. However, I think that we need a more explicit way to support those. One idea is to have the entry point create a stamp file for the container after the first run that will survive container restarts. As long as this stamp file is present, subsequent runs of the entry point won't touch the configuration.

On the plus side, being more aggressive with enforcing the known good configuration will protect us from the subtleties of unknown host and image combinations. We can defend against the unknown by further improving our CI by running the full test suite on more host operating systems (eg., we only run a few tests on CentOS Stream 9 and Ubuntu 22.04 today), but there are only so many combinations that we will be able to test.

debarshiray commented 6 months ago

@ianw Does this fix the problem when using the faulty fedora-toolbox images on RHEL 9 hosts: https://github.com/containers/toolbox/pull/1425 ? It seemed to work for me, but a confirmation will be good. :)

debarshiray commented 6 months ago

The fedora-toolbox:40 image has now been fixed, and toolbox(1) itself has been made more resilient.

Once this has gotten some more testing, we can arrange for some backports for the fedora-toolbox:39 image.

debarshiray commented 6 months ago

Thanks for your help getting this fixed, @ianw !

dmitpv commented 6 months ago

fedora-toolbox-39 is work! ver: toolbox-0.0.99.5-1. Thank you!

vwbusguy commented 5 months ago

I hit this on Fedora 39 today. Looks like it might be fixed in rawhide, but not yet for 39. https://bugzilla.redhat.com/show_bug.cgi?id=2258648