containers / toolbox

Tool for interactive command line environments on Linux
https://containertoolbx.org/
Apache License 2.0

Can't enter fedora 40 toolbx: Failed to bind /etc/machine-id #1512

Open A6GibKm opened 2 weeks ago

A6GibKm commented 2 weeks ago

When starting my Fedora 40 toolbox (on Fedora Silverblue 40) I see the message:

level=debug msg="Running as real user ID 0"
level=debug msg="Resolved absolute path to the executable as /usr/bin/toolbox"
level=debug msg="TOOLBOX_PATH is /usr/bin/toolbox"
level=debug msg="Migrating to newer Podman"
level=debug msg="Migration not needed: running inside a container"
level=debug msg="Setting up configuration"
level=debug msg="Setting up configuration: file /etc/containers/toolbox.conf not found"
level=debug msg="Setting up configuration: file /root/.config/containers/toolbox.conf not found"
level=debug msg="Resolving container and image names"
level=debug msg="Container: ''"
level=debug msg="Distribution (CLI): ''"
level=debug msg="Image (CLI): ''"
level=debug msg="Release (CLI): ''"
level=debug msg="Resolved container and image names"
level=debug msg="Container: 'fedora-toolbox-40'"
level=debug msg="Image: 'fedora-toolbox:40'"
level=debug msg="Release: '40'"
level=debug msg="Creating /run/.toolboxenv"
level=debug msg="Path /run/host/etc exists"
level=debug msg="Resolved /etc/localtime to /run/host/usr/share/zoneinfo/Europe/Vienna"
level=debug msg="Creating regular file /etc/machine-id"
level=debug msg="Binding /etc/machine-id to /run/host/etc/machine-id"
mount: /etc/machine-id: must be superuser to use mount.
       dmesg(1) may have more information after failed mount system call.
Error: failed to bind /etc/machine-id to /run/host/etc/machine-id

See

$ ls -la /etc/machine-id
-rw-r--r--. 1 root root 33 oct 19  2020 /etc/machine-id
$ toolbox --version
toolbox version 0.0.99.5
$ podman --version
podman version 5.1.0
$ uname -a
Linux alpha 6.9.4-200.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Jun 12 13:33:34 UTC 2024 x86_64 GNU/Linux

debarshiray commented 2 weeks ago

The daily CI runs, linked from README.md, include package-based Fedora 40, and they are passing.

I wonder if this is specific to Fedora Silverblue.

What happens if you do this on the host:

# touch /tmp/machine-id
# mount --rbind /etc/machine-id /tmp/machine-id

A6GibKm commented 2 weeks ago

This is not my only Silverblue 40 machine; the other one seems to be able to run the toolbox just fine.

$ touch /tmp/machine-id
$ mount --rbind /etc/machine-id /tmp/machine-id
mount: /tmp/machine-id: must be superuser to use mount.
       dmesg(1) may have more information after failed mount system call.
$ sudo mount --rbind /etc/machine-id /tmp/machine-id
$ toolbox enter
Error: failed to initialize container fedora-toolbox-40

I also tried the touch with sudo.

A6GibKm commented 2 weeks ago

For more context, this is the second time it has happened this month. I recreated the toolbox only a few days ago. I also saw another report in one of the GNOME Matrix channels.

debarshiray commented 2 weeks ago

> This is not my only Silverblue 40 machine; the other one seems to be able to run the toolbox just fine.
>
> $ touch /tmp/machine-id
> $ mount --rbind /etc/machine-id /tmp/machine-id
> mount: /tmp/machine-id: must be superuser to use mount.
>        dmesg(1) may have more information after failed mount system call.
> $ sudo mount --rbind /etc/machine-id /tmp/machine-id

The mount(8) call has to be run as root. That's why I used a # prompt in my example.

Looking at the code, /etc/machine-id is the first bind mount that the container's entry point attempts to do, and then there's this from the container's logs:

level=debug msg="Running as real user ID 0"
...
level=debug msg="Binding /etc/machine-id to /run/host/etc/machine-id"
mount: /etc/machine-id: must be superuser to use mount.
       dmesg(1) may have more information after failed mount system call.

Those two things can't both be true: the entry point reports running as real user ID 0, yet mount(8) insists that it is not the superuser. So, I am beginning to wonder if something is going wrong inside mount(8). It would be revealing to prepend a call to strace(1) and then try it with a Toolbx container that includes the strace(1) binary. Something like:

$ git diff
diff --git a/src/cmd/initContainer.go b/src/cmd/initContainer.go
index de7bcfcc5302..c6108edc4135 100644
--- a/src/cmd/initContainer.go
+++ b/src/cmd/initContainer.go
@@ -724,6 +724,7 @@ func mountBind(containerPath, source, flags string) error {
        logrus.Debugf("Binding %s to %s", containerPath, source)

        args := []string{
+               "mount",
                "--rbind",
        }

@@ -733,7 +734,7 @@ func mountBind(containerPath, source, flags string) error {

        args = append(args, []string{source, containerPath}...)

-       if err := shell.Run("mount", nil, nil, nil, args...); err != nil {
+       if err := shell.Run("strace", nil, nil, nil, args...); err != nil {
                return fmt.Errorf("failed to bind %s to %s", containerPath, source)
        }

However, you need Toolbx to build Toolbx on Fedora Silverblue. So, I suppose I should put together a debug RPM.
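
The change in the diff above amounts to moving the original program name to the front of the argument list and running strace(1) instead. A minimal, standalone sketch of that idea follows. The helper name wrapWithStrace is hypothetical and is not part of the Toolbx source, and the example omits any extra mount flags that the real function may add.

```go
package main

import "fmt"

// wrapWithStrace is a hypothetical helper (not part of Toolbx). It returns
// the program and argument list needed to run the given command under
// strace(1): the original program name becomes the first argument.
func wrapWithStrace(program string, args []string) (string, []string) {
	wrapped := make([]string, 0, len(args)+1)
	wrapped = append(wrapped, program)
	wrapped = append(wrapped, args...)
	return "strace", wrapped
}

func main() {
	// Same source and container path as in the failing log line.
	program, args := wrapWithStrace("mount", []string{
		"--rbind",
		"/run/host/etc/machine-id",
		"/etc/machine-id",
	})
	fmt.Println(program, args)
}
```

Because the wrapped command is run by name, this only works if the strace(1) binary exists inside the container.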

A6GibKm commented 2 weeks ago

I just upgraded my other machine and its toolbox still works. This is very weird considering the machines are configured the same (afaik). If you prepare an RPM or a binary, I can try that. Thanks!

debarshiray commented 2 weeks ago

Just tried it with this Fedora 40 Silverblue deployment and couldn't reproduce:

Deployments:
● fedora:fedora/40/x86_64/silverblue
                  Version: 40.20240618.0 (2024-06-18T00:52:57Z)
               BaseCommit: fa68d62df2fae64e52bbfe15784915c78ab2914767cacded8c5de2f5b7ddab62
             GPGSignature: Valid signature by 115DF9AEF857853EE8445D0A0727707EA15B79CC

Just to be sure, do you have the same deployment on both your machines?

debarshiray commented 2 weeks ago

I submitted a Fedora 40 build for a debug RPM: https://koji.fedoraproject.org/koji/taskinfo?taskID=119303969

A6GibKm commented 2 weeks ago

No, the broken machine is on yesterday's deployment (40.20240618.0, 2024-06-18T00:52:57Z), and the other machine, which works, has today's. I am upgrading right now, but I don't think that is the cause.

A6GibKm commented 2 weeks ago

Attached the output of

strace toolbox enter &> strace.txt

with the debug build. Is that enough?

strace.txt

EDIT: Note that the error is different this time? I still see

jun 19 20:47:28 alpha fedora-toolbox-40[8291]: Error: failed to bind /etc/machine-id to /run/host/etc/machine-id

in journalctl -b.

debarshiray commented 2 weeks ago

> Attached the output of
>
> strace toolbox enter &> strace.txt

We don't need to run strace(1) against toolbox enter. For that we wouldn't need a debug build.

The debug build adjusts the toolbox(1) binary so that the container's entry point runs mount(8) under strace(1) inside the container. So we need to look at the strace(1) output from podman start --attach or podman logs.

A6GibKm commented 2 weeks ago

Sorry, I am not sure how to get the strace output from inside the container. Do you mean

$ strace podman start --attach fedora-toolbox-40 &> podman-attach.txt

? If so, it is attached below.

podman-attach.txt

debarshiray commented 2 weeks ago

No need to manually attach strace(1) anywhere.

Before you install the debug build of toolbox, ensure that you have a Toolbx container with strace(1) in it.

Then, install the debug build of toolbox, stop all your containers with podman stop --all, and try to enter the container that has strace(1) in it. If the error reproduces, then share with us what you have in podman start --attach ... or podman logs ....
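
In the logs quoted earlier in this thread, Toolbx's own messages start with level=, while the output from mount(8) does not. Assuming strace(1) output shows up the same way, a small filter like the sketch below could pull those unprefixed lines out of saved log output before sharing it. The function name rawLines is hypothetical and the prefix rule is only an assumption based on the quoted logs.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// rawLines returns the non-empty lines that do not start with "level=",
// that is, lines not written by Toolbx's own logger. This prefix rule is
// an assumption based on the log excerpts quoted in this thread.
func rawLines(logText string) []string {
	var out []string
	for _, line := range strings.Split(logText, "\n") {
		if strings.TrimSpace(line) == "" || strings.HasPrefix(line, "level=") {
			continue
		}
		out = append(out, line)
	}
	return out
}

func main() {
	// Read saved log output from standard input and print the raw lines.
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, "reading standard input:", err)
		os.Exit(1)
	}
	for _, line := range rawLines(string(data)) {
		fmt.Println(line)
	}
}
```

It could be fed with something like podman logs fedora-toolbox-40 2>&1 | go run rawlines.go, where rawlines.go is a hypothetical file name for the program above.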

nielsdg commented 2 weeks ago

FWIW, I can reliably reproduce my toolbox containers breaking after doing a reboot

nielsdg commented 2 weeks ago

Maybe this is related? https://discussion.fedoraproject.org/t/rpm-ostree-update-breaks-toolbox-fedora-40/120095/4

A6GibKm commented 2 weeks ago

I did a reset of my config. Here is a diff of the earlier and newer output of podman system info:

--- a   2024-06-19 23:28:25.883686898 +0200
+++ b   2024-06-19 23:28:38.536401465 +0200
@@ -13,17 +13,17 @@
     path: /usr/bin/conmon
     version: 'conmon version 2.1.10, commit: '
   cpuUtilization:
-    idlePercent: 91.02
-    systemPercent: 3.79
-    userPercent: 5.2
+    idlePercent: 92.44
+    systemPercent: 3.63
+    userPercent: 3.93
   cpus: 16
-  databaseBackend: boltdb
+  databaseBackend: sqlite
   distribution:
     distribution: fedora
     variant: silverblue
     version: "40"
   eventLogger: journald
-  freeLocks: 2047
+  freeLocks: 2048
   hostname: alpha
   idMappings:
     gidmap:
@@ -43,7 +43,7 @@
   kernel: 6.9.4-200.fc40.x86_64
   linkmode: dynamic
   logDriver: journald
-  memFree: 11161944064
+  memFree: 10079854592
   memTotal: 16673759232
   networkBackend: netavark
   networkBackendInfo:
@@ -99,7 +99,7 @@
       libseccomp: 2.5.5
   swapFree: 8589930496
   swapTotal: 8589930496
-  uptime: 0h 2m 32.00s
+  uptime: 0h 2m 18.00s
   variant: ""
 plugins:
   authorization: null
@@ -122,25 +122,25 @@
 store:
   configFile: /var/home/deathwish/.config/containers/storage.conf
   containerStore:
-    number: 1
+    number: 0
     paused: 0
     running: 0
-    stopped: 1
+    stopped: 0
   graphDriverName: overlay
   graphOptions: {}
   graphRoot: /var/home/deathwish/.local/share/containers/storage
   graphRootAllocated: 1000204886016
-  graphRootUsed: 952381968384
+  graphRootUsed: 949989421056
   graphStatus:
     Backing Filesystem: btrfs
-    Native Overlay Diff: "false"
+    Native Overlay Diff: "true"
     Supports d_type: "true"
-    Supports shifting: "true"
+    Supports shifting: "false"
     Supports volatile: "true"
     Using metacopy: "false"
   imageCopyTmpDir: /var/tmp
   imageStore:
-    number: 1
+    number: 0
   runRoot: /run/user/1000/containers
   transientStore: false
   volumePath: /var/home/deathwish/.local/share/containers/storage/volumes

debarshiray commented 1 week ago

> Maybe this is related? https://discussion.fedoraproject.org/t/rpm-ostree-update-breaks-toolbox-fedora-40/120095/4

I quickly skimmed through it. On the surface it doesn't seem related to why mount(8) thinks that it's not running as root.
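
The entry point's log line only reports the real user ID. As a purely illustrative sketch, a debug build could also report the effective user ID next to it. Nothing in this thread establishes that the two differ or that such a difference is behind the failure, and the helper name describeIDs is hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// describeIDs is a hypothetical helper that summarizes whether the real
// and effective user IDs of a process match.
func describeIDs(realUID, effectiveUID int) string {
	if realUID == effectiveUID {
		return fmt.Sprintf("real and effective user IDs are both %d", realUID)
	}
	return fmt.Sprintf("real user ID %d differs from effective user ID %d", realUID, effectiveUID)
}

func main() {
	// os.Getuid returns the real user ID, os.Geteuid the effective one.
	fmt.Println(describeIDs(os.Getuid(), os.Geteuid()))
}
```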

debarshiray commented 1 week ago

> I did a reset of my config. Here is a diff of the earlier and newer output of podman system info:

Did resetting the Podman configuration reliably fix this problem?

debarshiray commented 1 week ago

> FWIW, I can reliably reproduce my toolbox containers breaking after doing a reboot

Okay, that's great. Are you in a position to get the strace(1) logs using the debug build, like I described above? If things are really badly broken, then I can come up with other steps. :)

A6GibKm commented 1 week ago

Not at home at the moment, but no. I was not able to create new toolboxes. I will check in more detail later today.

debarshiray commented 1 week ago

> I was not able to create new toolboxes.

Why? What was the exact problem?

If you can't enter a container to install strace, then you can create a custom image using a Container/Dockerfile like this:

FROM registry.fedoraproject.org/fedora:40
RUN dnf --assumeyes install strace

... followed by:

$ podman build --squash --tag localhost/strace-toolbox:40 /path/to/dir/with/Containerfile

Then you can create a container from this image:

$ toolbox create --image localhost/strace-toolbox:40

Then you can try to enter it with the debug toolbox RPM above and see what shows up in podman start --attach or podman logs.

A6GibKm commented 1 week ago

I was able to enter that container without any issues, so there was nothing to strace :(. By the way, after removing the debug build of toolbox I am able to create and enter new toolboxes (after the podman system reset).

alistair23 commented 1 week ago

> For more context, this is the second time it has happened this month. I recreated the toolbox only a few days ago. I also saw another report in one of the GNOME Matrix channels.

The exact same thing happened to me. I just rebuilt all the containers and now can't enter them again.

alistair23 commented 1 week ago

> > I was not able to create new toolboxes.
>
> Why? What was the exact problem?
>
> If you can't enter a container to install strace, then you can create a custom image using a Container/Dockerfile like this:
>
> FROM registry.fedoraproject.org/fedora:40
> RUN dnf --assumeyes install strace
>
> ... followed by:
>
> $ podman build --squash --tag localhost/strace-toolbox:40 /path/to/dir/with/Containerfile
>
> Then you can create a container from this image:
>
> $ toolbox create --image localhost/strace-toolbox:40
>
> Then you can try to enter it with the debug toolbox RPM above and see what shows up in podman start --attach or podman logs.

I tried to follow this, but newly created images work. It's just existing ones I can't enter.

For the last month, it seems like the container images need to be rebuilt after every reboot on my Kinoite system.

nielsdg commented 3 days ago

(Un)fortunately, I can't reproduce this anymore after doing a Silverblue update and resetting my containers as recommended in that previous link. So I can't really help with this anymore, but hey, at least things work again :-)