mrclrchtr opened this issue 1 week ago
Interesting -- why is `tini` involved here? :thinking:

Do you have something configured on your system that would be putting `tini` inside that container automatically (for example, on `dockerd` there's a `--init` flag that would do so)?

(That being said, I can't reproduce the issue even using `docker run --init` to force `tini` to be the parent of my `dockerd` process, so that doesn't really help much; it's just the only meaningful thread I can see to pull on :sob:)
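For reference, the reproduction attempt looked roughly like this (a sketch; `docker:27-dind` is an assumed tag, and dind needs `--privileged`):

```console
$ # --init injects docker-init (tini) as PID 1, so dockerd runs under it
$ docker run --rm -it --privileged --init docker:27-dind
```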
Not that I know of... There is an earlier container that unpacks "dind-externals" from the GitHub runner image and provides it via a volume mount for dind. But that shouldn't lead to different startup behavior, should it?
This is the log of the v26 image:
cat: can't open '/proc/net/arp_tables_names': No such file or directory
iptables v1.8.10 (nf_tables)
time="2024-06-27T17:34:14.706370867Z" level=info msg="Starting up"
time="2024-06-27T17:34:14.711383174Z" level=info msg="containerd not running, starting managed containerd"
time="2024-06-27T17:34:14.797946949Z" level=info msg="started new containerd process" address=/var/run/docker/containerd/containerd.sock module=libcontainerd pid=346
time="2024-06-27T17:34:14.903422623Z" level=info msg="starting containerd" revision=ae71819c4f5e67bb4d5ae76a6b735f29cc25774e version=v1.7.18
...
...
I'll see if Talos has anything to do with it.
# XXX inject "docker-init" (tini) as pid1 to workaround https://github.com/docker-library/docker/issues/318 (zombie container-shim processes)
set -- docker-init -- "$@"
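For context on what that line does: `set --` replaces the script's positional parameters, so the `exec "$@"` at the end of the entrypoint ends up running the original command under `docker-init` (tini) as PID 1. A minimal sketch of the mechanism (simplified, not the full entrypoint):

```sh
#!/bin/sh
# suppose the script was invoked as: dockerd-entrypoint.sh dockerd --host=unix:///var/run/docker.sock
set -- docker-init -- "$@"
# "$@" is now: docker-init -- dockerd --host=unix:///var/run/docker.sock
exec "$@" # tini becomes PID 1 and reaps zombie child processes
```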
Oh lol, good catch -- I forgot all about that. :sob:
However, that doesn't really help give us more threads to pull because it works fine here, so my only guess is something in the Talos environment or kernel or something? Maybe something about how Kubernetes is creating the container?
Is there any way you could get lower level on the affected system and debug/test more directly with simpler container run commands like `docker run` to help narrow down?
> However, that doesn't really help give us more threads to pull because it works fine here, so my only guess is something in the Talos environment or kernel or something? Maybe something about how Kubernetes is creating the container?
Yes, I also think it has to do with Talos. The question is whether the error message means that `sigtimedwait` is not present?
And I wonder what change the image needs now for this function to work?
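One way to test that theory from inside a pod (a sketch; it assumes the Alpine-based dind image, that `strace` can be installed there, and a privileged pod so ptrace is permitted; note that libc `sigtimedwait()` maps to the `rt_sigtimedwait` syscall):

```console
$ apk add --no-cache strace
$ strace -f -e trace=rt_sigtimedwait docker-init -- sleep 1
```

If the call shows up failing with `EPERM` or `ENOSYS`, then seccomp or the kernel is blocking the syscall that tini relies on.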
> Is there any way you could get lower level on the affected system and debug/test more directly with simpler container run commands like `docker run` to help narrow down?
No, unfortunately not. Talos is built in such a way that you can't even set up an SSH tunnel to the machine.
But I could build a very simple Kubernetes deployment with just the image. That's a good idea and would help isolate the error.
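A minimal version of that test could look something like this (a sketch; the pod name and image tag are placeholders, and dind needs a privileged security context):

```console
$ kubectl run dind-test --image=docker:27-dind --restart=Never \
    --overrides='{"apiVersion":"v1","spec":{"containers":[{"name":"dind-test","image":"docker:27-dind","securityContext":{"privileged":true}}]}}'
$ kubectl logs -f dind-test
```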
Thank you very much for your help. I'll get back to you as soon as I have more information.
I tried to upgrade from v26 to v27.
I want to use Docker dind in a GitHub Actions runner scale set with the following config:
This is the complete log I can get:
The underlying OS is Talos v1.7.4.
Do you have any idea what's happening?