docker-library / docker

Docker Official Image packaging for Docker

Apache License 2.0

after upgrade from v26 to v27 dind fails to start: Unexpected error in sigtimedwait: 'Function not implemented' #503

Open mrclrchtr opened 1 week ago

mrclrchtr commented 1 week ago

I tried to upgrade from v26 to v27.

I want to use Docker dind in a GitHub Actions runner scale set with the following config:

image: docker:27.0.2-dind
name: dind
securityContext:
  privileged: true
env:
  - name: DOCKER_GROUP_GID
    value: "123"
resources:
  requests:
    cpu: 300m
    memory: 500Mi
  limits:
    cpu: 300m
    memory: 500Mi
args:
  - dockerd
  - --host=unix:///var/run/docker.sock
  - --group=$(DOCKER_GROUP_GID)

This is the complete log I can get:

cat: can't open '/proc/net/arp_tables_names': No such file or directory
iptables v1.8.10 (nf_tables)
[FATAL tini (1)] Unexpected error in sigtimedwait: 'Function not implemented'

The underlying OS is Talos v1.7.4.

Do you have any idea what's happening?

tianon commented 1 week ago

Interesting -- why is tini involved here? :thinking:

Do you have something configured on your system that would be putting tini inside that container automatically (for example, on dockerd there's a --init flag that would do so)?

(That being said, I can't reproduce the issue even using docker run --init to force tini to be the parent of my dockerd process, so that doesn't really help much, it's just the only meaningful thread I can see to pull on :sob:)

mrclrchtr commented 1 week ago

Not that I know of... there is an earlier container that unpacks "dind-externals" from the github runner image and provides it via a volume mount for dind. But that shouldn't lead to a different startup behavior, should it?

This is the log of the v26 image:

cat: can't open '/proc/net/arp_tables_names': No such file or directory
iptables v1.8.10 (nf_tables)
time="2024-06-27T17:34:14.706370867Z" level=info msg="Starting up"
time="2024-06-27T17:34:14.711383174Z" level=info msg="containerd not running, starting managed containerd"
time="2024-06-27T17:34:14.797946949Z" level=info msg="started new containerd process" address=/var/run/docker/containerd/containerd.sock module=libcontainerd pid=346
time="2024-06-27T17:34:14.903422623Z" level=info msg="starting containerd" revision=ae71819c4f5e67bb4d5ae76a6b735f29cc25774e version=v1.7.18
...
...

I'll see if Talos has anything to do with it.

mrclrchtr commented 1 week ago

I found this: https://github.com/docker-library/docker/blob/c0963f96ace4f48d13385cbf20356ae605edcb8b/27/dind/dockerd-entrypoint.sh#L143C2-L144C28

# XXX inject "docker-init" (tini) as pid1 to workaround https://github.com/docker-library/docker/issues/318 (zombie container-shim processes)
set -- docker-init -- "$@"
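
For readers unfamiliar with the idiom: `set --` replaces the script's positional parameters, so the line above prepends `docker-init --` to whatever command the entrypoint was about to run. That would explain why tini shows up as PID 1 even though nothing in the pod spec asks for it. A minimal sketch of just that mechanism follows; the function name `wrap_with_init` is made up for illustration, and printing stands in for the entrypoint actually executing the command:

```shell
#!/bin/sh
# Hypothetical helper: mimics only the `set --` line quoted above.
wrap_with_init() {
	# Prepend "docker-init --" to the current positional parameters.
	set -- docker-init -- "$@"
	# The real script would go on to run "$@"; here we only print it.
	printf '%s\n' "$*"
}

wrap_with_init dockerd --host=unix:///var/run/docker.sock
```

With the arguments shown, this prints the rewritten command line with `docker-init --` in front of `dockerd`.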

tianon commented 1 week ago

Oh lol, good catch -- I forgot all about that. :sob:

However, that doesn't really help give us more threads to pull because it works fine here, so my only guess is something in the Talos environment or kernel or something? Maybe something about how Kubernetes is creating the container?

Is there any way you could get lower level on the affected system and debug/test more directly with simpler container run commands like docker run to help narrow down?

mrclrchtr commented 1 week ago

> However, that doesn't really help give us more threads to pull because it works fine here, so my only guess is something in the Talos environment or kernel or something? Maybe something about how Kubernetes is creating the container?

Yes, I also think it has to do with Talos. The question is whether the error message means that sigtimedwait is not available there.

I also wonder which change to the image makes this function necessary now.

> Is there any way you could get lower level on the affected system and debug/test more directly with simpler container run commands like docker run to help narrow down?

No, unfortunately not. Talos is built in such a way that you can't even set up an SSH tunnel to the machine.

But I could build a very simple Kubernetes deployment with just the image. That's a good idea and helps to isolate the error.
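
A rough sketch of what such a stripped-down reproduction might look like, assuming a bare Pod is enough (no runner scale set, no "dind-externals" volume, no custom args). The helper name `render_dind_pod` and the pod name `dind-repro` are made up for illustration; only the image tag and the privileged setting come from the config above:

```shell
#!/bin/sh
# Hypothetical helper: prints a bare-bones Pod manifest for the given image.
render_dind_pod() {
	image="$1"
	# Unquoted heredoc delimiter so ${image} is expanded.
	cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: dind-repro
spec:
  restartPolicy: Never
  containers:
    - name: dind
      image: ${image}
      securityContext:
        privileged: true
EOF
}

render_dind_pod docker:27.0.2-dind
```

Piping the output into `kubectl apply -f -` and then reading `kubectl logs dind-repro` should show whether the tini error still appears without the rest of the runner setup. Rendering the same manifest with a v26 tag would give a direct comparison.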

Thank you very much for your help. I'll get back to you as soon as I have more information.