docker / for-linux

Docker Engine for Linux
https://docs.docker.com/engine/installation/

Error response from daemon: failed to listen to abstract unix socket "/containerd-shim/moby/<uuid>/shim.sock": listen unix /containerd-shim/moby/<uuid>/shim.sock: bind: address already in use: unknown #643

Open kolbitsch-lastline opened 5 years ago

kolbitsch-lastline commented 5 years ago

I run containers using the "restart always" policy, but in some situations (the trigger is unclear to me at this point), a subset of containers fail to be restarted by the docker daemon.
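(For context, "restart always" here is the restart policy set when a container is created; roughly equivalent to the following sketch, with placeholder names standing in for the docker-compose setup shown further below:)

docker run -d --restart always --name worker-000-801_1 <my-image>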

In this example, I have a bunch of services that all have (almost) identical configs, and a random subset of the service containers is suddenly down (after days of running fine):

root@analyst:~# docker ps | grep worker.000.802
root@analyst:~# docker ps -a | grep worker.000.802
6d504138f7f7        <my-image>    "/entrypoint.sh"         4 weeks ago          Exited (255) 8 days ago          worker-000-802_1

other container instances of the service are running fine (and are restarted every once in a while):

root@analyst:~# docker ps | grep worker.000.801
832f53c0f4ce        <my-image>    "/entrypoint.sh"         4 weeks ago          Up 28 minutes          worker-000-801_1

When I try (for testing) to manually restart the container that the daemon failed to restart automatically, this fails:

root@analyst:~# docker start 6d504138f7f7
Error response from daemon: failed to listen to abstract unix socket "/containerd-shim/moby/6d504138f7f7ddcd57437006a3a6e70ec4c8ed32c08b5969d788f24eef28f51f/shim.sock": listen unix /containerd-shim/moby/6d504138f7f7ddcd57437006a3a6e70ec4c8ed32c08b5969d788f24eef28f51f/shim.sock: bind: address already in use: unknown
Error: failed to start containers: 6d504138f7f7

Investigating the problem, I found that the unix socket mentioned above does not exist on the file-system, but the error message says "already in use", so I searched via lsof:

root@analyst:~# lsof -U | grep 6d504138f7f7ddcd57437006a3a6e70ec4c8ed32c08b5969d788f24eef28f51f
docker-co 37032            root    3u  unix 0xffff88030db67800      0t0 502614215 @/containerd-shim/moby/6d504138f7f7ddcd57437006a3a6e70ec4c8ed32c08b5969d788f24eef28f51f/shim.sock
docker-co 37032            root    6u  unix 0xffff880dd67da1c0      0t0 323429479 @/containerd-shim/moby/6d504138f7f7ddcd57437006a3a6e70ec4c8ed32c08b5969d788f24eef28f51f/shim.sock

so, indeed the socket is in use, but not on the file-system... which makes me wonder if the process (PID 37032) actually removed it, but didn't properly close it (yet?) while shutting down?
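(For reference: the leading "@" in the lsof output indicates abstract-namespace unix sockets, which never appear as files on disk, so they only show up in the kernel's socket tables. A quick way to list them, as a sketch reusing the shim path from above:)

# abstract unix sockets carry a leading '@' in the Path column
grep containerd-shim /proc/net/unix
# or via ss: -x = unix sockets, -l = listening, -p = owning process
sudo ss -xlp | grep containerd-shim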

stracing the process shows that it's currently waiting on a mutex:

root@analyst:~# strace -p 37032
Process 37032 attached
futex(0x7fd008, FUTEX_WAIT, 0, NULL

with no other behavior.

To test further, I decided to kill the process that's supposed to provide the unix socket, and now I can start the container successfully:

root@analyst:~# kill 37032

root@analyst:~# docker start 6d504138f7f7
6d504138f7f7
root@analyst:~# docker ps -a | grep worker.000.802
6d504138f7f7        <my-image>    "/entrypoint.sh"         4 weeks ago          Up 3 seconds          worker-000-802_1

Expected behavior

Docker restart policy "always" always restarts a container.

Actual behavior

Docker restart policy "always" randomly fails after a service has been running for longer periods of time (maybe because containerd does not correctly terminate/release the unix socket).

Steps to reproduce the behavior

I have not been able to trigger the problem in a reproducible way, but I have seen dozens of instances over weeks of running services. Interestingly, it happens on different services that use completely unrelated images (aside from sharing a common Debian-based base image).

Output of docker version:

Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        5f88b8b
 Built:             Fri Sep 28 15:50:02 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       5f88b8b
  Built:            Fri Sep 28 15:49:28 2018
  OS/Arch:          linux/amd64
  Experimental:     false

Output of docker info:

Containers: 51
 Running: 50
 Paused: 0
 Stopped: 1
Images: 7
Server Version: 18.06.1-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 134
 Dirperm1 Supported: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 3.13.0-157-generic
Operating System: Ubuntu 14.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 62.87GiB
Name: analyst
ID: AKNM:4XYS:MIJI:G2E6:5DRO:MP2I:Q2MY:CXPE:WDJW:MI4D:WS32:O3ON
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Physical host, under constant and high load. The containers that show the problem have memory limits in place, set via docker-compose:

version: "2.4"

services:
  worker-000-801:
    image: "<my-image>"
    network_mode: "host"
...
    mem_limit: 4294967296

Note that I'm using a private container registry, which is why I decided to replace the image data with my-image.

The only potentially-related bug I managed to find online is this:

https://github.com/moby/moby/issues/38726

1028866041 commented 5 years ago

I've got the same issue in my project...

slhck commented 5 years ago

Same problem here with Docker updates that don't restart the containers.

Running docker rm for the affected containers and re-creating them works, but is not ideal.

Edit: Gave myself a +1 a few months later because I had the same issue and found my own answer as a solution…

sergiomafra commented 5 years ago

Same problem here with Docker updates that don't restart the containers.

Running docker rm for the affected containers and re-creating them works, but is not ideal.

I've got the same problem here and I can't get rid of it, unfortunately.

ravensorb commented 5 years ago

Any update on this? I too am hitting it.

sergiomafra commented 5 years ago

Any update on this? I too am hitting it.

The solution that worked for me was to destroy the container and create a new one reusing the volume from the old one.
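Roughly, as a sketch (container name, image, volume name and mount point are placeholders here):

docker stop worker-000-802_1
docker rm worker-000-802_1          # named volumes are not removed by 'docker rm'
docker run -d --restart always -v mydata:/data --name worker-000-802_1 <my-image>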

windli2018 commented 5 years ago

Try to find the docker process that holds the socket and kill it; that will resolve the issue:

uuid=<container-id>        # the container ID from the error message
for pid in $(ps -ef | grep docker | awk '{print $2}'); do
  lsof -p "$pid" | grep "$uuid" && echo "socket held by pid $pid"
done
kill -9 <pid>              # the pid found above

chenz-svsarrazin commented 5 years ago

I also have this problem whenever the docker package in Ubuntu gets updated. Not sure whether this is a problem with the packaging or with docker itself.

limadm commented 5 years ago

@chenz-svsarrazin Thanks! An apt update + upgrade worked and I didn't need to recreate the container. (Ubuntu 18.04.2 LTS + Docker version 18.09.7, build 2d0083d).

Rich43 commented 5 years ago

Reproduced on Ubuntu 18.10, not sure what caused it but all my servers/containers randomly went down. Could have been an update.

slhck commented 5 years ago

There was the 18.09.7 update a few days ago (security update) which restarted the Docker service and, for me, brought down four web servers and corrupted one database. Regular start-up didn't work due to these errors.

airomyas commented 5 years ago

kill -9 $(netstat -lnp | grep containerd-sh | awk '{print $9}' | cut -d / -f 1)
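
On hosts without netstat, ss should give the same information (a sketch; the grep pattern assumes the shim process still shows up as containerd-shim):

sudo ss -xlp | grep containerd-shim
kill -9 <pid>    # pid taken from the ss output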

sanekmihailow commented 5 years ago

I tried killing the pid and downgrading Docker, but it didn't help me. My solution:

1) upgrade Docker from 18.09.2 to 18.09.7 (build 2d0083d)
2) upgrade docker-compose to 1.24.0
3) start the container, but I still received the same error:

# docker start freepbx
...
bind: address already in use: unknown
Error: failed to start containers: freepbx

4) then I rebooted. After the reboot the error was gone and the container started.

ajay-awachar commented 5 years ago

Solution which worked for me.

  1. Reboot the node/server
  2. Restart the docker service
  3. Start the docker container

Mattie112 commented 4 years ago

We just had the same issue after updating the docker package to version docker-ce-19.03.3-3.el7.x86_64. On CentOS Linux release 7.7.1908 (Core).

Exactly the same as in the first post, however killing the docker pid did not work for us, and neither did a docker restart. A reboot of the entire server solved the problem.

Any more news on this issue? It is really scary that this can happen to our production services.

dasDaniel commented 4 years ago

same issue, couldn't run after update

Error: Cannot start service odfenode: failed to listen to abstract unix socket

Version

Client: Docker Engine - Community
 Version:           19.03.3
 API version:       1.40
 Go version:        go1.12.10
 Git commit:        a872fc2
 Built:             Tue Oct  8 00:59:54 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.3
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.10
  Git commit:       a872fc2
  Built:            Tue Oct  8 00:58:28 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.6
  GitCommit:        894b81a4b802e4eb2a91d1ce216b8817763c29fb
 runc:
  Version:          1.0.0-rc8+dev
  GitCommit:        3e425f80a8c931f88e6d94a8c831b9d5aa481657
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Granjow commented 4 years ago

Same issue here.

In my case, it helped to downgrade to an earlier docker version and then restart the system (just restarting docker did not help). No need to redeploy/remove existing containers.

Example for Ubuntu Xenial:

sudo apt install docker-ce=5:19.03.1~3-0~ubuntu-xenial

Redsandro commented 4 years ago

I am seeing the same issue after updating to Docker version 19.03.4. I cannot reboot this Debian machine without a lot of hassle. I wish I hadn't upgraded Docker.

Captain Hindsight advice: pin the docker version.
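On Debian/Ubuntu that can be done with apt-mark, for example (a sketch; the package names assume the docker-ce packaging rather than the snap):

sudo apt-mark hold docker-ce docker-ce-cli containerd.io
# later, to allow upgrades again:
sudo apt-mark unhold docker-ce docker-ce-cli containerd.io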

You wouldn't expect this from the non-edge channel.

ducmanhnguyen commented 4 years ago

Try to find the docker process that holds the socket and kill it; that will resolve the issue:

uuid=<container-id>        # the container ID from the error message
for pid in $(ps -ef | grep docker | awk '{print $2}'); do
  lsof -p "$pid" | grep "$uuid" && echo "socket held by pid $pid"
done
kill -9 <pid>              # the pid found above

this one saved my day! thanks

cpuguy83 commented 4 years ago

@thaJeztah Is this the same as the other issues where it was some packaging-related problem?

slimsag commented 4 years ago

This SO post seems to indicate this may be an issue with the Ubuntu Snap package and that the following may resolve it:

# Remove snap installation, any prior Docker installations
sudo snap remove docker
sudo apt-get remove docker docker-engine docker.io

# Install latest Docker.io version
sudo apt-get update
sudo apt install docker.io

# Run Docker on startup
sudo systemctl start docker
sudo systemctl enable docker
slimsag commented 4 years ago

Based on the comments it seems this happens on 19.03.3 and 19.03.4, and we had someone reproduce it on Xenial with 19.03.8 as well, but I was NOT able to reproduce it with the following:

Add the apt repository if you don't already have it:

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update

Install Docker CE 19.03.8 (latest) explicitly:

sudo apt install docker-ce=19.03.8~3-0~ubuntu-xenial

Jack-2001 commented 2 years ago

Does someone know what the actual reason for this issue is?

nvkhoi112358 commented 1 month ago

Still getting a similar issue, but with version 27.1.1. In this case, the first time nginx starts (using docker compose 2.29.1) it is OK, but after a down and then an up again it gets the error. With the same configuration on the host, it is OK. With a traditional unix socket (visible on the filesystem), it is OK.