docker / for-linux

Docker Engine for Linux
https://docs.docker.com/engine/installation/

On server reboot, container exits with code 128, won't retry #293

Open Enderer opened 6 years ago

Enderer commented 6 years ago

Actual behavior

After rebooting the server the container does not start back up. The container tries to start but exits with code 128. This looks like it's due to the network volume not being available at the time of startup; it takes a few seconds before the volume is ready. The message "no such device" appears in the error log. Manually starting the container works because the network volume is available by then.

The container is set to restart=always but Docker does not attempt to restart the container. RestartCount is 0.

Here is the docker command:

docker run -d \
--name=plex \
--net=host \
--restart=always \
-v /home/user/plex/config:/config \
-v /home/user/plex/transcode:/transcode \
-v /mnt/tanagra/public:/tanagra/public \
linuxserver/plex

Here is the error message from docker inspect:

        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 128,
            "Error": "OCI runtime create failed: container_linux.go:348: starting container process caused \"process_linux.go:402: container init caused \\\"rootfs_linux.go:58: mounting \\\\\\\"/mnt/tanagra/public\\\\\\\" to rootfs \\\\\\\"/var/lib/docker/overlay2/6a990b540b574977de4d0b6197b3b033e4ab6890813eb592058d005db70337be/merged\\\\\\\" at \\\\\\\"/var/lib/docker/overlay2/6a990b540b574977de4d0b6197b3b033e4ab6890813eb592058d005db70337be/merged/tanagra/public\\\\\\\" caused \\\\\\\"no such device\\\\\\\"\\\"\": unknown",

Output of docker version:

Client:                                    
 Version:      18.03.1-ce                  
 API version:  1.37                        
 Go version:   go1.9.5                     
 Git commit:   9ee9f40                     
 Built:        Thu Apr 26 07:17:20 2018    
 OS/Arch:      linux/amd64                 
 Experimental: false                       
 Orchestrator: swarm                       

Server:                                    
 Engine:                                   
  Version:      18.03.1-ce                 
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5                    
  Git commit:   9ee9f40                    
  Built:        Thu Apr 26 07:15:30 2018   
  OS/Arch:      linux/amd64                
  Experimental: false                      

Output of docker info:

Containers: 1
 Running: 0
 Paused: 0
 Stopped: 1
Images: 10
Server Version: 18.03.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.0-6-amd64
Operating System: Debian GNU/Linux 9 (stretch)
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 15.54GiB
Name: risa
ID: LFCE:TKPE:JDFJ:MZ4E:JDRJ:4HCN:BO2D:SBBT:2HGF:KCDW:OROP:RZWZ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: enderer
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

andymadge commented 6 years ago

I'm seeing the same issue, also with a Plex container, and I'm also bind-mounting a network share.

There are a few differences in my situation: I'm using the official Plex docker image, I'm using a macvlan network, and I'm running it with docker-compose.

I'm seeing exactly the same symptoms though.

There are no application logs at all inside the container, and no entries in the container logs either (docker-compose logs).

The container starts normally if I do docker-compose up. The container also starts normally if I restart the docker daemon. The issue only occurs at boot.

If I remove the bind-mounted network share, the container starts normally at boot, so it seems that the issue is the container tries to start before the network share has been mounted.

Therefore I'm not sure whether this constitutes a Docker bug to be honest.
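One quick way to confirm that ordering after a reboot (a sketch, assuming the share is mounted at /mnt/qnap2/multimedia and the mount is managed by systemd) is to compare the mount unit's activation time with the Docker daemon's start time:

findmnt /mnt/qnap2/multimedia            # exits non-zero if the share is not mounted yet
journalctl -b -u docker.service -u mnt-qnap2-multimedia.mount --no-pager

(The mount unit name here assumes systemd's standard path escaping for /mnt/qnap2/multimedia.)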

Excerpt from my docker-compose.yml

version: '3.1'
services:
  plex:
    image: plexinc/pms-docker:plexpass
    restart: unless-stopped
    networks:
      physical:
        ipv4_address: 192.168.20.208
    hostname: pms-docker
    volumes:
      - plex-config:/config
      - plex-temp:/transcode
      - /mnt/qnap2/multimedia:/media
    devices:
      - /dev/dri:/dev/dri

networks:
  physical:
    external: true

volumes:
  plex-config:
  plex-temp:

$ docker-compose ps
Name   Command    State     Ports
---------------------------------
plex   /init     Exit 128    

Error from docker inspect is the same as above:

        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 128,
            "Error": "OCI runtime create failed: container_linux.go:348: starting container process caused \"process_linux.go:402: container init caused \\\"rootfs_linux.go:58: mounting \\\\\\\"/mnt/qnap2/multimedia\\\\\\\" to rootfs \\\\\\\"/var/lib/docker/overlay2/2f7c5ceb2dd5ddb0788aa9272b600edef6a4a0edbf154f8963b7075552e7bd16/merged\\\\\\\" at \\\\\\\"/var/lib/docker/overlay2/2f7c5ceb2dd5ddb0788aa9272b600edef6a4a0edbf154f8963b7075552e7bd16/merged/mnt/qnap2/multimedia\\\\\\\" caused \\\\\\\"no such device\\\\\\\"\\\"\": unknown",
            "StartedAt": "2018-06-14T15:43:24.199564037Z",
            "FinishedAt": "2018-06-14T15:49:17.387003284Z",
            "Health": {
                "Status": "unhealthy",
                "FailingStreak": 0,
            }
        }

I'm on a slightly later Docker version, and I'm on Ubuntu 18.04 LTS.

$ docker version
Client:
 Version:      18.05.0-ce
 API version:  1.37
 Go version:   go1.9.5
 Git commit:   f150324
 Built:        Wed May  9 22:16:13 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.05.0-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   f150324
  Built:        Wed May  9 22:14:23 2018
  OS/Arch:      linux/amd64
  Experimental: false

andymadge commented 6 years ago

This issue can be reproduced with this basic container which bind-mounts a network share:

docker container run -d \
--restart=always \
--name testmount \
-v /mnt/qnap2/multimedia:/media \
busybox ping 8.8.8.8

It gives the same behaviour and same error after reboot.

$ docker inspect testmount -f '{{ .State.Error }}'
OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"rootfs_linux.go:58: mounting \\\"/mnt/qnap2/multimedia\\\" to rootfs \\\"/var/lib/docker/overlay2/4be5925b0d17e6c9c03ddf70ad7108ca184f3f7456599cdb6cfa08713a2af0f2/merged\\\" at \\\"/var/lib/docker/overlay2/4be5925b0d17e6c9c03ddf70ad7108ca184f3f7456599cdb6cfa08713a2af0f2/merged/media\\\" caused \\\"no such device\\\"\"": unknown

If I remove the network bind-mount, then it works and starts correctly after reboot:

docker container run -d \
--restart=always \
--name testnomount \
-v /tmp:/media \
busybox ping 8.8.8.8

Therefore the issue is simply that Docker is attempting to start the container before the mount has completed.

I don't think this can be considered a Docker bug - how is the Docker daemon supposed to know it should wait for the network mount?

I suspect the fix, on a case-by-case basis, is to add an After= rule to the systemd docker.service file.

andymadge commented 6 years ago

The fix for this is to add x-systemd.after=docker.service to the fstab entry. This tells systemd that docker.service shouldn't be started until after the mount has been done.

If the mount fails, the docker server will start as normal.

~~Just for info, my full working entry from `/etc/fstab` is:~~

//qnap2/multimedia /mnt/qnap2/multimedia cifs uid=andym,x-systemd.automount,x-systemd.after=docker.service,credentials=/home/username/.smbcredentials,iocharset=utf8 0 0

I spoke too soon. The above does allow the container to start, but the share isn't actually mounted. The above should not be used.

A working fix is to modify Docker's /lib/systemd/system/docker.service file: add RequiresMountsFor=/mnt/qnap2/multimedia to the [Unit] section.

See https://www.freedesktop.org/software/systemd/man/systemd.unit.html#RequiresMountsFor=

This is not ideal since it requires modifying the Docker service each time a container needs a mount, but it does the job.
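To avoid editing the packaged unit file (which gets overwritten on upgrades), the same directive can also live in a drop-in. A minimal sketch, reusing this thread's share path as the example:

sudo systemctl edit docker.service

# /etc/systemd/system/docker.service.d/override.conf
[Unit]
# Don't start dockerd until the bind-mount source is mounted; adjust the path as needed
RequiresMountsFor=/mnt/qnap2/multimedia

Then: sudo systemctl daemon-reload && sudo systemctl restart docker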

andymadge commented 6 years ago

It seems this is actually a recurrence of a previous issue https://github.com/moby/moby/issues/17485

Repro steps are nearly identical, apart from different mount type.

irsl commented 6 years ago

I'm encountering the same issue. Even though the restart policy of my containers is set to unless-stopped, they don't come up if one of the prerequisite mount points is not available at the time Docker attempts to start them. The retry logic (which otherwise works fine) is not executed. The status is:

            "ExitCode": 255,
            "Error": "OCI runtime create failed: container_linux.go:348: starting container process caused \"process_linux.go:402: container ini
t caused \\\"rootfs_linux.go:58: mounting \\\\\\\"...\\\\\\\" to rootfs \\\\\\\"/var/
lib/docker/overlay2/.../merged\\\\\\\" at \\\\\\\"...\\\\\\\" ca
used \\\\\\\"stat ...: no such file or directory\\\\\\\"\\\"\": unknown",

simonk83 commented 5 years ago

Yep, struggling with this as well at the moment. NFS mount is not setup before Docker starts, so the container doesn't work as expected.

xardalph commented 5 years ago

Hello, same here, but I only use Docker volumes on the same server with docker-compose; I need to restart every project each time.

rishiloyola commented 5 years ago

Why is Docker not trying to restart this container?

vishalmalli commented 5 years ago

Same issue here. If the CIFS share is not mounted, container exits and does not attempt to restart. Container will start fine when started manually once the network share is available.

iroes commented 5 years ago

Something similar happens in my case. I've got an encrypted folder in Synology with automount enabled. Since it's not mounted yet when the Docker service starts, the container doesn't start until I manually bring it up with docker-compose up or via the Synology UI. It doesn't retry even with restart: always set.

Result from docker-compose ps:

  Name                Command                State     Ports 
------------------------------------------------------------
test_bck   /entry.sh supervisord -n - ...   Exit 128         

This is really annoying, since I only use my Synology NAS several hours a day... and I need to start some docker services automatically.

tmtron commented 5 years ago

I see the same issue. My Docker paths are mapped directly to the filesystem of locally attached SSDs.
And in some cases after reboot the containers show Exit 128 and docker does not try to restart them, although restart: always is used.

When I check systemctl status docker, I can see that the docker service is running, but it reports "id already in use".

Docker version 18.09.1, build 4c52b90
docker-compose version 1.23.2, build 1110ad01

Is there a way to force docker to restart the services in this case?

alno74d commented 4 years ago

How is this not fixed??? This is extremely annoying, isn't it?

alno74d commented 4 years ago

I gathered extra information for my case. My docker-compose file:

plex:
    image: linuxserver/plex
    container_name: plex
    runtime: nvidia
    environment:
...

The output of docker inspect:

[
    {
        "Id": "c12c4d426f8f36848fbe1e4807a46cbd570be56b2534768cfc75e76e03b0e083",
        "Created": "2019-11-24T19:53:46.006747643Z",
        "Path": "/init",
        "Args": [],
        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 128,
            "Error": "error gathering device information while adding custom device \"/dev/nvidia-modeset\": no such file or directory",
            "StartedAt": "2019-11-25T08:48:31.115776398Z",
            "FinishedAt": "2019-11-25T08:55:31.358738772Z"
        },
...

And my /lib/systemd/system/docker.service :

[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
BindsTo=containerd.service
After=network-online.target firewalld.service containerd.service
Wants=network-online.target
Requires=docker.socket
RequiresMountsFor=/zdata/media /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia-uvm-tools /dev/nvidia-modeset

Is there a way to wait for the nvidia driver to be properly loaded other than with "RequiresMountsFor"??

Exadra37 commented 4 years ago

The same issue occurred today after running yum update on an AWS server, but unfortunately I have already started the container, so I can no longer inspect it for more details.

In my case the container is from the official Traefik image, has restart set to always, and has several volumes, one of them being /var/run/docker.sock:

version: '2'

services:
  traefik:
    image: traefik:1.7
    restart: always
    ports:
      - 80:80
      - 443:443
    networks:
      - traefik
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /opt/traefik/traefik.toml:/traefik.toml
      - /opt/traefik/acme.json:/acme.json
    container_name: traefik

networks:
  traefik:
    external: true

Can anyone from Docker comment on this issue?

Maybe @andrewhsu, @tiborvass, @thaJeztah or @duglin can help in pointing this issue to anyone that can give a hand here.

cherouvim commented 3 years ago

I had this exact situation. I start my containers using --restart unless-stopped. At some point I updated/upgraded the server (Ubuntu) and then rebooted it. A couple of hours after the reboot, most containers stopped, with Exited (128).

$ docker container list --all
CONTAINER ID        IMAGE                                    COMMAND                  CREATED             STATUS                     PORTS                                      NAMES
9f843f571a17        jrcs/letsencrypt-nginx-proxy-companion   "/bin/bash /app/entr…"   5 months ago        Up 4 hours                                                            letsencrypt
2e2daceaa70b        proxy                                    "/app/docker-entrypo…"   5 months ago        Up 4 hours                 0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp   proxy
5882d5240bbe        foo3                                     "nginx -g 'daemon of…"   12 months ago       Exited (128) 4 hours ago   80/tcp                                     foo5
ace272f67536        foo3                                     "nginx -g 'daemon of…"   12 months ago       Exited (128) 4 hours ago   80/tcp                                     foo4
f89af68a44d6        foo3                                     "nginx -g 'daemon of…"   12 months ago       Exited (128) 4 hours ago   80/tcp                                     foo3
42be6050e8f2        foo2                                     "nginx -g 'daemon of…"   12 months ago       Exited (128) 4 hours ago   80/tcp                                     foo2
5043b220370f        foo1                                     "nginx -g 'daemon of…"   12 months ago       Exited (128) 4 hours ago   80/tcp                                     foo1

After another reboot everything was fixed. Any ideas on why this happened, or where I should look to debug the situation?

thaJeztah commented 3 years ago

From a quick glance at the errors mentioned, it looks like all cases are trying to bind-mount an extra disk that is not yet available at the moment Docker starts, as commented above as well: https://github.com/docker/for-linux/issues/293#issuecomment-397398781

runtime create failed: container_linux.go:348:
starting container process caused process_linux.go:402:
container init caused "rootfs_linux.go:58:
mounting "/mnt/tanagra/public"
to rootfs "/var/lib/docker/overlay2/6a990b540b574977de4d0b6197b3b033e4ab6890813eb592058d005db70337be/merged"
at "/var/lib/docker/overlay2/6a990b540b574977de4d0b6197b3b033e4ab6890813eb592058d005db70337be/merged/tanagra/public"
caused "no such device" "": unknown"
OCI runtime create failed: container_linux.go:348:
starting container process caused process_linux.go:402:
container init caused "rootfs_linux.go:58:
mounting "/mnt/qnap2/multimedia"
to rootfs "/var/lib/docker/overlay2/2f7c5ceb2dd5ddb0788aa9272b600edef6a4a0edbf154f8963b7075552e7bd16/merged"
at "/var/lib/docker/overlay2/2f7c5ceb2dd5ddb0788aa9272b600edef6a4a0edbf154f8963b7075552e7bd16/merged/mnt/qnap2/multimedia"
caused "no such device" "": unknown"

I think the reason the daemon might not continue trying is that it requires the container to start successfully once before it will start monitoring the container (to handle restarting the container once it exits). I seem to recall this was done to prevent situations where (e.g. similar to what's discussed here) a "broken" container configuration causes a DoS of the whole daemon.
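For instance, a quick way to see that the restart machinery never engaged for an affected container (a sketch, using this issue's container name) is:

docker inspect -f 'RestartCount={{.RestartCount}} ExitCode={{.State.ExitCode}}' plex

which shows RestartCount=0, matching the original report.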

Perhaps the best solution is to create a systemd drop-in file to delay starting the docker service until after the required mounts are present, similar to https://github.com/containerd/containerd/pull/3741

This thread on reddit https://www.reddit.com/r/linuxadmin/comments/5z819x/how_to_have_a_systemd_service_wait_for_a_network/ also mentions global.mount and remote-fs.target, which may be relevant for the NFS shares.
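For remote shares specifically, a minimal drop-in along those lines might look like this (a sketch only; remote-fs.target is the systemd target that groups remote filesystem mounts, and the file name is just an example):

# /etc/systemd/system/docker.service.d/wait-for-remote-fs.conf
[Unit]
# Don't start dockerd until remote filesystems (e.g. NFS/CIFS entries from fstab) are mounted
After=remote-fs.target
Wants=remote-fs.target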

thaJeztah commented 3 years ago

Some details in https://www.freedesktop.org/software/systemd/man/systemd.mount.html

Apollo3zehn commented 3 years ago

My "solution" so far is to create a cron job and let that restart the container until the mounted drive is available:

SHELL=/snap/bin/pwsh

@reboot root <path>/autorestart.ps1

Copy that crontab snippet to a file under /etc/cron.d.

autorestart.ps1 is a PowerShell script, but it can easily be replaced by another script. The content is:

$isRunning = (docker inspect -f '{{.State.Running}}' <mycontainer>) | Out-String

while ($isRunning.TrimEnd() -ne "true")
{
    "Container is not running. Starting container ..."
    docker container start <mycontainer>
    Start-Sleep -Seconds 10
    $isRunning = (docker inspect -f '{{.State.Running}}' <mycontainer>) | Out-String
}

"Done."

mattdale77 commented 3 years ago

I am experiencing this same issue on Ubuntu 20.04 (and just upgraded to 21, same issue) using systemd. The shares in question are from VirtualBox. My containers start up fine, as they have access to their application configuration on /home; however, they cannot access the shares for the data they need to function. The containers actually bind to the directory under the mount point and use up ghost space on the root filesystem (which was very tricky to track down).

I have tried the RequiresMountsFor directive but it does not resolve the issue.

kkretsch commented 2 years ago

I had the same trouble with a simple docker compose file for Loki without any remote folders. It seemed to fail just mounting a local file, quoting something about mounting through /proc.

I therefore created my own systemd startup file for Docker, which now seems to run even after I've rebooted:

I changed/added these two lines:

Requires=docker.socket containerd.service local-fs.target
RequiresMountsFor=/proc

Full file for reference is here:

root@logger:/etc/systemd/system# cat docker.service 
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target firewalld.service containerd.service
Wants=network-online.target
Requires=docker.socket containerd.service local-fs.target
RequiresMountsFor=/proc

[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
ExecReload=/bin/kill -s HUP $MAINPID
TimeoutSec=0
RestartSec=2
Restart=always

# Note that StartLimit* options were moved from "Service" to "Unit" in systemd 229.
# Both the old, and new location are accepted by systemd 229 and up, so using the old location
# to make them work for either version of systemd.
StartLimitBurst=3

# Note that StartLimitInterval was renamed to StartLimitIntervalSec in systemd 230.
# Both the old, and new name are accepted by systemd 230 and up, so using the old name to make
# this option work for either version of systemd.
StartLimitInterval=60s

# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity

# Comment TasksMax if your systemd version does not support it.
# Only systemd 226 and above support this option.
TasksMax=infinity

# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes

# kill only the docker process, not all processes in the cgroup
KillMode=process
OOMScoreAdjust=-500

[Install]
WantedBy=multi-user.target

GeorgeMacFly commented 1 year ago

systemctl status docker

Did you resolve this issue?

Please comment.

mattdale77 commented 1 year ago

I don't remember the details, as I no longer use VirtualBox, but I solved this by changing the systemd priorities. I think I held Docker back until the automount was complete, or I put a sleep in a startup script. I'm sorry I can't remember the details, but the solution lies in systemd.

Majestic7979 commented 10 months ago

I have this issue on a local bind mount, not a network share, so it's definitely not just that situation. Only one container does this, and I'm not sure why. I have restart=always on it; it still doesn't retry.

zapotocnylubos commented 8 months ago

Experiencing the same problem with linuxserver/tvheadend, just a local bind volume for recordings. Ubuntu 22.04.3 LTS

tmeuze commented 7 months ago

Having the same issue on Debian 12 and Vaultwarden - local binds only. Unfortunately, the fix suggested by @kkretsch did not work.

Oddly, I have both Vaultwarden and vaultwarden-backup in the same compose file, binding the same local directory (vaultwarden-backup has two additional unrelated binds), yet only Vaultwarden exits with 128 on every reboot; the other container starts up just fine.

On a separate host (Debian 11), I'm having the same issue with Traefik (sporadically, by contrast). In this case as well, multiple additional containers share a common local bind. However, testing without multiple containers binding a common directory yields inconsistent results for me.

dodancs commented 7 months ago

Ubuntu 22.04, Docker 25.0.0, build e758fe5: this is still an issue. For me it happens with any container that has restart=always.

vimoxshah commented 5 months ago

I have the same issue with Ubuntu 22.04.1 and Docker version 24.0.5. Any solution?

NAM1025 commented 5 months ago

Just going to throw my "I have the same issue" out there. This is incredibly frustrating...

I've also tried mounting the drive via /etc/fstab, but if a Docker container references it, even with RequiresMountsFor=/some/path in the systemd config, the drive mount fails. I've confirmed this by removing the container and rebooting (the drive mounts fine), then restarting the container and rebooting (it fails to mount again). I'm at a complete loss...

The only work-around I have found is to delay docker from starting.

sudo systemctl edit docker.service

Add

### Editing /etc/systemd/system/docker.service.d/override.conf
### Anything between here and the comment below will become the new contents of the file

[Service]
ExecStartPre=/bin/sleep 30

....

This isn't a foolproof fix, though; there is definitely still a chance things will fail to load properly.
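A slightly more targeted variant (a sketch, assuming the bind-mount source is /mnt/mydrive and that the util-linux mountpoint command is available) polls for the mount instead of sleeping for a fixed time:

[Service]
# Block dockerd's startup until the bind-mount source is actually a mountpoint
ExecStartPre=/bin/sh -c 'until mountpoint -q /mnt/mydrive; do sleep 1; done'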

FaySmash commented 3 months ago

That's how I solved it: https://gitlab.com/-/snippets/3715249

KhalilSecurity commented 1 month ago

(quoting @NAM1025's workaround above: sudo systemctl edit docker.service with an ExecStartPre=/bin/sleep 30 override)

Truly wonderful fix. I was distro-hopping and ended up with openSUSE. It uses NetworkManager by default, and I assume it has a delay with DHCP or something else, causing Pi-hole to exit with code 128, because I bind port 53 to the host IP. The error message I got was:

"Error": "driver failed programming external connectivity on endpoint pihole
 (5ae32e1e4aeee78efab94c2d638e29d918eeef7c355b95c907f7121f293c080a): 
Error starting userland proxy: listen tcp4 172.20.0.20:53: bind: cannot assign requested address",
"StartedAt": "2024-08-02T04:42:00.906011788Z",
"FinishedAt": "2024-08-02T04:54:18.75070572Z",

I used your code, with a delay of only 10 seconds, and it worked flawlessly.

[Service]
ExecStartPre=/bin/sleep 10

Thank you