docker / for-linux

Docker Engine for Linux
https://docs.docker.com/engine/installation/

Synchronization issue between containerd and Docker #155

Open akalipetis opened 6 years ago

akalipetis commented 6 years ago

Expected behavior

Exec-ing into a running container should always work

# docker exec -it $CONTAINER_ID sh

Actual behavior

# docker exec -it a62f2e2cac55 redis-cli monitor
containerd: container not found
# docker ps
CONTAINER ID        IMAGE                               COMMAND                  CREATED             STATUS              PORTS               NAMES
...
a62f2e2cac55        redis:3.2-alpine                    "docker-entrypoint..."   13 hours ago        Up 13 hours         6379/tcp            redis.1.2j2thnb38vta8wr8nogpsnjzc
...

Steps to reproduce the behavior

I'm running a single-node 17.10.0-ce Docker Swarm cluster. After a service restarted, a new container was spawned, but it was never in sync with containerd.
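
For what it's worth, the restart that produced this container should be visible in the service's task history. Assuming the affected service is the redis one from the docker ps output above, something like:

# docker service ps --no-trunc redis

would show the failed/shutdown task that preceded the out-of-sync container.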

Output of docker version:

# docker version
Client:
 Version:      17.10.0-ce
 API version:  1.33
 Go version:   go1.8.3
 Git commit:   f4ffd25
 Built:        Tue Oct 17 19:04:16 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.10.0-ce
 API version:  1.33 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   f4ffd25
 Built:        Tue Oct 17 19:02:56 2017
 OS/Arch:      linux/amd64
 Experimental: true

Output of docker info:

root@eldeco-tng:~# docker info
Containers: 55
 Running: 7
 Paused: 0
 Stopped: 48
Images: 33
Server Version: 17.10.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 230
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: uz88iro5wpjht23ncnea52wnn
 Is Manager: true
 ClusterID: mcm76htotvbbvn4yowpqdsgts
 Managers: 1
 Nodes: 1
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 10.19.0.5
 Manager Addresses:
  10.19.0.5:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 0351df1c5a66838d0c392b4ac4cf9450de844e2d
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-98-generic
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.953GiB
Name: eldeco-tng
ID: DDH6:NGHL:6UJ5:TL3G:YG7H:7K5M:DTAS:UZON:SML7:NQ7T:TFOT:7OMV
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.)

Ubuntu 16.04 on DigitalOcean.

thaJeztah commented 6 years ago

ping @mlaventure PTAL

mlaventure commented 6 years ago
akalipetis commented 6 years ago

This was a production cluster, so I couldn't switch to debug mode; I rebooted to get it back to normal.

I could see the container as running within Docker, but there was no related PID on the system, and it seems like containerd did not have a process for it either.
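
In case it helps anyone checking for the same mismatch: the PID that Docker thinks the container has, and whether a containerd shim is still around for it, can be verified with something like the following (container ID taken from the output above; docker-containerd-shim should be the shim process name on this Docker version):

# docker inspect --format '{{.State.Pid}}' a62f2e2cac55
# ps -p $(docker inspect --format '{{.State.Pid}}' a62f2e2cac55)
# ps aux | grep '[d]ocker-containerd-shim' | grep a62f2e2cac55

If the first command prints a PID that the second one cannot find, and no shim shows up, dockerd's view has diverged from what is actually running on the host.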

I don't have a good way to reproduce this unfortunately. The system was under memory stress before this happened, so this might be related.

thaJeztah commented 6 years ago

> The system was under memory stress before this happened, so this might be related.

Could be that the process was OOM-killed by the kernel.
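
If the kernel logs from around that time are still available, an OOM kill should show up in them, e.g.:

# dmesg -T | grep -i -E 'out of memory|oom|killed process'
# journalctl -k | grep -i -E 'oom|killed process'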

mlaventure commented 6 years ago

Docker should have still received the message if it was an OOM kill.
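
As a point of reference, whether dockerd recorded an OOM for a container can be checked from its state, e.g. for the container in this report:

# docker inspect --format 'OOMKilled={{.State.OOMKilled}} Status={{.State.Status}} ExitCode={{.State.ExitCode}}' a62f2e2cac55

In this case docker ps still showed the container as running, so presumably no such event made it through.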

Without debug logs it's hard to make an educated guess. But maybe we should update the daemon to automatically remove a container from the running list if it gets a "not found" from containerd.

akalipetis commented 6 years ago

> But maybe we should update the daemon to automatically remove a container from the running list if it gets a "not found" from containerd.

I believe this is pretty safe and would be fine as a first step, given that this is not something that happens very often.

benbc commented 6 years ago

Could this be related to https://github.com/moby/moby/pull/36173? We see that problem (which manifests itself as dockerd being unable to communicate with containerd) and it's closely associated with OOM-killing events.
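
If it is the same problem, there will usually be related errors in the daemon logs around the OOM events; on a systemd host like the Ubuntu 16.04 machine above they can be pulled with, for example:

# journalctl -u docker.service | grep -i -E 'containerd|transport is closing|connection refused'

(The grep patterns here are only examples of the kind of errors to look for, not exact messages from this report.)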