docker / for-linux

Docker Engine for Linux
https://docs.docker.com/engine/installation/
746 stars 85 forks source link

Docker inspect hangs, docker exec hangs and docker kill doesn't work #543

Open mmontagna opened 5 years ago

mmontagna commented 5 years ago

Expected behavior

Actual behavior

docker inspect, docker exec and docker kill hang when run on some containers.

Steps to reproduce the behavior

We are seeing this happen occasionally on some of our servers on random containers.

We initially thought it might be related to #38478 (https://github.com/moby/moby/issues/38478#issuecomment-451526797) but this probably a different issue as we don't see any permission denied messages when attempting to kill the container the command just hangs.

If we kill the containerd-shim process the container stops and the docker daemon appears to recover.

We also do not believe it is related to https://github.com/moby/moby/pull/37831 as our kernel version is 4.4.0-1069-aws which shouldn't be effected?

Output of docker version:

Client:
 Version:           18.09.0
 API version:       1.39
 Go version:        go1.10.4
 Git commit:        4d60db4
 Built:             Wed Nov  7 00:48:57 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.0
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.4
  Git commit:       4d60db4
  Built:            Wed Nov  7 00:16:44 2018
  OS/Arch:          linux/amd64
  Experimental:     false

Output of docker info:

Containers: 5
 Running: 4
 Paused: 0
 Stopped: 1
Images: 10
Server Version: 18.09.0
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: syslog
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: c4446665cb9c30056f4998ed953e6d4ff22c7c39
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-1069-aws
Operating System: Ubuntu 16.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 31.42GiB
Name: ecs-XXX
ID: 72SQ:FNFB:BAFB:TAJB:D6FI:MKJ7:ZYXR:X6JX:PZMP:TLU7:RSQW:TWQK
Docker Root Dir: /tmp/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

Additional environment details (AWS, VirtualBox, physical, etc.)

This is on AWS EC2/ECS hosts.

mmontagna commented 5 years ago

Here is the dockerd stack obtained via curl --unix-socket /var/run/docker.sock http://./debug/pprof/goroutine?debug=2 if relevant https://gist.github.com/mmontagna/0e4a4151b390e39b7a2197ecc4a31d32

tgross commented 5 years ago

I'm a colleague of @mmontagna and I did some digging through these stack traces.

There are a large number of goroutines stuck in semacquire for long periods of time (with the largest cluster all around 5600+ minutes). The mutex is the one on the container.(*State) struct. Examples:

goroutine 115 [semacquire, 5611 minutes] ``` goroutine 115 [semacquire, 5611 minutes]: sync.runtime_Semacquire(0xc421056a74) /usr/local/go/src/runtime/sema.go:56 +0x3b sync.(*RWMutex).RLock(0xc421056a68) /usr/local/go/src/sync/rwmutex.go:50 +0x4b github.com/docker/docker/daemon/exec.(*Store).List(0xc421056a60, 0x0, 0x0, 0x0) /go/src/github.com/docker/docker/daemon/exec/exec.go:140 +0x4d github.com/docker/docker/daemon.(*Daemon).containerExecIds(0xc4208e01e0, 0xc4212def50) /go/src/github.com/docker/docker/daemon/exec.go:319 +0xe9 github.com/docker/docker/daemon.(*Daemon).execCommandGC(0xc4208e01e0) /go/src/github.com/docker/docker/daemon/exec.go:296 +0x17f created by github.com/docker/docker/daemon.NewDaemon /go/src/github.com/docker/docker/daemon/daemon.go:980 +0x2b17 ... (truncated) ```
goroutine 62324236 [semacquire, 3708 minutes] ``` goroutine 62324236 [semacquire, 3708 minutes]: sync.runtime_SemacquireMutex(0xc4207f6204, 0x0) /usr/local/go/src/runtime/sema.go:71 +0x3f sync.(*Mutex).Lock(0xc4207f6200) /usr/local/go/src/sync/mutex.go:134 +0x10a github.com/docker/docker/daemon.(*Daemon).ContainerInspectCurrent(0xc4208e01e0, 0xc430520376, 0x3c, 0x0, 0xc44382e500, 0xc44382e5a0, 0xc42ab5a8b0) /go/src/github.com/docker/docker/daemon/inspect.go:40 +0x8e github.com/docker/docker/daemon.(*Daemon).ContainerInspect(0xc4208e01e0, 0xc430520376, 0x3c, 0x0, 0xc430520366, 0x4, 0x200, 0x8d, 0x8e, 0xc4212b87c8) /go/src/github.com/docker/docker/daemon/inspect.go:29 +0x11d github.com/docker/docker/api/server/router/container.(*containerRouter).getContainersByName(0xc420189e40, 0x5596ce474d60, 0xc44919a390, 0x5596ce473160, 0xc44d3500e0, 0xc4461cd300, 0xc43c5d2870, 0x5596cd7fa790, 0x5) /go/src/github.com/docker/docker/api/server/router/container/inspect.go:15 +0x119 github.com/docker/docker/api/server/router/container.(*containerRouter).(github.com/docker/docker/api/server/router/container.getContainersByName)-fm(0x5596ce474d60, 0xc44919a390, 0x5596ce473160, 0xc44d3500e0, 0xc4461cd300, 0xc43c5d2870, 0x5596cc3664bc, 0x5596ce2b4000) /go/src/github.com/docker/docker/api/server/router/container/container.go:39 +0x6b github.com/docker/docker/api/server/middleware.ExperimentalMiddleware.WrapHandler.func1(0x5596ce474d60, 0xc44919a390, 0x5596ce473160, 0xc44d3500e0, 0xc4461cd300, 0xc43c5d2870, 0x5596ce474d60, 0xc44919a390) /go/src/github.com/docker/docker/api/server/middleware/experimental.go:26 +0xda github.com/docker/docker/api/server/middleware.VersionMiddleware.WrapHandler.func1(0x5596ce474d60, 0xc44919a360, 0x5596ce473160, 0xc44d3500e0, 0xc4461cd300, 0xc43c5d2870, 0x0, 0x100) /go/src/github.com/docker/docker/api/server/middleware/version.go:62 +0x401 github.com/docker/docker/pkg/authorization.(*Middleware).WrapHandler.func1(0x5596ce474d60, 0xc44919a360, 0x5596ce473160, 0xc44d3500e0, 0xc4461cd300, 0xc43c5d2870, 0x5596ce474d60, 0xc44919a360) /go/src/github.com/docker/docker/pkg/authorization/middleware.go:59 +0x7ab github.com/docker/docker/api/server.(*Server).makeHTTPHandler.func1(0x5596ce473160, 0xc44d3500e0, 0xc4461cd300) /go/src/github.com/docker/docker/api/server/server.go:141 +0x19a net/http.HandlerFunc.ServeHTTP(0xc4208eb340, 0x5596ce473160, 0xc44d3500e0, 0xc4461cd300) /usr/local/go/src/net/http/server.go:1947 +0x46 github.com/docker/docker/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc42018bb80, 0x5596ce473160, 0xc44d3500e0, 0xc4461cd300) /go/src/github.com/docker/docker/vendor/github.com/gorilla/mux/mux.go:103 +0x228 github.com/docker/docker/api/server.(*routerSwapper).ServeHTTP(0xc420c76a20, 0x5596ce473160, 0xc44d3500e0, 0xc4461cd300) /go/src/github.com/docker/docker/api/server/router_swapper.go:29 +0x72 net/http.serverHandler.ServeHTTP(0xc4201b9380, 0x5596ce473160, 0xc44d3500e0, 0xc4461cd300) /usr/local/go/src/net/http/server.go:2697 +0xbe net/http.(*conn).serve(0xc455728b40, 0x5596ce474ca0, 0xc423206740) /usr/local/go/src/net/http/server.go:1830 +0x653 created by net/http.(*Server).Serve /usr/local/go/src/net/http/server.go:2798 +0x27d ```

Realizing something must be blocking these I went through the goroutines that were in select and found only 1 of them that matched the approximate lock time:

goroutine 59097853 [select, 5613 minutes] ``` goroutine 59097853 [select, 5613 minutes]: github.com/docker/docker/vendor/google.golang.org/grpc/transport.(*Stream).waitOnHeader(0xc420948c30, 0x10, 0xc429171270) /go/src/github.com/docker/docker/vendor/google.golang.org/grpc/transport/transport.go:222 +0x101 github.com/docker/docker/vendor/google.golang.org/grpc/transport.(*Stream).RecvCompress(0xc420948c30, 0x5596ce43f400, 0xc429171330) /go/src/github.com/docker/docker/vendor/google.golang.org/grpc/transport/transport.go:233 +0x2d github.com/docker/docker/vendor/google.golang.org/grpc.(*csAttempt).recvMsg(0xc4255e4000, 0x5596ce2a2aa0, 0xc42181e558, 0x0, 0x0) /go/src/github.com/docker/docker/vendor/google.golang.org/grpc/stream.go:515 +0x63b github.com/docker/docker/vendor/google.golang.org/grpc.(*clientStream).RecvMsg(0xc421390100, 0x5596ce2a2aa0, 0xc42181e558, 0x0, 0x0) /go/src/github.com/docker/docker/vendor/google.golang.org/grpc/stream.go:395 +0x45 github.com/docker/docker/vendor/google.golang.org/grpc.invoke(0x5596ce474d60, 0xc44335b1d0, 0x5596cd844f6c, 0x27, 0x5596ce2ae880, 0xc420731580, 0x5596ce2a2aa0, 0xc42181e558, 0xc4209a6000, 0xc42096a6e0, ...) /go/src/github.com/docker/docker/vendor/google.golang.org/grpc/call.go:83 +0x185 github.com/docker/docker/vendor/github.com/containerd/containerd.namespaceInterceptor.unary(0x5596cd7f97fb, 0x4, 0x5596ce474ce0, 0xc42003e038, 0x5596cd844f6c, 0x27, 0x5596ce2ae880, 0xc420731580, 0x5596ce2a2aa0, 0xc42181e558, ...) /go/src/github.com/docker/docker/vendor/github.com/containerd/containerd/grpc.go:35 +0xf6 github.com/docker/docker/vendor/github.com/containerd/containerd.(namespaceInterceptor).(github.com/docker/docker/vendor/github.com/containerd/containerd.unary)-fm(0x5596ce474ce0, 0xc42003e038, 0x5596cd844f6c, 0x27, 0x5596ce2ae880, 0xc420731580, 0x5596ce2a2aa0, 0xc42181e558, 0xc4209a6000, 0x5596ce43f470, ...) /go/src/github.com/docker/docker/vendor/github.com/containerd/containerd/grpc.go:51 +0xf6 github.com/docker/docker/vendor/google.golang.org/grpc.(*ClientConn).Invoke(0xc4209a6000, 0x5596ce474ce0, 0xc42003e038, 0x5596cd844f6c, 0x27, 0x5596ce2ae880, 0xc420731580, 0x5596ce2a2aa0, 0xc42181e558, 0x0, ...) /go/src/github.com/docker/docker/vendor/google.golang.org/grpc/call.go:35 +0x10b github.com/docker/docker/vendor/google.golang.org/grpc.Invoke(0x5596ce474ce0, 0xc42003e038, 0x5596cd844f6c, 0x27, 0x5596ce2ae880, 0xc420731580, 0x5596ce2a2aa0, 0xc42181e558, 0xc4209a6000, 0x0, ...) /go/src/github.com/docker/docker/vendor/google.golang.org/grpc/call.go:60 +0xc3 github.com/docker/docker/vendor/github.com/containerd/containerd/api/services/tasks/v1.(*tasksClient).Get(0xc42181e528, 0x5596ce474ce0, 0xc42003e038, 0xc420731580, 0x0, 0x0, 0x0, 0x5596cc4955a4, 0x5596ce253a60, 0xc420730d40) /go/src/github.com/docker/docker/vendor/github.com/containerd/containerd/api/services/tasks/v1/tasks.pb.go:448 +0xd4 github.com/docker/docker/vendor/github.com/containerd/containerd.(*process).Status(0xc446a4f500, 0x5596ce474ce0, 0xc42003e038, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x5596ce452420, ...) /go/src/github.com/docker/docker/vendor/github.com/containerd/containerd/process.go:220 +0x11b github.com/docker/docker/vendor/github.com/containerd/containerd.(*process).Delete(0xc446a4f500, 0x5596ce474ce0, 0xc42003e038, 0x0, 0x0, 0x0, 0xc421c74840, 0x5596ce487ac0, 0xc446a4f500) /go/src/github.com/docker/docker/vendor/github.com/containerd/containerd/process.go:196 +0x119 github.com/docker/docker/libcontainerd.(*client).Exec(0xc4207c2540, 0x5596ce474d20, 0xc4228d08a0, 0xc4207078c0, 0x40, 0xc442ec9f80, 0x40, 0xc4209601a0, 0x0, 0xc420fb4a30, ...) /go/src/github.com/docker/docker/libcontainerd/client_daemon.go:387 +0x541 github.com/docker/docker/daemon.(*Daemon).ContainerExecStart(0xc4208e01e0, 0x5596ce474d20, 0xc4228d08a0, 0xc442ec9f80, 0x40, 0x0, 0x0, 0x5596ce450440, 0xc421240480, 0x5596ce450440, ...) /go/src/github.com/docker/docker/daemon/exec.go:251 +0xb4f github.com/docker/docker/daemon.(*cmdProbe).run(0xc420a8b6b4, 0x5596ce474d20, 0xc4228d08a0, 0xc4208e01e0, 0xc4200e9d40, 0x0, 0xc43c83ca80, 0xc42859af60) /go/src/github.com/docker/docker/daemon/health.go:97 +0x63e github.com/docker/docker/daemon.monitor.func1(0x5596ce450420, 0xc420a8b6b4, 0x5596ce474d20, 0xc4228d08a0, 0xc4208e01e0, 0xc4200e9d40, 0xc421c75800, 0xbf027b28440bb2d0, 0x2f5911231e109, 0x5596cf060de0) /go/src/github.com/docker/docker/daemon/health.go:202 +0xa8 created by github.com/docker/docker/daemon.monitor /go/src/github.com/docker/docker/daemon/health.go:200 +0x2ff ```

I followed this stack trace through and found this recent commit https://github.com/moby/moby/commit/332f134890246cfc73703b2911c9fdc20e063096 which was merged in https://github.com/moby/moby/pull/38383 and https://github.com/docker/engine/pull/140.

I'm having trouble finding the original description of the deadlocking issue described by these PRs. Does what we're seeing here match up with that issue? If so, am I correct in thinking this should be fixed in 18.09.1? And is there a timeline for stable release if so? Thank you!

thaJeztah commented 5 years ago

@tgross if you have an environment to test/reproduce; could you try with Docker 18.09.1 ? Docker 18.09.1 comes with updated versions of containerd and runc (and I recall there was a fix that fixed hangs)

tgross commented 5 years ago

We haven't been able to repro it reliably, which is why I was asking about what issue was origin of those commits because it might help us develop a repro.

But glad to see 18.09.1 just shipped... I missed that it got tagged a few hours before my last comment. At worst we can give that a spin and try to soak test it a bit in one of our environments.

thaJeztah commented 5 years ago

which is why I was asking about what issue was origin of those commits because it might help us develop a repro.

Sorry; looks like I overlooked the part about those commits.

I followed this stack trace through and found this recent commit moby/moby@332f134 which was merged in moby/moby#38383 and docker/engine#140.

Yes, https://github.com/moby/moby/pull/38383 was one of the changes; another one was https://github.com/moby/moby/pull/38376 / https://github.com/docker/engine/pull/138 (which brought in https://github.com/containerd/containerd/pull/2877 / https://github.com/containerd/containerd/pull/2878)

I'm having trouble finding the original description of the deadlocking issue described by these PRs. Does what we're seeing here match up with that issue? If so, am I correct in thinking this should be fixed in 18.09.1? And is there a timeline for stable release if so? Thank you!

Yes, the description is a bit scarse; there's some internal issues attached:

The trouble was reproducing the issue as it took days before it occured, and the containerd-shim has limited debugging options; signaling the process could also unstuck the process.

From

The QA core where they reproduced the issue yesterday has no longer the issue. They are trying to reproduce again but it might take a day or so, .. (redacted) .. in other words when we have a go routine in the engine stuck on ContainerExecStart and looks similar to this (this can be see in the stack dump):

> ``` > goroutine 282461 [select, 3073 minutes]: > github.com/docker/docker/vendor/google.golang.org/grpc/transport.(*Stream).waitOnHeader(0xc42171a0f0, 0x10, 0xc421df3270) > /go/src/github.com/docker/docker/vendor/google.golang.org/grpc/transport/transport.go:222 +0x101 > github.com/docker/docker/vendor/google.golang.org/grpc/transport.(*Stream).RecvCompress(0xc42171a0f0, 0x56325e1ed568, 0xc421df3330) > /go/src/github.com/docker/docker/vendor/google.golang.org/grpc/transport/transport.go:233 +0x2d > github.com/docker/docker/vendor/google.golang.org/grpc.(*csAttempt).recvMsg(0xc422cecf70, 0x56325e03d5a0, 0xc42000e148, 0x0, 0x0) > /go/src/github.com/docker/docker/vendor/google.golang.org/grpc/stream.go:515 +0x63b > github.com/docker/docker/vendor/google.golang.org/grpc.(*clientStream).RecvMsg(0xc422a9e280, 0x56325e03d5a0, 0xc42000e148, 0x0, 0x0) > /go/src/github.com/docker/docker/vendor/google.golang.org/grpc/stream.go:395 +0x45 > github.com/docker/docker/vendor/google.golang.org/grpc.invoke(0x56325e224f40, 0xc4223f0f00, 0x56325d4daa07, 0x27, 0x56325e049e00, 0xc420f32940, 0x56325e03d5a0, 0xc42000e148, 0xc420a1e000, 0xc4209e82a0, ...) > /go/src/github.com/docker/docker/vendor/google.golang.org/grpc/call.go:83 +0x185 > github.com/docker/docker/vendor/github.com/containerd/containerd.namespaceInterceptor.unary(0x56325d489a03, 0x4, 0x56325e224ec0, 0xc4200ea040, 0x56325d4daa07, 0x27, 0x56325e049e00, 0xc420f32940, 0x56325e03d5a0, 0xc42000e148, ...) > /go/src/github.com/docker/docker/vendor/github.com/containerd/containerd/grpc.go:35 +0xf6 > github.com/docker/docker/vendor/github.com/containerd/containerd.(namespaceInterceptor).(github.com/docker/docker/vendor/github.com/containerd/containerd.unary)-fm(0x56325e224ec0, 0xc4200ea040, 0x56325d4daa07, 0x27, 0x56325e049e00, +++0xc420f32940, 0x56325e03d5a0, 0xc42000e148, 0xc420a1e000, 0x56325e1ed5d8, ...) > /go/src/github.com/docker/docker/vendor/github.com/containerd/containerd/grpc.go:51 +0xf6 > github.com/docker/docker/vendor/google.golang.org/grpc.(*ClientConn).Invoke(0xc420a1e000, 0x56325e224ec0, 0xc4200ea040, 0x56325d4daa07, 0x27, 0x56325e049e00, 0xc420f32940, 0x56325e03d5a0, 0xc42000e148, 0x0, ...) > /go/src/github.com/docker/docker/vendor/google.golang.org/grpc/call.go:35 +0x10b > github.com/docker/docker/vendor/google.golang.org/grpc.Invoke(0x56325e224ec0, 0xc4200ea040, 0x56325d4daa07, 0x27, 0x56325e049e00, 0xc420f32940, 0x56325e03d5a0, 0xc42000e148, 0xc420a1e000, 0x0, ...) > /go/src/github.com/docker/docker/vendor/google.golang.org/grpc/call.go:60 +0xc3 > github.com/docker/docker/vendor/github.com/containerd/containerd/api/services/tasks/v1.(*tasksClient).Get(0xc42000e140, 0x56325e224ec0, 0xc4200ea040, 0xc420f32940, 0x0, 0x0, 0x0, 0x56325bf2cfb6, 0x56325dfec060, 0xc420f32920) > /go/src/github.com/docker/docker/vendor/github.com/containerd/containerd/api/services/tasks/v1/tasks.pb.go:448 +0xd4 > github.com/docker/docker/vendor/github.com/containerd/containerd.(*process).Status(0xc423bb59b0, 0x56325e224ec0, 0xc4200ea040, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x56325e201900, ...) > /go/src/github.com/docker/docker/vendor/github.com/containerd/containerd/process.go:220 +0x11b > github.com/docker/docker/vendor/github.com/containerd/containerd.(*process).Delete(0xc423bb59b0, 0x56325e224ec0, 0xc4200ea040, 0x0, 0x0, 0x0, 0xc4210150e0, 0x56325e2382c0, 0xc423bb59b0) > /go/src/github.com/docker/docker/vendor/github.com/containerd/containerd/process.go:196 +0x119 > github.com/docker/docker/libcontainerd.(*client).Exec(0xc4206ac4d0, 0x56325e224f00, 0xc42091a060, 0xc420b47b00, 0x40, 0xc420b46140, 0x40, 0xc422cec340, 0x0, 0xc42236c220, ...) > /go/src/github.com/docker/docker/libcontainerd/client_daemon.go:387 +0x541 > github.com/docker/docker/daemon.(*Daemon).ContainerExecStart(0xc4205de800, 0x56325e224f00, 0xc42091a060, 0xc420b46140, 0x40, 0x0, 0x0, 0x56325e1ff780, 0xc422a9e080, 0x56325e1ff780, ...) > /go/src/github.com/docker/docker/daemon/exec.go:251 +0xb4f > github.com/docker/docker/daemon.(*cmdProbe).run(0xc420e15c24, 0x56325e224f00, 0xc42091a060, 0xc4205de800, 0xc420717900, 0xc422cc86e0, 0x56325bcf2950, 0xc4217a6c00) > /go/src/github.com/docker/docker/daemon/health.go:97 +0x63e > github.com/docker/docker/daemon.monitor.func1(0x56325e1ff760, 0xc420e15c24, 0x56325e224f00, 0xc42091a060, 0xc4205de800, 0xc420717900, 0xc42091acc0, 0xbef773fb45103143, 0x3af7f96f023, 0x56325eead3c0) > /go/src/github.com/docker/docker/daemon/health.go:202 +0xa8 > created by github.com/docker/docker/daemon.monitor > /go/src/github.com/docker/docker/daemon/health.go:200 +0x2ff > ```

we need to generate a stack dump from containerd POV to see what's going on (the stack will be in /var/log/message);

kill -s USR1 $(pidof containerd)

That stacktrace looks the same as the last one you posted, so I think the 18.09.1 update may resolve the situation.

Note that there's still work ongoing to root-cause the problem; the current patch (https://github.com/containerd/containerd/pull/2877) is to prevent the system locking up.

HTH 👍

thaJeztah commented 5 years ago

@dani-docker @tonistiigi is there any information that could be helpful (if you're still root-causing the issue)?

tgross commented 5 years ago

FWIW @thaJeztah all three of those QA/report links 404 for me (probably private-to-the-org?)

tonistiigi commented 5 years ago

This issue appeared when exec/healthcheck is canceled or receives an error in a specific timeframe when containerd has launched the shim process but has not opened fifos for the stdio communication yet.

thaJeztah commented 5 years ago

all three of those QA/report links 404 for me (probably private-to-the-org?

Ah, yes, I deliberately didn't posted the as links, but GitHub made them links still. Yes; those are the internal reference numbers of those reports 😅

frankui commented 4 years ago

We also met the similar issue on 18.09.9 docker-ce with containerd 1.2.10 and kernel is 4.19.0-041900-generic. Not all containers but some of them will hit the hang issue.

The container process and shim process are alive as pstree shows but the docker inspect/exec/stop hang when docker ps can still find the container.

cloud-user@hm6-opshub-aio:~$ docker info | grep -i "Version|runc|containerd|shim"
WARNING: No swap limit support
Server Version: 18.09.9
Runtimes: runc
Default Runtime: runc
containerd version: b34a5c8af56e510852c35414db4c1f4fa6172339
runc version: 3e425f80a8c931f88e6d94a8c831b9d5aa481657
init version: fec3683
Kernel Version: 4.19.0-041900-generic
cloud-user@hm6-opshub-aio:~$docker ps | grep zookeeper-0
7d1dbb28dac3 d43be510ec44 "/custom-entrypoint.…" 7 hours ago Up 7 hours k8s_zookeeper_zookeeper-0_opshub-data_ca906dcf-c623-460a-9271-1389b09912b0_1
fb0855d4e2e6 k8s.gcr.io/pause:3.1 "/pause" 46 hours ago Up 46 hours k8s_POD_zookeeper-0_opshub-data_ca906dcf-c623-460a-9271-1389b09912b0_0
cloud-user@hm6-opshub-aio:$ docker inspect 7d1dbb28dac3
^C
cloud-user@hm6-opshub-aio:$
cloud-user@hm6-opshub-aio:$ ps -ef | grep 7d1dbb28dac3
cloud-u+ 5797 2133 0 00:31 pts/0 00:00:00 grep --color=auto 7d1dbb28dac3
root 20154 1640 0 Nov11 ? 00:00:03 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/7d1dbb28dac3216b2b5cbafa5c6b9dbb4a6565e56d9bd8c6f060ba7487b55501 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
cloud-user@hm6-opshub-aio:~$ pstree -Asp 20154
systemd(1)---containerd(1640)---containerd-shim(20154)-+-custom-entrypoi(25587)---java(17248)-+-{java}(17702)
                                                       |                                      |-{java}(18065)
                                                       |                                      |-{java}(6038)
                                                       |                                      |-{java}(14587)
                                                       |                                      |-{java}(14793)
                                                       |                                      |-{java}(15040)
                                                       |                                      |-{java}(19281)
                                                       |                                      |-{java}(19301)
                                                       |                                      |-{java}(11909)
                                                       |                                      |-{java}(12716)
                                                       |                                      |-{java}(13990)
                                                       |                                      |-{java}(14318)
                                                       |                                      |-{java}(15122)
                                                       |                                      |-{java}(15182)
                                                       |                                      |-{java}(22772)
                                                       |                                      |-{java}(26839)
                                                       |                                      |-{java}(27524)
                                                       |                                      |-{java}(18492)
                                                       |                                      |-{java}(18010)
                                                       |                                      |-{java}(18022)
                                                       |                                      |-{java}(18490)
                                                       |                                      |-{java}(20045)
                                                       |                                      |-{java}(20113)
                                                       |                                      |-{java}(9862)
                                                       |                                      |-{java}(7946)
                                                       |                                      |-{java}(18670)
                                                       |                                      |-{java}(19502)
                                                       |                                      |-{java}(19504)
                                                       |                                      |-{java}(22857)
                                                       |                                      |-{java}(22858)
                                                       |                                      |-{java}(23127)
                                                       |                                      |-{java}(23178)
                                                       |                                      |-{java}(1632)
                                                       |                                      |-{java}(1602)
                                                       |                                      |-{java}(15553)
                                                       |                                      |-{java}(19236)
                                                       |                                      |-{java}(22269)
                                                       |                                      |-{java}(503)
                                                       |                                      |-{java}(18600)
                                                       |                                      |-{java}(22767)
                                                       |                                      |-{java}(20046)
                                                       |                                      |-{java}(20061)
                                                       |                                      |-{java}(20129)
                                                       |                                      |-{java}(26671)
                                                       |                                      |-{java}(17418)
                                                       |                                      |-{java}(18604)
                                                       |                                      |-{java}(19293)
                                                       |                                      |-{java}(19302)
                                                       |                                      |-{java}(19487)
                                                       |                                      |-{java}(22024)
                                                       |                                      |-{java}(7675)
                                                       |                                      |-{java}(17292)
                                                       |                                      |-{java}(32502)
                                                       |                                      |-{java}(18581)
                                                       |                                      |-{java}(29362)
                                                       |                                      |-{java}(27718)
                                                       |                                      |-{java}(29804)
                                                       |                                      |-{java}(4248)
                                                       |                                      |-{java}(17326)
                                                       |                                      |-{java}(17339)
                                                       |                                      |-{java}(17350)
                                                       |                                      |-{java}(17449)
                                                       |                                      |-{java}(1625)
                                                       |                                      |-{java}(1638)
                                                       |                                      |-{java}(22305)
                                                       |                                      |-{java}(19299)
                                                       |                                      |-{java}(19314)
                                                       |                                      |-{java}(19363)
                                                       |                                      |-{java}(13804)
                                                       |                                      |-{java}(13848)
                                                       |                                      |-{java}(13964)
                                                       |                                      |-{java}(14855)
                                                       |                                      |-{java}(7284)
                                                       |                                      |-{java}(29541)
                                                       |                                      `-{java}(30103)
                                                       |-{containerd-shim}(20261)
                                                       |-{containerd-shim}(20385)
                                                       |-{containerd-shim}(20438)
                                                       |-{containerd-shim}(20769)
                                                       |-{containerd-shim}(21254)
                                                       |-{containerd-shim}(21839)
                                                       |-{containerd-shim}(22173)
                                                       |-{containerd-shim}(22238)
                                                       |-{containerd-shim}(30808)
                                                       `-{containerd-shim}(3440)
cloud-user@hm6-opshub-aio:~$ 
a-a-abramov commented 4 years ago

Almost the some thing as @frankui described here.

# docker --version
Docker version 18.09.5, build e8ff056
# containerd --version
containerd github.com/containerd/containerd 1.2.5 bb71b10fd8f58240ca47fbb579b9d1028eea7c84
# uname -r 
4.19.27-gentoo-r1-xen-guest-cgroups
# docker info 
Containers: 5
 Running: 5
 Paused: 0
 Stopped: 0
Images: 35
Server Version: 18.09.5
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: bb71b10fd8f58240ca47fbb579b9d1028eea7c84
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683b971d9c3ef73f284f176672c44b448662
Security Options:
 seccomp
  Profile: default
 userns
Kernel Version: 4.19.27-gentoo-r1-xen-guest-cgroups
Operating System: Gentoo/Linux
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.802GiB
Name: macroserver1.msk1.rbkmoney.net
ID: WFS2:RT76:Q6JA:BUPB:GX4A:U4ZD:DPUL:PYLQ:UQ65:LWAE:7NIS:GO6I
Docker Root Dir: /var/lib/docker/100000.100000
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: true

WARNING: No swap limit support

We have dozens of hosts with this configuration in production and still didn't find a way reproduce the issue reliably. Would greatly appreciate a workaround!

frankui commented 4 years ago

Almost the some thing as @frankui described here.

# docker --version
Docker version 18.09.5, build e8ff056
# containerd --version
containerd github.com/containerd/containerd 1.2.5 bb71b10fd8f58240ca47fbb579b9d1028eea7c84
# uname -r 
4.19.27-gentoo-r1-xen-guest-cgroups
# docker info 
Containers: 5
 Running: 5
 Paused: 0
 Stopped: 0
Images: 35
Server Version: 18.09.5
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: bb71b10fd8f58240ca47fbb579b9d1028eea7c84
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683b971d9c3ef73f284f176672c44b448662
Security Options:
 seccomp
  Profile: default
 userns
Kernel Version: 4.19.27-gentoo-r1-xen-guest-cgroups
Operating System: Gentoo/Linux
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.802GiB
Name: macroserver1.msk1.rbkmoney.net
ID: WFS2:RT76:Q6JA:BUPB:GX4A:U4ZD:DPUL:PYLQ:UQ65:LWAE:7NIS:GO6I
Docker Root Dir: /var/lib/docker/100000.100000
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: true

WARNING: No swap limit support

We have dozens of hosts with this configuration in production and still didn't find a way reproduce the issue reliably. Would greatly appreciate a workaround!

Almost the some thing as @frankui described here.

# docker --version
Docker version 18.09.5, build e8ff056
# containerd --version
containerd github.com/containerd/containerd 1.2.5 bb71b10fd8f58240ca47fbb579b9d1028eea7c84
# uname -r 
4.19.27-gentoo-r1-xen-guest-cgroups
# docker info 
Containers: 5
 Running: 5
 Paused: 0
 Stopped: 0
Images: 35
Server Version: 18.09.5
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: bb71b10fd8f58240ca47fbb579b9d1028eea7c84
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683b971d9c3ef73f284f176672c44b448662
Security Options:
 seccomp
  Profile: default
 userns
Kernel Version: 4.19.27-gentoo-r1-xen-guest-cgroups
Operating System: Gentoo/Linux
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.802GiB
Name: macroserver1.msk1.rbkmoney.net
ID: WFS2:RT76:Q6JA:BUPB:GX4A:U4ZD:DPUL:PYLQ:UQ65:LWAE:7NIS:GO6I
Docker Root Dir: /var/lib/docker/100000.100000
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: true

WARNING: No swap limit support

We have dozens of hosts with this configuration in production and still didn't find a way reproduce the issue reliably. Would greatly appreciate a workaround!

@a-a-abramov We found the trigger of our issue. The trigger is the thread leak in some of our micro-services. When the thread number reaches to the kernel pid_max limitation, the new thread or process can not be created that may cause the container to fail to be stopped by kuebelet. After we fixed the thread leak, the issue has never been observed even with docker-ce 18.06.

a-a-abramov commented 4 years ago

@frankui thanks a lot for the idea, we'll look into this :)

timchenxiaoyu commented 4 years ago

docker mount cephfs , created container inspect hang forever

scaredfinger commented 4 years ago

I have one container in this state right now. I don't need to delete it. If you want me to extract some info, please tell me what to do.

My box has docker 19.03.5.

The container process is no longer there, I have not been able to find any kernel message poiting to when it died or why. The container-shim is still there, I have tried killing it in the pass and that has worked but the container continues refusing to die.

Thanks

chilloutman commented 4 years ago

I've observed similar behavior multiple times now. While restarting, stopping the daemon did not go without a hitch. I've attached the log output including stack traces here: docker_daemon_stop.log

Output of docker info

Client:
 Debug Mode: false

Server:
 Containers: 2
  Running: 1
  Paused: 0
  Stopped: 1
 Images: 77
 Server Version: 19.03.8
 Storage Driver: overlay2
  Backing Filesystem: <unknown>
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: local
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: oob717obo4ujzmifz83hi1klj
  Is Manager: false
  Node Address: 87.239.210.11
  Manager Addresses:
   87.239.209.126:2377
   87.239.209.167:2377
   87.239.211.1:2377
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 4.15.0-106-generic
 Operating System: Ubuntu 18.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 31.38GiB
 Name: docker-swarm-5
 ID: DC7B:GOXZ:HGXG:5FZW:MKQC:BNNX:TUBW:BQP6:KK5B:YLSA:J5EJ:Q6HJ
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Output of docker version:

Client: Docker Engine - Community
 Version:           19.03.8
 API version:       1.40
 Go version:        go1.12.17
 Git commit:        afacb8b7f0
 Built:             Wed Mar 11 01:25:46 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.8
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.17
  Git commit:       afacb8b7f0
  Built:            Wed Mar 11 01:24:19 2020
  OS/Arch:          linux/amd64
  Experimental:     true
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
sanjay-90210 commented 3 years ago

docker version Client: Docker Engine - Community Version: 20.10.5 API version: 1.41 Go version: go1.13.15 Git commit: 55c4c88 Built: Tue Mar 2 20:18:15 2021 OS/Arch: linux/amd64 Context: default Experimental: true

Server: Docker Engine - Community Engine: Version: 20.10.6 API version: 1.41 (minimum version 1.12) Go version: go1.13.15 Git commit: 8728dd2 Built: Fri Apr 9 22:45:24 2021 OS/Arch: linux/amd64 Experimental: false containerd: Version: 1.4.4 GitCommit: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e runc: Version: 1.0.0-rc93 GitCommit: 12644e614e25b05da6fd08a38ffa0cfe1903fdec docker-init: Version: 0.19.0 GitCommit: de40ad0

docker info Client: Context: default Debug Mode: false Plugins: app: Docker App (Docker Inc., v0.9.1-beta3) buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)

Server: Containers: 11 Running: 11 Paused: 0 Stopped: 0 Images: 12 Server Version: 20.10.6 Storage Driver: overlay2 Backing Filesystem: extfs Supports d_type: true Native Overlay Diff: true userxattr: false Logging Driver: json-file Cgroup Driver: cgroupfs Cgroup Version: 1 Plugins: Volume: local Network: bridge host ipvlan macvlan null overlay Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog Swarm: inactive Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux Default Runtime: runc Init Binary: docker-init containerd version: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec init version: de40ad0 Security Options: apparmor seccomp Profile: default Kernel Version: 4.4.0-208-generic Operating System: Ubuntu 16.04.2 LTS OSType: linux Architecture: x86_64 CPUs: 1 Total Memory: 23.55GiB Name: bulldog-1 ID: 6KHI:GJKE:4WOD:AK3W:33HS:FB7E:JYR2:453J:LKZA:NSEJ:UZFW:AKPT Docker Root Dir: /var/lib/docker Debug Mode: false Registry: https://index.docker.io/v1/ Labels: Experimental: false Insecure Registries: 192.168.0.178:5000 127.0.0.0/8 Live Restore Enabled: false

WARNING: No swap limit support

Docker exec gets stuck with these logs lines: Apr 20 22:18:04 bulldog-1 dockerd[28396]: time="2021-04-20T22:18:04.030935425+05:30" level=error msg="Handler for POST /v1.41/exec/cfa9bc1f5616a2d0c7ca0b5f77847fbe210f4f87b331a56fea4ac21f61bfaf14/resize returned error: timeout waiting for exec session ready" Apr 20 22:18:14 bulldog-1 dockerd[28396]: time="2021-04-20T22:18:14.047437871+05:30" level=error msg="Handler for POST /v1.41/exec/cfa9bc1f5616a2d0c7ca0b5f77847fbe210f4f87b331a56fea4ac21f61bfaf14/resize returned error: timeout waiting for exec session ready" Apr 20 22:18:24 bulldog-1 dockerd[28396]: time="2021-04-20T22:18:24.060684000+05:30" level=error msg="Handler for POST /v1.41/exec/cfa9bc1f5616a2d0c7ca0b5f77847fbe210f4f87b331a56fea4ac21f61bfaf14/resize returned error: timeout waiting for exec session ready"

Stack trace file attached: Apr 20 21:57:53 bulldog-1 dockerd[28396]: time="2021-04-20T21:57:53.361896881+05:30" level=info msg="goroutine stacks written to /var/run/docker/goroutine-stacks-2021-04-20T215753+0530.log"

goroutine-stacks-2021-04-20T215753+0530.log

cpuguy83 commented 3 years ago

@sanjay-90210 Thanks for the stack trace. It looks like it is stuck trying to start the exec process.

This looks very similar to a bug caused by an issue with runc 1.0.0-rc93, which is what you appear to be running. I'd recommend downgrading runc to rc92 and I expect this will clear it up (after restarting Docker).

sanjay-90210 commented 3 years ago

@cpuguy83 Thanks for the workaround.

I downgraded just the runc executable to 1.0.0-rc92 and the containers have been up for the last 9hrs without showing the stuck docker issue. The strange thing is that I have another box with the same setup, same os and patch level, same docker version with runc 1.0.0-rc93 and it works fine without showing this issue. With 1.0.0-rc93, in the box with issues, the issue does not show up as soon as the containers are up, it only shows up after a few hrs. I will leave the boxes alone overnight and see if the downgrade works fine.

sanjay-90210 commented 3 years ago

@cpuguy83 - Thank you So after 35hrs, docker exec seems to be holding up on the box with the 1.0.0-rc92 even though on the other box with 1.0.0-rc93, it still works fine.

bittorf commented 1 month ago

same here for {{Docker version 20.10.21, build baeda1f}}

$ sudo docker exec -it "$ID" -- /bin/bash
<hangs forever>

no problems on other containers, container itself reports healthy state.

cpuguy83 commented 1 month ago

@bittorf

Please upgrade all your components (20.10 is very old). If the problem still persists, please open a new issue and provide stack traces. This issue was closed 5 years ago.