pekuz opened this issue 2 years ago
A container of the image started normally:
[eesbadmin@eu50mqvq019 ~ ]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
43e0fdccad0e docker-eurofins-eesb.packages.eurofins.local/com/eurofins/eesb/eesb-monitoring-agent-keep:3.0.1 "sh -c 'java ${JAVA_…" 5 hours ago Up 5 hours 0.0.0.0:32000->8080/tcp uat-itaag20-hmb01-eesb-monitoring-agent-keep-01
but subsequent docker stop / exec / inspect calls hung on one of the hosts only.
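For illustration, a sketch of the kind of calls that hung (the container name is taken from the docker ps listing above; my exact invocations may have differed slightly):

```
# each of these hung indefinitely on the problematic host:
docker stop uat-itaag20-hmb01-eesb-monitoring-agent-keep-01
docker exec -it uat-itaag20-hmb01-eesb-monitoring-agent-keep-01 sh
docker inspect uat-itaag20-hmb01-eesb-monitoring-agent-keep-01
```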
Another observation: on the problematic host the process tree looks like this:
├─docker
│ ├─ff23c8be949c3ba2c14de3208b9388a629bec877ddb96645f6c190cddbf58051
│ │ └─14309 java ... -jar eesb-message-transfer-solution.jar ${@}
│ ├─0c5050bace8478454bc8b531615a8d23f11feabbb3e5ab2b0433aa80de7f5f79
│ │ └─14290 java -jar eesb-message-tester.jar ${@}
│ ├─e3562f77421677aad5985852375e614759367347f09367f76e04d4b9aff005c0
│ │ └─14257 java -jar eesb-consul-synchronizer.jar
│ ├─b38cdb30287cd597a17fe5616c1730f206d6301a03ce0d34de7f9a10bcada50d
│ │ └─14231 java ... -jar eesb-file-transfer-solution.jar
│ └─d48c42642090ae062427d81723a3701d45db7febd82d843c05f3e1096a379b16
│ ├─14146 /usr/bin/dumb-init /bin/sh /usr/local/bin/docker-entrypoint.sh agent -server
│ └─14308 consul agent -data-dir=/consul/data -config-dir=/consul/config -server
...
├─docker.service
│ ├─ 9740 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 32000 -container-ip 172.17.0.7 -container-port 8080
│ ├─13923 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
│ ├─14101 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 8630 -container-ip 172.17.0.2 -container-port 8500
│ ├─14131 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 30100 -container-ip 172.17.0.3 -container-port 8080
│ ├─14156 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 30200 -container-ip 172.17.0.4 -container-port 8080
│ ├─14200 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 31400 -container-ip 172.17.0.5 -container-port 8080
│ └─14249 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 30000 -container-ip 172.17.0.6 -container-port 8080
....
├─containerd.service
│ ├─ 3193 /usr/bin/containerd
│ ├─14112 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/d48c42642090ae062427d81723a3701d45db7febd82d843c05f3e1096a379b16 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
│ ├─14157 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/b38cdb30287cd597a17fe5616c1730f206d6301a03ce0d34de7f9a10bcada50d -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
│ ├─14180 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/e3562f77421677aad5985852375e614759367347f09367f76e04d4b9aff005c0 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
│ ├─14212 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/0c5050bace8478454bc8b531615a8d23f11feabbb3e5ab2b0433aa80de7f5f79 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
│ └─14270 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/ff23c8be949c3ba2c14de3208b9388a629bec877ddb96645f6c190cddbf58051 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
....
The count mismatch between the two listings smells a bit, and indeed the missing container process is the one for the image above. The inconsistency between docker ps and reality, including docker exec/top/stop/inspect, points towards #555.
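A quick way to spot this kind of mismatch (a sketch; it relies on the one-shim-per-running-container layout visible in the tree above):

```
docker ps -q | wc -l         # running containers according to dockerd
pgrep -c -f containerd-shim  # containerd shims actually alive
```

Here the container that docker ps still lists as Up has no corresponding shim or process in the tree.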
While searching for the root cause I destroyed the repro case.
I discovered that the initially observed docker exec problem is a follow-up problem: Docker starts to misbehave on host B earlier, at the first docker stop <container> for a container of the image.
In the host B log there was:
Feb 1 14:46:26 eu50mqvq019 dockerd: time="2022-02-01T14:46:26.218331451+01:00" level=info msg="Container 442db04451ff10029d686d33770814b836bb01fdd9f22ef626559723e4d6e5d0 failed to exit within 61 seconds of signal 15 - using the force"
The docker stop <container> had not returned (did the force fail too?), so after 5 minutes I killed it.
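For context: docker stop sends SIGTERM and escalates to SIGKILL after a timeout. The 61 seconds in the log suggest the stop ran with a timeout of roughly 60 seconds, i.e. something like:

```
# assumption inferred from the log line, not confirmed: a ~60s stop timeout
docker stop -t 60 <container>
```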
In a further experiment, sending kill -15 <container-pid> directly terminates the container instantly. This feeds the hypothesis that the container does terminate, but Docker somehow misses the exit event: it attempts the force stop and then keeps waiting for an exit event from the already-terminated container.
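A sketch of that experiment (the pgrep pattern is a hypothetical placeholder; any way of locating the container's main PID on the host, e.g. from the process tree above, works):

```
# hypothetical pattern; locate the container's main process on the host
CONTAINER_PID=$(pgrep -f eesb-monitoring-agent-keep)
kill -15 "$CONTAINER_PID"   # the container process exits instantly...
docker ps                   # ...yet docker keeps waiting for an exit event
```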
In a wider search comparing host A and host B for other possible differences and causes, I found that host B could not open a connection to the fluentd server, while host A could. Switching the Docker log driver from fluentd to json-file resolved the problem.
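The switch can be made per container or daemon-wide; a sketch, assuming a standard systemd setup:

```
# per container:
docker run --log-driver json-file <image>

# or daemon-wide: set { "log-driver": "json-file" } in /etc/docker/daemon.json,
# then restart the daemon:
sudo systemctl restart docker
```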
So far it looks like the fluentd log driver can ruin container operations without leaving meaningful diagnostic traces.
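If fluentd must stay, a possible mitigation (untested here, and assuming a Docker version that supports these log options) is to decouple container I/O from the fluentd connection:

```
# assumption: non-blocking mode buffers log lines instead of stalling the container
docker run --log-driver fluentd \
  --log-opt fluentd-async-connect=true \
  --log-opt mode=non-blocking \
  --log-opt max-buffer-size=4m \
  <image>
```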
Expected behavior
Given two hosts with the same kernel and the same Docker version, the respective containers should work on both hosts or on neither.
Actual behavior
Depending on the host, the docker exec succeeds (A) or fails (B).
Steps to reproduce the behavior
Execute
on both hosts.
I have no clue why host B fails, so I am sharing the strace output from both A and B in the hope it gives a clue.
strace reported:
host A
host B
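For reference, one way such traces can be captured (an assumed invocation; the exact command I used is not preserved here):

```
strace -f -o docker-exec.strace docker exec <container> true
```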
Output of docker version:
Output of docker info:
host A
host B
Additional environment details (AWS, VirtualBox, physical, etc.)
VMs