[x] I searched existing issues before opening this one
Expected behavior
Removal of containers/Swarm tasks succeeds, and system-wide tasks are unaffected.
Actual behavior
Swarm: Tasks end up with a Desired State of "Remove" but remain stuck in a Current State of "Running".
Non-Swarm: Containers fail to be removed; subsequent removal attempts return a "Container removal already in progress" error, and docker container prune operations hang and never complete.
Steps to reproduce the behavior
Unfortunately we do not know the root cause at this point, so we are unable to provide complete steps to reproduce.
Swarm
The problem is highly sporadic for us. On the Swarm side we're seeing services stop responding (they appear to be unable to send network requests) and require an update --force before they work correctly again (an example of that recovery command is shown after the output below). The number of running replicas remains the same each time. When we ps these services we see tasks stuck in a "Shutdown" desired state. The example below is supposed to have 3 replicas:
$ sudo docker stack ps sentry
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
nhpbo2q1rw14 sentry_worker.1 vacasa/sentry:777eff115ee5b3b66f7b45e0147b4ae62a1cbdff ip-10-62-131-149 Running Running 19 hours ago
w4m4b4z0xqjj sentry_web.1 vacasa/sentry:777eff115ee5b3b66f7b45e0147b4ae62a1cbdff ip-10-62-129-102 Running Running 34 hours ago
wz8xi1w8m94h sentry_cron.1 vacasa/sentry:777eff115ee5b3b66f7b45e0147b4ae62a1cbdff ip-10-62-131-149 Running Running 3 weeks ago
vi3la6mzlzm1 sentry_web.1 vacasa/sentry:777eff115ee5b3b66f7b45e0147b4ae62a1cbdff m4dyfrsc6wpv5s966uwkl6arl Shutdown Running 3 weeks ago
lppbzg3vnlu9 sentry_cron.1 vacasa/sentry:777eff115ee5b3b66f7b45e0147b4ae62a1cbdff khcx23zej2i7ymgib7z75llq3 Shutdown Running 2 months ago
pntuknece9f9 sentry_worker.1 vacasa/sentry:777eff115ee5b3b66f7b45e0147b4ae62a1cbdff khcx23zej2i7ymgib7z75llq3 Shutdown Running 2 months ago
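When a service gets into this state, the only recovery that works for us is a forced update of the affected service. A representative command (the service name here is taken from the stack above; the actual target varies per incident):
$ sudo docker service update --force sentry_web
After the forced update the service responds again, but the stale tasks with a "Shutdown" desired state remain in the ps output, as shown above.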
We have tried to remove some affected Stacks, after which we see output such as the below, where the Desired State is now "Remove" rather than "Shutdown" (the removal command itself follows the output):
$ sudo docker stack ps admin
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
7lq4jgq1z9qh admin_queue_content.1 vacasa/admin:6377d228ad2dcd4b27a9d9a0e93ecfbc594dbb79 ip-10-62-131-232 Running Running 44 minutes ago
z64t49h4ei7f admin_queue_two_content.1 vacasa/admin:6377d228ad2dcd4b27a9d9a0e93ecfbc594dbb79 ip-10-62-131-232 Running Running 44 minutes ago
1bbg54ei2jm8 admin_queue_photodownload.1 vacasa/admin:6377d228ad2dcd4b27a9d9a0e93ecfbc594dbb79 ip-10-62-129-236 Running Running 44 minutes ago
pdupg0aasjke admin_queue_availability_content.1 vacasa/admin:6377d228ad2dcd4b27a9d9a0e93ecfbc594dbb79 ip-10-62-130-25 Running Running 44 minutes ago
jtsnvmx4mkd3 1015zc0j6meq0tc67uqw2sqc2.1 vacasa/admin:bde790ac185b1ca96f149b64c84c70a0ccc17e25 5epxl833x5ovhtkz369152pfi Remove Running 3 weeks ago
xpytwkrgwp86 v6m65rlgfvjp2o9l6gsbqmmkt.1 vacasa/admin:bde790ac185b1ca96f149b64c84c70a0ccc17e25 5epxl833x5ovhtkz369152pfi Remove Running 3 weeks ago
0zvqpsxlq2p3 q9qi7tx3jtq1wlscouprrihzg.1 vacasa/admin:bde790ac185b1ca96f149b64c84c70a0ccc17e25 5epxl833x5ovhtkz369152pfi Remove Running 3 weeks ago
ojtkqqnnm06l ki7g8lhgkmaoc4z9h9p6l3nrq.1 vacasa/admin:bde790ac185b1ca96f149b64c84c70a0ccc17e25 ibq9taxo3nzturkym57ea1fyk Remove Running 3 weeks ago
dpp2ugv5rl2k zp1z62txzc1vdpejmmx2e5m3h.1 vacasa/admin:bde790ac185b1ca96f149b64c84c70a0ccc17e25 ibq9taxo3nzturkym57ea1fyk Remove Running 3 weeks ago
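For reference, the removal attempt itself is nothing unusual, just a standard stack removal along the lines of (stack name from the output above):
$ sudo docker stack rm admin
After this the tasks switch to a Desired State of "Remove", as shown, but their Current State never leaves "Running".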
Non-Swarm
On the non-Swarm side we have nodes that execute raw Docker Compose-driven cron jobs. Each cron job has a setup approximating:
docker rm -f <container-name> || true
docker-compose run --rm --name=<container-name> service
Most of the time when these run, things finish as expected and the container is cleaned up (either at the end of the run or at the beginning of the next run). Sometimes, however, we see an HTTP timeout message from Compose, and on every subsequent run dockerd emits a "Container removal already in progress" error and fails our builds repeatedly:
docker-compose run --rm --name=rates-cron-kinesis-rates-stream-push push_updated_rates_to_kinesis_stream
[2018-11-13 14:12:07] live.INFO: App\Console\Commands\KinesisRatesStreamPush command triggered
[2018-11-13 14:12:07] live.INFO: App\Console\Commands\KinesisRatesStreamPush command completed
An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).
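Once a host is in this state, any attempt to remove the container by name hits the same error. Roughly (container name from the run above):
$ docker rm -f rates-cron-kinesis-rates-stream-push
# fails with the "Container removal already in progress" error described above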
Most of the time, as a troubleshooting measure, we will attempt a docker container prune on the host. This never succeeds, and indeed seems to make things worse: it appears to get held up on the previous container removal and destabilizes the entire Engine. Restarting the instance clears whatever lock is occurring and allows the prune to complete.
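The manual recovery sequence on an affected host therefore looks roughly like this (a sketch; we reboot the whole instance rather than only the daemon):
$ sudo docker container prune   # hangs indefinitely on an affected host
$ sudo reboot                   # full instance restart clears whatever lock is held
$ sudo docker container prune   # completes normally after the restart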
Output of docker version:
Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:24:51 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       e68fc7a
  Built:            Tue Aug 21 17:23:15 2018
  OS/Arch:          linux/amd64
  Experimental:     false
Additional environment details (AWS, VirtualBox, physical, etc.)
Some extra context that may be relevant: these errors may have started occurring after an upgrade of our Swarm cluster/build pool from 18.05.0 to 18.06.1. The upgrade involved a full fleet replacement -- we added brand new instances and did not upgrade through apt.
The upgrade occurred approximately three weeks ago. A number of the items we've seen on the Swarm side still say "Running 3 weeks ago", as if they still think the original nodes are online.