[x] I searched existing issues before opening this one
Expected behavior
Removal of containers/Swarm tasks succeeds, and system-wide tasks are unaffected.
Actual behavior
Swarm: Tasks end up with a Desired State of "Remove" but remain stuck in a Current State of "Running".
Non-Swarm: Containers fail to be removed; subsequent removal attempts return a "Container removal already in progress" error, and docker container prune operations hang and never complete.
Steps to reproduce the behavior
Unfortunately we do not know the root cause at this point, so we are unable to provide complete steps to reproduce.
Swarm
The problem is highly sporadic for us. On the Swarm side we're seeing services stop responding (they appear to be unable to send network requests) and require an update --force before they work correctly again (an example of that recovery command is shown after the output below). The number of running replicas remains the same each time. When we ps these services we see tasks stuck in a "Shutdown" desired state. The example below is supposed to have 3 replicas:
$ sudo docker stack ps sentry
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
nhpbo2q1rw14 sentry_worker.1 vacasa/sentry:777eff115ee5b3b66f7b45e0147b4ae62a1cbdff ip-10-62-131-149 Running Running 19 hours ago
w4m4b4z0xqjj sentry_web.1 vacasa/sentry:777eff115ee5b3b66f7b45e0147b4ae62a1cbdff ip-10-62-129-102 Running Running 34 hours ago
wz8xi1w8m94h sentry_cron.1 vacasa/sentry:777eff115ee5b3b66f7b45e0147b4ae62a1cbdff ip-10-62-131-149 Running Running 3 weeks ago
vi3la6mzlzm1 sentry_web.1 vacasa/sentry:777eff115ee5b3b66f7b45e0147b4ae62a1cbdff m4dyfrsc6wpv5s966uwkl6arl Shutdown Running 3 weeks ago
lppbzg3vnlu9 sentry_cron.1 vacasa/sentry:777eff115ee5b3b66f7b45e0147b4ae62a1cbdff khcx23zej2i7ymgib7z75llq3 Shutdown Running 2 months ago
pntuknece9f9 sentry_worker.1 vacasa/sentry:777eff115ee5b3b66f7b45e0147b4ae62a1cbdff khcx23zej2i7ymgib7z75llq3 Shutdown Running 2 months ago
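When a service gets into this state, the only recovery that works for us is a forced update of the affected service. A representative command (the service name here is taken from the stack above; the actual target varies per incident):
$ sudo docker service update --force sentry_web
After the forced update the service responds again, but the stale tasks with a "Shutdown" desired state remain in the ps output, as shown above.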
We have tried to remove some affected Stacks, after which we see output such as the below, where the Desired State is now "Remove" rather than "Shutdown" (the removal command itself follows the output):
$ sudo docker stack ps admin
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
7lq4jgq1z9qh admin_queue_content.1 vacasa/admin:6377d228ad2dcd4b27a9d9a0e93ecfbc594dbb79 ip-10-62-131-232 Running Running 44 minutes ago
z64t49h4ei7f admin_queue_two_content.1 vacasa/admin:6377d228ad2dcd4b27a9d9a0e93ecfbc594dbb79 ip-10-62-131-232 Running Running 44 minutes ago
1bbg54ei2jm8 admin_queue_photodownload.1 vacasa/admin:6377d228ad2dcd4b27a9d9a0e93ecfbc594dbb79 ip-10-62-129-236 Running Running 44 minutes ago
pdupg0aasjke admin_queue_availability_content.1 vacasa/admin:6377d228ad2dcd4b27a9d9a0e93ecfbc594dbb79 ip-10-62-130-25 Running Running 44 minutes ago
jtsnvmx4mkd3 1015zc0j6meq0tc67uqw2sqc2.1 vacasa/admin:bde790ac185b1ca96f149b64c84c70a0ccc17e25 5epxl833x5ovhtkz369152pfi Remove Running 3 weeks ago
xpytwkrgwp86 v6m65rlgfvjp2o9l6gsbqmmkt.1 vacasa/admin:bde790ac185b1ca96f149b64c84c70a0ccc17e25 5epxl833x5ovhtkz369152pfi Remove Running 3 weeks ago
0zvqpsxlq2p3 q9qi7tx3jtq1wlscouprrihzg.1 vacasa/admin:bde790ac185b1ca96f149b64c84c70a0ccc17e25 5epxl833x5ovhtkz369152pfi Remove Running 3 weeks ago
ojtkqqnnm06l ki7g8lhgkmaoc4z9h9p6l3nrq.1 vacasa/admin:bde790ac185b1ca96f149b64c84c70a0ccc17e25 ibq9taxo3nzturkym57ea1fyk Remove Running 3 weeks ago
dpp2ugv5rl2k zp1z62txzc1vdpejmmx2e5m3h.1 vacasa/admin:bde790ac185b1ca96f149b64c84c70a0ccc17e25 ibq9taxo3nzturkym57ea1fyk Remove Running 3 weeks ago
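For reference, the removal attempt itself is nothing unusual, just a standard stack removal along the lines of (stack name from the output above):
$ sudo docker stack rm admin
After this the tasks switch to a Desired State of "Remove", as shown, but their Current State never leaves "Running".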
Non-Swarm
On the non-Swarm side we have nodes that execute raw Docker Compose-driven cron jobs. Each cron job has a setup approximating:
docker rm -f <container-name> || true
docker-compose run --rm --name=<container-name> service
Most of the time when these run, things finish as expected and the container is cleaned up (either at the end of the run or at the beginning of the next run). Sometimes, however, we see an HTTP timeout message from Compose, and on every subsequent run dockerd emits a "Container removal already in progress" error and fails our builds repeatedly:
docker-compose run --rm --name=rates-cron-kinesis-rates-stream-push push_updated_rates_to_kinesis_stream
[2018-11-13 14:12:07] live.INFO: App\Console\Commands\KinesisRatesStreamPush command triggered
[2018-11-13 14:12:07] live.INFO: App\Console\Commands\KinesisRatesStreamPush command completed
An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).
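Once a host is in this state, any attempt to remove the container by name hits the same error. Roughly (container name from the run above):
$ docker rm -f rates-cron-kinesis-rates-stream-push
# fails with the "Container removal already in progress" error described above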
Most of the time, as a troubleshooting measure, we will attempt a docker container prune on the host. This never succeeds, and indeed seems to make things worse: it appears to get held up on the previous container removal and destabilizes the entire Engine. Restarting the instance clears whatever lock is occurring and allows the prune to complete.
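The manual recovery sequence on an affected host therefore looks roughly like this (a sketch; we reboot the whole instance rather than only the daemon):
$ sudo docker container prune   # hangs indefinitely on an affected host
$ sudo reboot                   # full instance restart clears whatever lock is held
$ sudo docker container prune   # completes normally after the restart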
Output of docker version:
Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:24:51 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       e68fc7a
  Built:            Tue Aug 21 17:23:15 2018
  OS/Arch:          linux/amd64
  Experimental:     false
Additional environment details (AWS, VirtualBox, physical, etc.)
Some extra context that may be relevant: these errors may have started occurring after an upgrade of our Swarm cluster/build pool from 18.05.0 to 18.06.1. The upgrade involved a full fleet replacement -- we added brand new instances and did not upgrade through apt.
The upgrade occurred approximately three weeks ago. A number of the items we've seen on the Swarm side still say "Running 3 weeks ago", as if they still think the original nodes are online.