I thought I was fine just ignoring the extra container, considering it was from a stack that had been removed and re-deployed, until connections to the redis service started to fail in some of the nodejs services:
Error: connect ECONNREFUSED 10.0.29.14:7000
I checked the state to see if the ghost container is still there. Yep, it is, still alive after 8 days:
$ docker ps | grep "redis"
9e293525f7d4 "docker-entrypoint.s…" 30 hours ago Up 30 hours (healthy) 6379/tcp, 7000/tcp redis.1.ugjl5t7494a1fly5c0emodyz9
eff366c1acb0 "docker-entrypoint.s…" 8 days ago Up 8 days (healthy) 6379/tcp, 7000/tcp redis.1.pfma6polo81rpxx9cxoz4pqt5
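(Side note: a quicker way to see which swarm service/task each of those containers thinks it belongs to; a rough sketch relying on the com.docker.swarm.* labels that I believe swarm stamps on task containers:)
# show the swarm service/task labels carried by each redis container
$ docker ps --filter name=redis \
    --format 'table {{.ID}}\t{{.Names}}\t{{.Status}}\t{{.Label "com.docker.swarm.service.name"}}\t{{.Label "com.docker.swarm.task.id"}}'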
It's odd, because the container is still hanging around in the stack, though its desired state is now Remove and its name now shows as a hash:
$ docker stack ps --no-trunc data
ugjl5t7494a1fly5c0emodyz9 redis.1 swarm-manager000000 Running Running 30 hours ago
pfma6polo81rpxx9cxoz4pqt5 ydqo5bc<removed>.1 swarm-manager000000 Remove Running 8 days ago
Stack replicas look okay
$ docker stack services data
ID NAME MODE REPLICAS
r22m0o2v5meo redis replicated 1/1
To debug why the nodejs services were failing, I jumped onto the redis container to check the network:
$ docker exec -it 9e293525f7d4 /bin/bash
root@9e293525f7d4:/data# getent hosts
127.0.0.1 localhost
127.0.0.1 localhost ip6-localhost ip6-loopback
10.0.29.36 9e293525f7d4
Hang on, this IP is different from the IP reported in the nodejs error.
I'll try from the nodejs services on a different node, dumping a few container IDs to test:
Welcome to Docker!
swarm-worker000000:~$ docker ps | grep microse
4c6c48729de0 "/tini -- /start.sh" 30 minutes ago Up 30 minutes microservices1_2_serviceauths.1.qggswpvudcqlb6kgmb5yyk0rp
f1a114ceaa23 "/tini -- /start.sh" 30 minutes ago Up 30 minutes microservices1_2_serviceapi.1.82x3tzwadm43sk6fdubdoph3a
swarm-worker000000:~$ docker exec -it $(docker ps | grep "_serviceapi.1" | awk '{print $1}') /bin/bash
root@f1a114ceaa23:/serviceroot# ping redis
PING redis (10.0.29.14): 56 data bytes
swarm-worker000000:~$ docker exec -it $(docker ps | grep "_serviceauths.1" | awk '{print $1}') /bin/bash
root@4c6c48729de0:/serviceroot# ping redis
PING redis (10.0.29.35): 56 data bytes
Okay, the nodejs error is making sense: there is an extra IP in the DNS, which I'm guessing belongs to the ghost/orphaned container.
$ getent hosts tasks.redis
10.0.29.15 tasks.redis
10.0.29.36 tasks.redis
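(With the default VIP endpoint mode, the service name resolves to a single virtual IP while tasks.<service> lists one entry per task, so a stale task shows up as an extra record. A sketch of the comparison from inside any container attached to the overlay network; I'm using the healthy redis container here only because it is handy:)
# service VIP, the address most clients actually connect to
$ docker exec -it 9e293525f7d4 getent hosts redis
# one entry per task; an orphaned task leaves an extra record here
$ docker exec -it 9e293525f7d4 getent hosts tasks.redis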
Try to connect to the orphaned container
$ docker exec -it eff366c1acb0 /bin/bash
OCI runtime exec failed: exec failed: container_linux.go:296: starting container process caused "process_linux.go:86: executing setns process caused \"exit status 21\"": unknown
Not sure how to force-remove this container from the service.
The container is still alive:
$ docker inspect data_redis.1.pfma6polo81r
[
{
"Id": "eff366c1acb0896ed9084c07743<removed>",
"Created": "2018-03-03T18:29:09.879179113Z",
"State": {
"Status": "running",
"Running": true,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 9937,
"ExitCode": 0,
"Error": "",
"StartedAt": "2018-03-03T18:29:11.964461773Z",
"FinishedAt": "0001-01-01T00:00:00Z",
},
"Name": "/data_redis.1.pfma6polo81r<removed>",
"NetworkSettings": {
"Networks": {
"netmid": {
"IPAMConfig": {
"IPv4Address": "10.0.29.15"
},
"NetworkID": "kytls4tqjg4iu0<removed>",
"EndpointID": "97a86a57aedef4c41a7621dcd3234178f20<removed>",
"Gateway": "",
"IPAddress": "10.0.29.15",
}
}
}
}
]
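(The same information can be pulled in one line with an inspect template; a sketch using the fields shown above:)
# container state plus the IP it still holds on each attached network
$ docker inspect --format '{{.State.Status}} pid={{.State.Pid}}{{range $n, $e := .NetworkSettings.Networks}} {{$n}}={{$e.IPAddress}}{{end}}' eff366c1acb0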
The orphaned container also still shows up in the network:
$ docker network inspect netmid
[
{
"Name": "netmid",
"Scope": "swarm",
"Driver": "overlay",
"EnableIPv6": false,
"IPAM": {
"Driver": "default",
"Options": null,
"Config": [
{
"Subnet": "10.0.29.0/24",
"Gateway": "10.0.29.128"
}
]
},
"Internal": false,
"Attachable": true,
"Ingress": false,
"Containers": {
"9e293525f7d4bc91fa14a1388e103b5<removed>": {
"Name": "data_redis.1.ugjl5t7494a1fly<removed>",
"EndpointID": "d7c43657acdabc8ed2faf9c0a7a<removed>",
"MacAddress": "02:42:0a:00:1d:24",
"IPv4Address": "10.0.29.36/24",
"IPv6Address": ""
},
"eff366c1acb0896ed9084c077433149<removed>": {
"Name": "data_redis.1.pfma6polo81rpxx<removed>",
"EndpointID": "97a86a57aedef4c41a7621dcd3234<removed>",
"MacAddress": "02:42:0a:00:1d:0f",
"IPv4Address": "10.0.29.15/24",
"IPv6Address": ""
}
}
}
]
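(Likewise for the network side, a one-liner that lists just the endpoints still attached, so the stale one stands out; a sketch based on the fields above:)
# one line per endpoint still attached to the overlay network
$ docker network inspect netmid --format '{{range .Containers}}{{.Name}} {{.IPv4Address}}{{println}}{{end}}'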
Sounds like more of a stack/service cleanup issue (not a DNS issue)
I was able to force-remove the corrupt container from the network using Portainer. That at least corrects the DNS.
The container itself, though, still remains unremovable.
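(For the record, I believe the CLI equivalent of the Portainer step is roughly the following; it only detaches the endpoint and does not remove the container:)
# force-detach the stuck endpoint from the overlay network
$ docker network disconnect --force netmid eff366c1acb0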
By default the container will keep being re-deployed by swarm, as there's a desired state of it being running.
In order to get a better understanding of what's going on, it would be helpful to get the actual task history and overall service deployment logs.
After you do $ docker stack rm data
what services are shown? Can you wait for all containers to be removed? Killing the container directly will only force the orchestrator to deploy it again.
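Roughly, the task history and service-level logs can be pulled with something like this (a sketch, using the stack/service names from above):
# per-task history for the service, including shutdown/failed tasks
$ docker service ps --no-trunc data_redis
# the same view at the stack level
$ docker stack ps --no-trunc data
# aggregated container logs for the service (available on recent engines)
$ docker service logs data_redis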
By default the container will keep being re-deployed by swarm
It doesn't look like the container ID is changing over time; it stays as pfma6polo81... If the orchestrator were redeploying the container, shouldn't I expect to see a new container ID each time?
After you do docker stack rm data what services are shown
$ docker stack rm data
Removing service data_redis
$ docker stack services data
Nothing found in stack: data
$ docker service ls | grep redis
<no output>
$ docker ps | grep redis
eff366c1acb0 "docker-entrypoint.s…" 9 days ago Up 9 days (unhealthy) data_redis.1.pfma6polo81
Can you wait for all containers to be removed
Waited 2 hours; the container is still here.
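(Instead of checking by hand, the wait could have been scripted; a minimal sketch, assuming the container ID from above:)
# poll until the daemon no longer lists the container at all
while [ -n "$(docker ps -aq --filter id=eff366c1acb0)" ]; do
    docker inspect -f 'still here: {{.State.Status}} dead={{.State.Dead}}' eff366c1acb0
    sleep 60
done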
task history (was nothing too fancy)
azure provision 17.12.0 swarm from template
docker stack deploy data
(redis service with only 1 replica)
application working for around 4-8 hrs
at some point nodejs services could not connect to redis
error in redis logs = Errors writing to the AOF file: No such file or directory
(maybe cloudstor related; haven't seen this error since swapping to a driver=local volume on the redeployed stack)
First, tried a simple recycle of the unresponsive service
docker service scale data_redis=0
docker stack services data
showed REPLICAS=1/0 for a long time
Next, tried a full refresh of stack
docker stack rm data
docker stack deploy data
docker stack services data
showed REPLICAS=1/1, and the application started working (each deployment of the stack/service gets a new ID, right? They don't share configuration?)
Checked containers
docker ps | grep redis
Showing 2 containers
Tried to manually remove the orphaned container (from the first deployed stack)
docker stop rediscontainerID
hangs
docker kill rediscontainerID
hangs
docker rm rediscontainerID
Error response from daemon: removal of container rediscontainerID is already in progress
docker rm --force rediscontainerID
Error response from daemon: removal of container rediscontainerID is already in progress
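(The "already in progress" error suggests the daemon thinks a removal is underway; I believe that flag is visible in the inspect output, sketched here with the same placeholder ID:)
# check whether the daemon has flagged the container for removal
$ docker inspect -f 'status={{.State.Status}} removal-in-progress={{.RemovalInProgress}}' rediscontainerID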
Gave up for 8 days
Noticed that the DNS entries for the tasks.redis service still had both containers listed.
Portainer -> Networks -> the networks that rediscontainerID had joined -> removed the container from each network.
DNS fixed. Application stable.
$ docker logs eff366c1acb0
Error response from daemon: can not get logs from container which is dead or marked for removal
Some more issues that might be related:
https://github.com/moby/moby/issues/32620
https://github.com/moby/moby/issues/34488
https://github.com/moby/moby/issues/35910
https://github.com/moby/moby/issues/35933
I decided to have a go at cleaning up this ghost container
My plan was:
I got to step 5 before hitting problems.
Unfortunately, it looks like whatever hung the original container is also preventing the VMSS instance from completing any of these Azure UI commands:
Azure portal gets stuck with status "Updating..." for more than 30 minutes
Had to fall back to completely deleting the VMSS instance
Remove-AzureRmVmss -ResourceGroupName "NAME" -VMScaleSetName "swarm-manager-vmss" -InstanceId 0
However, it then seemed like the swarm lost quorum
Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
Possibly related:
https://github.com/moby/moby/issues/34466 (solution was to rebuild the cluster)
https://github.com/moby/moby/issues/25432#issuecomment-302227970 (however in my case I confirmed the transfer of the Leader)
https://github.com/moby/moby/issues/34722
Tried to reinitialise via:
docker swarm init --force-new-cluster
docker node rm --force <deadnodeid>
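(A quick way to check whether the managers recovered after that; a sketch:)
# list nodes and their manager status; quorum needs a majority of managers reachable
$ docker node ls
# the swarm section of docker info shows the manager/node counts the daemon believes in
$ docker info 2>/dev/null | grep -E 'Swarm:|Managers:|Nodes:'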
Unfortunately now it looks like the overlay network has died
Possibly related:
https://github.com/moby/moby/issues/34987#issuecomment-371217483
https://forums.docker.com/t/new-leader-election-failed/37760 (solution was to rebuild the cluster)
Looks like it's time to rebuild the swarm again.
Summary:
To be clear, pfma6polo81rpxx9cxoz4pqt5 is the task ID that is given to you when you execute docker service ps ...
Removing the manager node will mean a complete loss of quorum. I'd like to run the exact same app stack, for the same amount of time, on a test cluster. Could you join our Docker Community Slack and send it to me?
About 20 nodejs containers reading/writing to 1 redis container
In Azure Storage Explorer I see two files
Sounds like redis crashed
Is redis crashing due to the read/write on the cloudstor volume, or are other errors popping up?
For the last 8 days redis has run without issue using a local volume. It could have just been a one-off glitch? Definitely something to be wary of.
I'll close this issue; as I've rebuilt the swarm, I can't get any extra logs.
(btw: this was the standard cloudstor:azure volume, I have been playing around with other cloudstor plugin aliases for writing separate logs/backups)
@djeeg understood, however given the errors and the logs provided, it's difficult to pinpoint whether this is an issue at the volume level, at the infrastructure level, or simply at the application level.
Having an exact replication (i.e. your exact deployment scope, all commands typed, and a copy of the stack file used) would allow us to determine what is going on. I can hypothesize all day and use a "similar" deployment that may end up getting us nowhere.
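Concretely, something like the following captured on a manager, plus the stack file itself, would be enough to replay it (a sketch):
# engine and swarm details
$ docker version
$ docker info
# untruncated task history for the stack
$ docker stack ps --no-trunc data
# the effective service definition as swarm sees it
$ docker service inspect --pretty data_redis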
Hi,
Yesterday I went ahead and performed a full swarm rebuild of my azure swarm. The migration from 17.09-ish to 17.12.0 seemed to go smoothly, and I got everything working correctly before knocking off for the day.
This morning I found the application isn't working, and I see lots of errors from a redis service.
First I try restarting the service.
However the service doesn't scale down.
I can still see the container running.
I try removing the stack completely, however the container is still running.
Next I try redeploying the stack.
That seems to bring the application back to life, though I now have another ghost container.
I try to kill the container, but the command hangs the console:
$ docker kill eff366c1acb0
Did some digging into the docker logs; these seem related to the redis error: [could not fetch metadata: cannot read metadata: open /mnt/cloudstor/cloudstor-metadata] [no such file or directory]
I see that [Failed to deserialize netlink ndmsg: Link not found] has been fixed via https://github.com/docker/libnetwork/issues/2045; could it be related to the cloudstor failure?
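(If it helps, the state of the cloudstor plugin and its volumes can be dumped like this; a sketch, where <volume-name> stands in for whatever the stack file defines:)
# is the plugin installed, enabled, and configured as expected?
$ docker plugin ls
$ docker plugin inspect cloudstor:azure
# volumes backed by the plugin and their options
$ docker volume ls --filter driver=cloudstor:azure
$ docker volume inspect <volume-name>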
Current stack config (though I will change this now to replace cloudstor with a local volume)
And redis config
docker info