ambrons opened this issue 6 years ago
I did some testing in us-east-1 last week with 50 GB, io1, and 1000 IOPS, and it was working fine. However, now that I'm using ap-southeast-1, it fails with the above error at around the 3 minute mark:
$ docker service create \
> --name cassandra \
> --network data \
> --update-delay 300s \
> --replicas 1 \
> --with-registry-auth \
> --env LOCAL_JMX=no \
> --env SERVICE_NAME=cassandra \
> --constraint 'node.role != manager' \
> --reserve-memory 3gb \
> --mount type=volume,volume-driver=cloudstor:aws,source=asports-prod-{{.Service.Name}}-{{.Task.Slot}},destination=/var/lib/cassandra,volume-opt=backing=relocatable,volume-opt=size=150,volume-opt=ebstype=io1,volume-opt=iops=1000,volume-opt=ebs_tag_Name=asports-prod-{{.Service.Name}}-{{.Task.Slot}} \
> xxx.dkr.ecr.us-east-1.amazonaws.com/zco/esports/cassandra-swarm
xdlfmtzojphbuoiqd5feuhdes
overall progress: 0 out of 1 tasks
1/1: Post http://%2Frun%2Fdocker%2Fplugins%2F483cd69b6d69e5aa11bdf44a3b13345aed…
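For anyone triaging the same failure: before blaming the service definition, it can help to confirm whether the plugin answers at all. A minimal sketch (the probe volume name and options are just examples, not anything cloudstor requires):

```shell
# Check that the plugin is installed and enabled
docker plugin inspect --format '{{.Enabled}}' cloudstor:aws

# Try a trivial create/remove cycle outside of swarm to see whether the
# plugin responds at all, or times out the way the service create does
docker volume create -d cloudstor:aws \
  --opt backing=relocatable --opt size=1 --opt ebstype=gp2 \
  cloudstor-smoke-test \
  && docker volume rm cloudstor-smoke-test
```

If the bare volume create also hangs, the problem is in the plugin/EBS path rather than in the service definition.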
Just to make sure it wasn't anything else, I removed the mount argument, and the service runs fine with a standard container volume. The failure only occurs when using the EBS-backed relocatable volumes.
I've also tried letting the service definition create the volume on start and that didn't seem to help either.
Here's the updated configuration:
docker service create \
--name cassandra \
--network data \
--update-delay 300s \
--replicas 1 \
--with-registry-auth \
--env LOCAL_JMX=no \
--env SERVICE_NAME=cassandra \
--constraint 'node.role != manager' \
--reserve-memory 3gb \
--mount type=volume,volume-driver=cloudstor:aws,source=asports-prod-{{.Service.Name}}-{{.Task.Slot}},destination=/var/lib/cassandra,volume-opt=backing=relocatable,volume-opt=size=150,volume-opt=ebstype=io1,volume-opt=iops=1000,volume-opt=ebs_tag_Name=asports-prod-{{.Service.Name}}-{{.Task.Slot}} \
xxx.dkr.ecr.us-east-1.amazonaws.com/zco/esports/cassandra-swarm
The plugin seems to be working fine:
$ docker plugin ls
ID NAME DESCRIPTION ENABLED
e3c2802690d7 cloudstor:aws cloud storage plugin for Docker true
@ddebroy Do you have any thoughts? I'm dead in the water for our deployment as this doesn't appear to work as advertised.
Same problem: cloudstor:aws creates and deletes volumes successfully, but hangs when I try to start a container.
# docker volume create -d "cloudstor:aws" --opt ebstype=gp2 --opt size=10 mylocalvol1
mylocalvol1
# docker volume ls
DRIVER VOLUME NAME
cloudstor:aws mylocalvol1
# docker run -it -v mylocalvol1:/mnt debian bash
... nothing after 10 minutes ...
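When the run hangs like this, checking the AWS side can show whether the EBS volume was ever created and attached. A sketch, assuming the AWS CLI is configured on the node; filtering on a Name tag matching the volume name is an assumption about how cloudstor tags its volumes:

```shell
# See whether an EBS volume for mylocalvol1 exists, and whether it is
# stuck in "attaching" (a common symptom on NVMe-based instances)
aws ec2 describe-volumes \
  --filters "Name=tag:Name,Values=mylocalvol1" \
  --query 'Volumes[].{Id:VolumeId,State:State,AZ:AvailabilityZone,Attach:Attachments[0].State}' \
  --output table
```

If the volume shows as attached but the container never starts, the hang is likely in the plugin's device lookup rather than in EC2.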
Hello everyone. I'm having a similar problem. I am trying to create an EBS volume via cloudstor; my configuration is below:
version: '3'
services:
  rabbitmq:
    image: rabbitmq:3.6-management-alpine
    networks:
      - my-network
    ports:
      - 5672:5672
      - 15672:15672
    volumes:
      - rabbitmq_data_staging:/var/lib/rabbitmq
    logging:
      driver: "awslogs"
      options:
        awslogs-region: "us-east-1"
        awslogs-group: "queues"
        awslogs-stream: "rabbitmq-staging"
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.mylabel == mylabelvalue
      restart_policy:
        condition: on-failure
networks:
  my-network:
    external: true
volumes:
  rabbitmq_data_staging:
    driver: "cloudstor:aws"
    driver_opts:
      size: "5"
      ebstype: "gp2"
      backing: "relocatable"
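Since this works on the manager but not on a worker, one way to narrow it down is to exercise the plugin directly on the worker where the task is scheduled, separating "plugin broken on that node" from "swarm scheduling problem". A sketch; the worker address and probe volume name are placeholders:

```shell
# SSH to the worker the task lands on and exercise the plugin there
ssh docker@<worker-address> '
  docker plugin ls
  # Create a throwaway cloudstor volume on that node
  docker volume create -d cloudstor:aws \
    --opt backing=relocatable --opt size=1 --opt ebstype=gp2 probe-vol
  # Try to mount it, bounded by a timeout so the probe cannot hang forever
  timeout 120 docker run --rm -v probe-vol:/mnt alpine true \
    && echo "mount OK" || echo "mount hung or failed"
  docker volume rm probe-vol
'
```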
Every time I run a deploy command in swarm, it simply does not bring up the container, without giving an error or anything. The command is run from a manager, and the task is scheduled on a worker. When I run the command on the manager itself, it works normally.
When I take out the command to mount the volume,
volumes:
  - rabbitmq_data_staging:/var/lib/rabbitmq
the container comes up normally.
I tested other plugins like rexray and had the same problem, which makes me think it is some incompatibility between swarm and the plugin. Can anyone help, or tell me if I'm doing something wrong in my docker-compose?
Cloudstor creates the EBS volume normally, without any problem.
The versions of the plugins are:
ID NAME DESCRIPTION ENABLED
ed0c2ebcfc92 rexray/ebs:latest REX-Ray for Amazon EBS false
208c7b943f6d cloudstor:aws cloud storage plugin for Docker true
@ddebroy , you can help us? Thank You!
I have the same issue with the docker4aws 18.03 (stable) and 18.04 (edge) CloudFormation templates. I didn't have the issue with docker4aws 17.12 (edge).
Any news on this?
I ran into similar issues using it with ECS. I found that it worked with T2s and C4s, but would fail in this manner with C5/M5 instances; that might help debug the root issue.
@dodgemich you are my hero!! I spent two days trying to understand why rexray and cloudstor don't work on my new shiny t3 cluster. I just had to migrate it to t2.
Maybe #148 is related. I get issues like the above, plus the mount point /dev/xvdf already existing, when trying to mount cloudstor:aws volumes on modern-generation AWS instances. Apparently this may have something to do with the NVMe drivers on that hardware.
Having the exact same issue with rexray/efs... Did anyone manage to find a solution?
@lepetitpierdol you have to use instances from previous generations: t2, c4, and so on. It looks like the latest t3 and c5 instances have new disk controllers that don't work with rexray/convoy/cloudstor.
No luck for me, I'm having issues with T2 on docker4aws 18.06.1 (stable) and 18.01 (edge) when mounting volumes using cloudstor.
In my case, I got this error when one of the containers got stuck and could not be stopped. It was holding a reference to a volume, so a new container could not be started. I resolved this by rebooting the host VM.
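Before resorting to rebooting the host VM, it may be worth finding the container that still holds the volume reference and removing it directly. A sketch; the volume and container names are placeholders:

```shell
# List every container (running or not) that references the volume
docker ps -a --filter volume=rabbitmq_data_staging \
  --format '{{.ID}}  {{.State}}  {{.Names}}'

# If a stuck container shows up, try force-removing it first
docker rm -f <container-id>
```

Only if the force-remove itself hangs does the host reboot become necessary.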
There have been some PRs in REX-Ray to handle the new NVMe device names (rexray/rexray#1233, rexray/rexray#1252). I've run the edge release successfully to create and mount EBS volumes on current generation instances.
We need a similar change in Cloudstor. I really wish Docker would at least give some indication of whether they're even going to address this issue. Or, open source the code so that we can do something about it.
@kinghuang you couldn't have said it better. I tried REX-Ray, but it's not enough for my use case. I'm using Cloudstor currently on Amazon ECS, but I'm forced to use the old instance types.
@brawong, @joeabbey sorry to mention you guys, but do you have any feedback on when the NVMe devices would be supported in Cloudstor, so we could use it on the new AWS EC2 generations (t3, m5, c5, etc.)?
I am trying to create volumes and I am running into the same problem. My cluster is based on T2 instances, so that does not seem to be the source of the problem. Docker version 18.06.1-ce, build e68fc7a.
Version 18.03.0 works fine. I rolled back my stack and have no more issues with the cloudstor plugin.
I'm experiencing the same problem. I have 18.06.1-ce:
"Status": {
  "Timestamp": "2018-10-25T16:18:57.558536532Z",
  "State": "preparing",
  "Message": "preparing",
  "Err": "create testapp_web: Post http://%2Frun%2Fdocker%2Fplugins%2Fcd8e305e0fc9d1030761f3dfc3a873f4923ec78a669489fb15d3123df6f1c10b%2Fcloudstor.sock/VolumeDriver.Create: context deadline exceeded",
  "PortStatus": {}
},
I restarted the EC2 instances, and after the reboot the swarm ended up with the master and a couple of workers out of the swarm. Then all services went up (previously I was recreating the stack), but now I have another problem, which I think is related, though I have no evidence.
Anybody find a solution here? Just started seeing this issue.
I haven't found any solutions for Cloudstor. I've started to use REX-Ray, but it has the downside that it doesn't copy EBS volumes between availability zones.
We really need Docker to provide an answer.
Thanks @kinghuang. Any tips or pointers to documentation on REX-Ray, in case we need to go that route?
@gartz was rolling back your stack as easy as running the cloudformation template with version 18.03.0 specified?
@mateodelnorte yes, it rolled back, but I needed to log in to the new manager and force-initialize it to get it working; after that the workers and other managers started working again.
I also edited my cloud formation template to add EFS support to N. California (it's disabled in the original, but N. California supports it).
Currently attempting to update our CloudFormation template from 18.06.1 to 18.03.1. Our new manager came online but is clearly in an odd state:
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
24fr9wl3maq76rwcc4j6w28q0 ip-172-22-6-98.ec2.internal Ready Active Reachable 18.06.1-ce
1jdanx00fjqily9ev6rtkz158 ip-172-22-7-254.ec2.internal Ready Active 18.06.1-ce
e3ez03wlf33aohmktfnfnwaym ip-172-22-17-55.ec2.internal Down Active Reachable 18.03.0-ce
fsvz7lhywdetcgunx005gndgq ip-172-22-17-55.ec2.internal Ready Active Unreachable 18.03.0-ce
xe4sd7jp9kbpln1ysfy1dojq3 ip-172-22-17-249.ec2.internal Ready Active 18.06.1-ce
blqwttvjaaxt8z1z79ohdb0le ip-172-22-22-66.ec2.internal Ready Active Leader 18.06.1-ce
x09n7onls3cutd4cu530o60i8 ip-172-22-34-115.ec2.internal Ready Active 18.06.1-ce
jadehezhrfgrazpjnr3i972gd * ip-172-22-40-45.ec2.internal Ready Active Reachable 18.06.1-ce
Notice ip-172-22-17-55.ec2.internal is listed twice. That's the new manager. It's registering as both Ready and Down.
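For what it's worth, a stale duplicate entry like that can sometimes be cleaned up from a healthy manager; this is only a sketch (the ID comes from the node listing above) and may not work if quorum is already lost:

```shell
# The Down duplicate is still registered as a manager, so demote first
docker node demote e3ez03wlf33aohmktfnfnwaym

# Then remove the stale entry from the node list
docker node rm e3ez03wlf33aohmktfnfnwaym
```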
docker info on the new manager yields:
Containers: 5
Running: 5
Paused: 0
Stopped: 0
Images: 5
Server Version: 18.03.0-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
NodeID: fsvz7lhywdetcgunx005gndgq
Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Is Manager: true
Node Address: 172.22.17.55
Manager Addresses:
172.22.17.55:2377
172.22.17.55:2377
172.22.22.66:2377
172.22.40.45:2377
172.22.6.98:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfd04396dc68220d1cecbe686a6cc3aa5ce3667c
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 4.9.81-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.785GiB
Name: ip-172-22-17-55.ec2.internal
ID: OBAN:2FHN:UX7C:BHOR:DIVY:27HI:SSAI:KSVU:NVXP:WJW7:VCX3:QWFF
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
os=linux
region=us-east-1
availability_zone=us-east-1b
instance_type=m4.large
node_type=manager
Experimental: true
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
~ $ docker service ls
Error response from daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I'm not confident a swarm init --force-new-cluster on this new node will result in success. I would think it doesn't have the service configuration, since it can't join and reach quorum.
@gartz is this the situation you were in when you forced a new cluster?
No, you're in a new situation. The version I'm using that works is 18.03.0, not 18.03.1.
If you get the context deadline exceeded error, DO NOT run swarm init --force-new-cluster. The problem seems to be in the cloudstor plugin, and you might lose your data by forcing a new cluster while unable to communicate with EFS correctly.
That was a typo on my part. The non-connecting manager, and the upgrade version we're attempting to move toward, is 18.03.0-ce.
I also experience this issue on both Stable and Edge. I tried to downgrade stable to 18.03.0-ce, but with no luck. I deploy using a docker-compose with:
volumes:
  gc2core_var_www_geocloud2:
    driver: cloudstor:aws
    driver_opts:
      backing: shared
CURRENT STATE just keeps saying "Preparing [--] minutes ago".
EDIT 1: Just got the service up and running on a clean install of T2 instances using Edge
EDIT 2:
When I tried to deploy with 2 replicas, it took about 19 minutes for one service to get running. One replica did throw "context deadline exceeded: volume name must be unique", but all got up and running eventually.
In my case, downgrading one of the CloudFormation stacks solved the problem. In the other, the downgrade itself didn't fix it, so I created a new CloudFormation stack using version 18.03.0-ce, then moved the data from the broken EFS to the new EFS by mounting both manually on a temporary EC2 instance. Finally, I started my Docker services in the new stack; it detected the folders on the EFS and it worked.
Don't forget that you need to use the CloudFormation file from 18.03.0-ce. Just changing the version in the current file won't change the AMI used to spawn instances.
I hope this information helps. It's a very frustrating problem, hard to detect and hard to fix.
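For anyone repeating that EFS-to-EFS copy, the manual mount step looks roughly like this. The filesystem IDs and region are placeholders, and it assumes the temporary instance's security group allows NFS (port 2049) to both filesystems:

```shell
# Mount both the broken and the new EFS filesystems on a temporary instance
sudo mkdir -p /mnt/efs-old /mnt/efs-new
sudo mount -t nfs4 -o nfsvers=4.1 \
  fs-OLDID.efs.us-east-1.amazonaws.com:/ /mnt/efs-old
sudo mount -t nfs4 -o nfsvers=4.1 \
  fs-NEWID.efs.us-east-1.amazonaws.com:/ /mnt/efs-new

# Copy the data across, preserving permissions and ownership
sudo rsync -a /mnt/efs-old/ /mnt/efs-new/
```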
Downgrading to 18.03.0-ce from 18.06.1-ce (where I was experiencing the same issue) worked for me too.
In terms of NVMe support, is this getting addressed? (There seem to be two issues discussed in the comments.)
@FrenchBen handled https://github.com/docker/for-aws/issues/148 for root NVMe; perhaps he has some insight on adding it to Cloudstor?
Yeah, I think the comments here are describing two different problems.
1. Some users are having problems with Cloudstor on 18.06.1, regardless of whether NVMe volumes are being used. Downgrading to 18.03.0 appears to be the solution for these users.
2. Others (myself included) want Cloudstor updated to handle NVMe volumes on current generation instances.
There's been zero communication from Docker about either problem, AFAIK.
Agreed, my issue is (2). Not sure if it's worth cutting a new ticket to split them up, or how to get better info from Docker on when they'll address it. Without that, Cloudstor is basically on the path to retirement.
That's a good idea. I'll create an issue for the second problem (NVMe mounts on current generation instances).
Isn’t Cloudstor also part of Docker EE on AWS (Docker Certified Infrastructure)?
Created #184 for the second issue.
Same error here. Deploying a new stack raises the error "context deadline exceeded", and updating the stack raises a similar error: "context deadline exceeded: volume name must be unique".
The inability to create shareable volumes between instances creates enormous problems, and I can no longer use half of my services.
Any update on this issue?
I'm hitting this now on t3 instances.
I'm hitting this now on t3 instances.
Me too
All "Nitro"-based instances are affected, as they use the new "/dev/nvme*" block devices. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html#ec2-nitro-instances
Workarounds:
If you are experiencing issues with a previous-generation instance (i.e. one that does not use the new block device names), some have reported that downgrading to the 18.03 driver alleviates the problem. I cannot personally confirm this, as I have only dealt with the NVMe problem myself.
I was also one of the people who reported this issue when it popped up for REX-Ray. In case it helps with prompt resolution of the cloudstor issue, here is the relevant PR in their GitHub: https://github.com/rexray/rexray/pull/1252
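To see the renaming problem for yourself on an affected node, the following sketch can help. The ebsnvme-id helper is an assumption that you're on an Amazon Linux based AMI; on other distros, nvme-cli exposes similar information:

```shell
# On t2/c4, an attached EBS volume appears under the requested name
# (e.g. /dev/xvdf); on Nitro instances (t3/c5/m5) it appears as an
# NVMe device instead, which is what trips up the plugin.
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT

# On Amazon Linux, ebsnvme-id maps an NVMe device back to the device
# name that the EBS attachment originally requested
sudo /sbin/ebsnvme-id /dev/nvme1n1
```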
Expected behavior
Service starts with the EBS volume attached
Actual behavior
My assumption is that it's taking too long to snapshot and load the EBS volume for a specific availability zone and therefore times out.
Note: the EBS volumes are 200GB, however they're currently empty.
The initial error is this:
After subsequent retries to start the service, I get this error:
The service never seems to start.
Information
Docker-diagnose: 1527092193-JGugtUgVNBmvU7S8tXn0mV4ryIhPF4zc
Volumes created:
AWS Region: ap-southeast-1
Service Creation Setup: