docker-archive / for-aws


Service fails to start with Cloudstor EBS Volume attached #157

Open ambrons opened 6 years ago

ambrons commented 6 years ago

Expected behavior

Service starts with the EBS volume attached

Actual behavior

My assumption is that it's taking too long to snapshot and load the EBS volume for a specific availability zone and therefore times out.

Note: the EBS volumes are 200GB, however they're currently empty.

The initial error is this:

$ swarm service ps nvkmbnmc9nwyn8ojw3v1bkjzh --no-trunc
ID                          NAME                IMAGE                                                                                                                                                     NODE                                               DESIRED STATE       CURRENT STATE              ERROR                                                                                                                                                                       PORTS
zgkin90vm4iyt242jsjeb80h4   cassandra.1         xxx.dkr.ecr.us-east-1.amazonaws.com/zco/esports/cassandra-swarm:latest@sha256:a7150a38203c44e332d05a37d8275a76a001be5b814c661d05fd73edab893437   ip-172-20-25-211.ap-southeast-1.compute.internal   Running             Preparing 13 seconds ago   "Post http://%2Frun%2Fdocker%2Fplugins%2F4125eb31d4a89cab3863d96d60403c0134f69a0d19937225a2f0839f737384e1%2Fcloudstor.sock/VolumeDriver.Mount: context deadline exceeded"   

After subsequent retries to start the service I get this error:

$ swarm service ps nvkmbnmc9nwyn8ojw3v1bkjzh --no-trunc
ID                          NAME                IMAGE                                                                                                                                                     NODE                                               DESIRED STATE       CURRENT STATE              ERROR                                                                                                                                                                                                                                                                                                                                                                                                                   PORTS
ep598hubcgsmuox2wm1wfcn1f   cassandra.1         xxx.dkr.ecr.us-east-1.amazonaws.com/zco/esports/cassandra-swarm:latest@sha256:a7150a38203c44e332d05a37d8275a76a001be5b814c661d05fd73edab893437   ip-172-20-9-77.ap-southeast-1.compute.internal     Running             Preparing 51 seconds ago                                                                                                                                                                                                                                                                                                                                                                                                                           
zgkin90vm4iyt242jsjeb80h4    \_ cassandra.1     xxx.dkr.ecr.us-east-1.amazonaws.com/zco/esports/cassandra-swarm:latest@sha256:a7150a38203c44e332d05a37d8275a76a001be5b814c661d05fd73edab893437   ip-172-20-25-211.ap-southeast-1.compute.internal   Shutdown            Rejected 51 seconds ago    "create cassandra-1: found reference to volume 'cassandra-1' in driver 'cloudstor:aws', but got an error while checking the driver: error while checking if volume "cassandra-1" exists in driver "cloudstor:aws": Post http://%2Frun%2Fdocker%2Fplugins%2F4125eb31d4a89cab3863d96d60403c0134f69a0d19937225a2f0839f737384e1%2Fcloudstor.sock/VolumeDriver.Get: context deadline exceeded: volume name must be unique"   

The service never seems to start.
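If the delay really is the cross-AZ snapshot/relocation, the copy progress should be visible from the AWS side. A rough sketch with the AWS CLI (assuming the ebs_tag_Name tag from the volume create command below actually ends up on the volume and any intermediate snapshot; the filters may need adjusting):

# is the volume still being copied/re-created in the task's availability zone?
$ aws ec2 describe-volumes --region ap-southeast-1 \
    --filters Name=tag:Name,Values=cassandra-1 \
    --query 'Volumes[].{Id:VolumeId,AZ:AvailabilityZone,State:State}'
$ aws ec2 describe-snapshots --region ap-southeast-1 --owner-ids self \
    --filters Name=tag:Name,Values=cassandra-1 \
    --query 'Snapshots[].{Id:SnapshotId,State:State,Progress:Progress}'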

Information

Docker-diagnose: 1527092193-JGugtUgVNBmvU7S8tXn0mV4ryIhPF4zc

Volumes created:

swarm volume create -d "cloudstor:aws" --opt ebstype=io1 --opt size=200 --opt iops=1000 --opt backing=relocatable --opt ebs_tag_Name=cassandra-1 cassandra-1

AWS Region: ap-southeast-1

Service Creation Setup:

docker service create \
  --name cassandra \
  --network data \
  --update-delay 60s \
  --replicas 1 \
  --with-registry-auth \
  --env LOCAL_JMX=no \
  --env SERVICE_NAME=cassandra \
  --constraint 'node.role != manager' \
  --reserve-memory 3gb \
  --mount type=volume,target=/var/lib/cassandra,source={{.Service.Name}}-{{.Task.Slot}} \
  xxx.dkr.ecr.us-east-1.amazonaws.com/zco/esports/cassandra-swarm
ambrons commented 6 years ago

I did some testing in us-east-1 last week with 50GB, io1, and 1000 IOPS, and it was working fine. However, now that I'm using ap-southeast-1 it appears to be failing with the above error, around the 3 minute mark:

$ swarm service create \
>   --name cassandra \
>   --network data \
> --update-delay 300s \
>   --replicas 1 \
>   --with-registry-auth \
>   --env LOCAL_JMX=no \
>   --env SERVICE_NAME=cassandra \
>   --constraint 'node.role != manager' \
>   --reserve-memory 3gb \
>   --mount type=volume,volume-driver=cloudstor:aws,source=asports-prod-{{.Service.Name}}-{{.Task.Slot}},destination=/var/lib/cassandra,volume-opt=backing=relocatable,volume-opt=size=150,volume-opt=ebstype=io1,volume-opt=iops=1000,volume-opt=ebs_tag_Name=asports-prod-{{.Service.Name}}-{{.Task.Slot}} \
>   xxx.dkr.ecr.us-east-1.amazonaws.com/zco/esports/cassandra-swarm
xdlfmtzojphbuoiqd5feuhdes
overall progress: 0 out of 1 tasks 
1/1: Post http://%2Frun%2Fdocker%2Fplugins%2F483cd69b6d69e5aa11bdf44a3b13345aed… 

Just to make sure it wasn't anything else, I removed the mount argument and the service runs fine with a standard container volume. It fails only when using the EBS-backed relocatable volumes.

I've also tried letting the service definition create the volume on start and that didn't seem to help either.

Here's the updated configuration:

docker service create \
  --name cassandra \
  --network data \
  --update-delay 300s \
  --replicas 1 \
  --with-registry-auth \
  --env LOCAL_JMX=no \
  --env SERVICE_NAME=cassandra \
  --constraint 'node.role != manager' \
  --reserve-memory 3gb \
  --mount type=volume,volume-driver=cloudstor:aws,source=asports-prod-{{.Service.Name}}-{{.Task.Slot}},destination=/var/lib/cassandra,volume-opt=backing=relocatable,volume-opt=size=150,volume-opt=ebstype=io1,volume-opt=iops=1000,volume-opt=ebs_tag_Name=asports-prod-{{.Service.Name}}-{{.Task.Slot}} \
  xxx.dkr.ecr.us-east-1.amazonaws.com/zco/esports/cassandra-swarm

The plugin seems to be working fine:

$ docker plugin ls
ID                  NAME                DESCRIPTION                       ENABLED
e3c2802690d7        cloudstor:aws       cloud storage plugin for Docker   true
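If it helps, the plugin's own log output should say what the mount is stuck on. A rough sketch (whether journalctl is available and where the engine logs end up depends on the host; on Docker for AWS they may only be reachable through the stack's CloudWatch log group):

# plugin ID -- it's the hash in the %2Fcloudstor.sock path from the error above
$ docker plugin inspect cloudstor:aws --format '{{.ID}}'
# plugin stdout/stderr is forwarded to the engine logs on the node running the task
$ journalctl -u docker.service --no-pager | grep -i cloudstor | tail -n 50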
ambrons commented 6 years ago

@ddebroy Do you have any thoughts? I'm dead in the water for our deployment as this doesn't appear to work as advertised.

soar commented 6 years ago

Same problem - cloudstor:aws creates and deletes volumes successfully, but hangs when I try to start a container.

# docker volume create -d "cloudstor:aws" --opt ebstype=gp2 --opt size=10 mylocalvol1
mylocalvol1
# docker volume ls
DRIVER              VOLUME NAME
cloudstor:aws       mylocalvol1
# docker run -it -v mylocalvol1:/mnt debian bash
... nothing after 10 minutes ...
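While the run hangs, a second shell on the same host shows whether anything is happening at all (nothing cloudstor-specific here):

# does the plugin still answer volume requests, or does this hang too?
$ docker volume inspect mylocalvol1
# did an EBS device ever get attached to this instance?
$ lsblk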
VictorLopess commented 6 years ago

Hello everyone. I'm having a similar problem. I am trying to create an EBS volume via cloudstor; my configuration is below:

version: '3'
services:
  rabbitmq:
    image: rabbitmq:3.6-management-alpine
    networks:
      - my-network
    ports:
      - 5672:5672
      - 15672:15672
    volumes:
      - rabbitmq_data_staging:/var/lib/rabbitmq
    logging:
      driver: "awslogs"
      options:
        awslogs-region: "us-east-1"
        awslogs-group: "queues"
        awslogs-stream: "rabbitmq-staging"
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.mylabel == mylabelvalue
      restart_policy:
        condition: on-failure
networks:
  my-network:
    external: true

volumes:
  rabbitmq_data_staging:
    driver: "cloudstor:aws"
    driver_opts:
      size: "5"
      ebstype: "gp2"
      backing: "relocatable"

Every time I run a deploy command on the swarm, it simply does not bring up the container, without giving an error or anything. The deploy is done from a manager and the task is scheduled onto a worker.

When I run the command on the manager itself, it works normally.

When I take out the volume mount,

volumes:
      - rabbitmq_data_staging:/var/lib/rabbitmq

the container goes up normally.
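Since the task only fails when it lands on a worker, one thing worth double-checking is that the plugin is installed and enabled on that worker too, not just on the manager the stack is deployed from. A quick sketch (the worker hostname is a placeholder):

# run on (or via ssh to) the worker the task is scheduled to
$ ssh docker@<worker-hostname> docker plugin ls
# cloudstor:aws must be listed with ENABLED=true on that node as well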

I tested other plugins, like rexray, and had the same problem, which makes me think that it is some incompatibility between swarm and the plugin. Can anyone help, or even tell me if I'm doing something wrong in my docker-compose?

Cloudstor itself creates the EBS volume normally, without any problem.

The versions of the plugins are:

ID                  NAME                DESCRIPTION                       ENABLED
ed0c2ebcfc92        rexray/ebs:latest   REX-Ray for Amazon EBS            false
208c7b943f6d        cloudstor:aws       cloud storage plugin for Docker   true

@ddebroy, can you help us? Thank you!

lordvlad commented 6 years ago

I have the same issue with the docker4aws 18.03 (stable) and 18.04 (edge) CloudFormation templates. I hadn't had the issue with docker4aws 17.12 (edge).

nunofernandes commented 6 years ago

Any news on this?

dodgemich commented 6 years ago

I ran into similar issues using it with ECS - I found that it worked with T2s and C4s, but it would fail in this manner with C5/M5. That might help debug the root issue.

abashev commented 5 years ago

@dodgemich you are my hero!! I spent two days trying to understand why rexray and cloudstor don't work on my shiny new t3 cluster. And I just had to migrate it to t2.

Richard-Mathie commented 5 years ago

Maybe #148 is related. I get issues like the above, with the mount point /dev/xvdf already existing, when trying to mount cloudstor:aws volumes on modern-generation AWS instances. Apparently this may have something to do with the NVMe drivers on that hardware.
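For what it's worth, on those instances the attached volume never shows up under the requested /dev/xvdf name, which is easy to see from the node itself (the second command assumes nvme-cli is installed):

$ lsblk
# on a Nitro instance the root disk and attached EBS volumes appear as
# /dev/nvme0n1, /dev/nvme1n1, ... instead of /dev/xvda, /dev/xvdf, ...
$ sudo nvme list
# the serial number column maps each nvme device back to its EBS volume ID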

lepetitpierdol commented 5 years ago

Having the exact same issue with rexray/efs... Did anyone manage to find a solution?

abashev commented 5 years ago

@lepetitpierdol you have to use instances from previous generations - t2, c4 and so on. It looks like the latest T3 and C5 types have new disk controllers that don't work with rexray/convoy/cloudstor.

gartz commented 5 years ago

No luck for me, I'm having issues with T2 on docker4aws 18.06.1 (stable) and 18.01 (edge) when mounting volumes using cloudstor.

aplex commented 5 years ago

In my case, I got this error when one of the containers got stuck and could not be stopped. It was holding a reference to a volume, so a new container could not be started. I resolved this by rebooting the host VM.

kinghuang commented 5 years ago

There have been some PRs in REX-Ray to handle the new NVMe device names (rexray/rexray#1233, rexray/rexray#1252). I've run the edge release successfully to create and mount EBS volumes on current-generation instances.

We need a similar change in Cloudstor. I really wish Docker would at least give some indication of whether they're even going to address this issue. Or, open source the code so that we can do something about it.

bandesz commented 5 years ago

@kinghuang you couldn't have said it better. I tried REX-Ray, but it's not enough for my use case. I'm using Cloudstor currently on Amazon ECS, but I'm forced to use the old instance types.

@brawong, @joeabbey sorry to mention you guys, but do you have any feedback on when the NVMe devices would be supported in Cloudstor, so we could use it on the new AWS EC2 generations (t3, m5, c5, etc.)?

rafagsiqueira commented 5 years ago

I am trying to create volumes and I am running into the same problem. My cluster is based on T2 instances, so that does not seem to be the source of the problem. Docker version 18.06.1-ce, build e68fc7a.

gartz commented 5 years ago

Version 18.03.0 works fine; I rolled back my stack and have no more issues with the cloudstor plugin.


Carlos4ndresh commented 5 years ago

I'm experiencing the same problem. I have 18.06.1-ce:

Status": { "Timestamp": "2018-10-25T16:18:57.558536532Z", "State": "preparing", "Message": "preparing", "Err": "create testapp_web: Post http://%2Frun%2Fdocker%2Fplugins%2Fcd8e305e0fc9d1030761f3dfc3a873f4923ec78a669489fb15d3123df6f1c10b%2Fcloudstor.sock/VolumeDriver.Create: context deadline exceeded", "PortStatus": {} },

I restarted the EC2 instances; after the reboot the swarm ended up with the master and a couple of workers out of the swarm. Then all services went up (after recreating the stack), but I now have another problem that I think is related, though I have no evidence.

mateodelnorte commented 5 years ago

Anybody find a solution here? Just started seeing this issue.

kinghuang commented 5 years ago

I haven't found any solutions for Cloudstor. I've started to use REX-Ray, but it has the downside that it doesn't copy EBS volumes between availability zones.

We really need Docker to provide an answer.

mateodelnorte commented 5 years ago

thanks @kinghuang. any tips or pointers to documentation on REX-Ray, in case we need to go that route?

@gartz was rolling back your stack as easy as running the cloudformation template with version 18.03.0 specified?

gartz commented 5 years ago

@mateodelnorte yes, it rolled back, but I needed to log in to the new manager and force-initialize it for it to work; after that the workers and other managers started working again.

I also edited my CloudFormation template to add EFS support for N. California (it's disabled in the original, but N. California supports it).

mateodelnorte commented 5 years ago

Currently attempting to update our CloudFormation template from 18.06.1 to 18.03.1. Our new manager came online but is clearly in an odd state:

ID                            HOSTNAME                        STATUS              AVAILABILITY        MANAGER STATUS      ENGINE VERSION
24fr9wl3maq76rwcc4j6w28q0     ip-172-22-6-98.ec2.internal     Ready               Active              Reachable           18.06.1-ce
1jdanx00fjqily9ev6rtkz158     ip-172-22-7-254.ec2.internal    Ready               Active                                  18.06.1-ce
e3ez03wlf33aohmktfnfnwaym     ip-172-22-17-55.ec2.internal    Down                Active              Reachable           18.03.0-ce
fsvz7lhywdetcgunx005gndgq     ip-172-22-17-55.ec2.internal    Ready               Active              Unreachable         18.03.0-ce
xe4sd7jp9kbpln1ysfy1dojq3     ip-172-22-17-249.ec2.internal   Ready               Active                                  18.06.1-ce
blqwttvjaaxt8z1z79ohdb0le     ip-172-22-22-66.ec2.internal    Ready               Active              Leader              18.06.1-ce
x09n7onls3cutd4cu530o60i8     ip-172-22-34-115.ec2.internal   Ready               Active                                  18.06.1-ce
jadehezhrfgrazpjnr3i972gd *   ip-172-22-40-45.ec2.internal    Ready               Active              Reachable           18.06.1-ce
Every 2s: docker node ls                                                                                           2018-11-02 23:18:54

Notice ip-172-22-17-55.ec2.internal is listed twice. That's the new manager. It's registering as both Ready and Down.
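If the managers ever get back to quorum, the stale "Down" duplicate can normally be cleared with docker node rm (node ID taken from the listing above), though that obviously doesn't fix whatever is keeping the new manager from joining:

# remove the old, Down incarnation of ip-172-22-17-55
$ docker node rm e3ez03wlf33aohmktfnfnwaym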

docker info on the new manager yields:

docker info
Containers: 5
 Running: 5
 Paused: 0
 Stopped: 0
Images: 5
Server Version: 18.03.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: fsvz7lhywdetcgunx005gndgq
 Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
 Is Manager: true
 Node Address: 172.22.17.55
 Manager Addresses:
  172.22.17.55:2377
  172.22.17.55:2377
  172.22.22.66:2377
  172.22.40.45:2377
  172.22.6.98:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfd04396dc68220d1cecbe686a6cc3aa5ce3667c
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.81-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.785GiB
Name: ip-172-22-17-55.ec2.internal
ID: OBAN:2FHN:UX7C:BHOR:DIVY:27HI:SSAI:KSVU:NVXP:WJW7:VCX3:QWFF
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
 os=linux
 region=us-east-1
 availability_zone=us-east-1b
 instance_type=m4.large
 node_type=manager
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

~ $ docker service ls
Error response from daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded

I'm not confident a swarm init --force-new-cluster on this new node will result in success. I would think it doesn't have service configuration, since it can't join and make quorum.

@gartz is this the situation you were in when you forced a new cluster?

gartz commented 5 years ago

No, you're in a new situation; the version I'm using that works is 18.03.0, not 18.03.1.

If you get the context deadline exceeded error, DO NOT run swarm init --force-new-cluster. The problem seems to be in the cloudstor plugin, and you might lose your data by forcing a new cluster while it can't communicate with EFS correctly.


mateodelnorte commented 5 years ago

That was a typo on my part. The non-connecting manager, and the upgrade version we're attempting to move toward, is 18.03.0-ce.

mapcentia commented 5 years ago

I also experience this issue on both Stable and Edge. I tried to downgrade Stable to 18.03.0-ce, but with no luck. I deploy using a docker-compose file with:

volumes:
    gc2core_var_www_geocloud2:
      driver: cloudstor:aws
      driver_opts:
        backing: shared

CURRENT STATE just keeps saying "Preparing [--] minutes ago".

EDIT 1: Just got the service up and running on a clean install of T2 instances using Edge

EDIT 2: When I tried to deploy with 2 replicas it took like 19 minutes for one service to get running. One replica did throw context deadline exceeded: volume name must be unique. But all got up and running eventually.

gartz commented 5 years ago

In my case I could downgrade one of the CloudFormation stacks and the problem was solved. In the other CloudFormation stack the downgrade itself didn't fix the problem, so I created a new CloudFormation stack using version 18.03.0-ce, then moved the data from the broken EFS to the new EFS by mounting both manually in a temporary EC2 instance. Finally, I started my docker services in the new CloudFormation stack; it detected the folders in the EFS and it worked.

Don't forget that you need to use the CloudFormation file from 18.03.0-ce, not just change the version in the current file; changing only the version won't change the AMI used to spawn instances.
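In other words the rollback has to go through a stack update that swaps the template itself, something along these lines (stack name and template URL are placeholders -- point it at the actual 18.03.0-ce Docker for AWS template, and the real stack has more parameters that would each need UsePreviousValue=true):

$ aws cloudformation update-stack \
    --stack-name my-docker4aws-stack \
    --template-url https://<bucket-hosting-the-18.03.0-ce-template>/Docker.tmpl \
    --capabilities CAPABILITY_IAM \
    --parameters ParameterKey=KeyName,UsePreviousValue=true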

I hope this information helps. It's a very frustrating problem, hard to detect and hard to fix.

anasoler commented 5 years ago

Downgrading to 18.03.0-ce from 18.06.1-ce (where I was experiencing the same issue) worked for me too.

dodgemich commented 5 years ago

In terms of NVMe support, is this getting addressed? (Seems like two issues discussed in the comments).

@FrenchBen handled https://github.com/docker/for-aws/issues/148 for root NVMe - perhaps he has some insight on adding it to Cloudstor?

kinghuang commented 5 years ago

Yeah, I think the comments here are describing two different problems.

  1. Some users are having problems with Cloudstor on 18.06.1, regardless of whether NVMe volumes are being used. Downgrading to 18.03.0 appears to be the solution for these users.
  2. Others (myself included) want Cloudstor updated to handle NVMe volumes on current generation instances.

There's been zero communication from Docker about either problem, AFAIK.

dodgemich commented 5 years ago

Agreed - my issue is (2). Not sure if it's worth cutting a new ticket to split them up, or how to get better info from Docker on when they'll address it. Without addressing that, Cloudstor is basically on the path to retirement.

kinghuang commented 5 years ago

That's a good idea. I'll create a separate issue for the second problem (NVMe mounts on current-generation instances).

Isn’t Cloudstor also part of Docker EE on AWS (Docker Certified Infrastructure)?

kinghuang commented 5 years ago

Created #184 for the second issue.

daaru00 commented 5 years ago

Same error here: deploying a new stack raises the error context deadline exceeded, and updating the stack raises a similar error: context deadline exceeded: volume name must be unique. The inability to create shareable volumes between instances creates enormous problems, and I can no longer use half of the services.

Any update on this issue?

matthewmrichter commented 5 years ago

I'm hitting this now on t3 instances.

a-marcel commented 5 years ago

I'm hitting this now on t3 instances.

Me too

darkl0rd commented 5 years ago

All "Nitro" Based instances are affected, which make use of the new "/dev/nvme*" block devices. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html#ec2-nitro-instances

Workarounds:

When you are experiencing issues with a previous-generation instance, i.e. one that does not use the new block device names yet, some have reported that downgrading to the 18.03 driver alleviates the problem. I can not personally confirm this, as I have only dealt with the former (NVMe) problem myself.
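A quick way to tell which camp a node falls into is to ask the instance metadata service for its type:

$ curl -s http://169.254.169.254/latest/meta-data/instance-type
# t2/c4/m4 and friends keep the old /dev/xvd* names; t3/c5/m5 and other
# Nitro types expose the volume as /dev/nvme* instead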

matthewmrichter commented 5 years ago

I was also one of the people who reported this issue when it popped up for RexRay as well. In case it helps with prompt resolution of the cloudstor issue, here is the relevant PR in their GitHub: https://github.com/rexray/rexray/pull/1252