docker-archive / for-aws


Cloudstor EBS Volume Recreated on Service Restart, Original Volume Destroyed #176

Open dviator opened 5 years ago

dviator commented 5 years ago

Expected behavior

When a swarm service crashes and restarts, it mounts the same cloudstor EBS volume it was using before it restarted.
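
For context, the volume and service were set up in the usual Cloudstor way, roughly like the following (the volume name, size, and image are placeholders rather than our exact configuration):

~ $ docker volume create -d "cloudstor:aws" --opt ebstype=gp2 --opt size=50 --opt backing=relocatable jenkins-data
~ $ docker service create --name jenkins --mount type=volume,volume-driver=cloudstor:aws,source=jenkins-data,destination=/var/jenkins_home jenkins/jenkins:lts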

Actual behavior

A new EBS volume was created with the same CloudstorVolumeName in AWS. The new volume was mounted in the restarted service, which happens to be a Jenkins master. As a result, the service lost access to its configuration data and appeared to come up as an entirely fresh instance.

At this point, the original volume was listed in AWS as 'available', while the newly created volume was listed as 'in-use'.

Unfortunately, while investigating the issue, we restarted the service. This reproduced the problem and caused yet another new volume to be created and mounted by the service, making three volumes in total.

At this point, I happened to notice in the AWS console that the original volume containing our actual data had been destroyed and had disappeared from the console. The two remaining EBS volumes with the same Cloudstor name are now both listed as 'in-use'. I am not entirely sure which one is actually mounted in the container, though I suspect it is the latest one.
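
For anyone checking the same thing, a query along these lines should list the volumes that share a CloudstorVolumeName tag and show their state and attachment (the tag value below is a placeholder, and this assumes the duplicate volumes carry the same CloudstorVolumeName tag that shows in the console):

~ $ aws ec2 describe-volumes --filters Name=tag:CloudstorVolumeName,Values=jenkins-data --query 'Volumes[].{Id:VolumeId,State:State,Instance:Attachments[0].InstanceId,Created:CreateTime}' --output table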

Information

~ $ docker-diagnose
OK hostname=ip-172-31-2-55-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
OK hostname=ip-172-31-38-198-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
OK hostname=ip-172-31-16-88-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
OK hostname=ip-172-31-28-72-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
OK hostname=ip-172-31-33-252-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
OK hostname=ip-172-31-13-102-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
OK hostname=ip-172-31-33-37-ec2-internal session=1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3
Done requesting diagnostics.
Your diagnostics session ID is 1539031114-23xu0GqNnL2nNIfbzxPaPsFZ9C5GxrB3

~ $ docker version
Client:
 Version:       17.12.0-ce
 API version:   1.35
 Go version:    go1.9.2
 Git commit:    c97c6d6
 Built:         Wed Dec 27 20:05:03 2017
 OS/Arch:       linux/amd64

Server:
 Engine:
  Version:      17.12.0-ce
  API version:  1.35 (minimum version 1.12)
  Go version:   go1.9.2
  Git commit:   c97c6d6
  Built:        Wed Dec 27 20:12:30 2017
  OS/Arch:      linux/amd64
  Experimental: true

It may also be relevant to this issue that our swarm appears to be suffering from the problem where we cannot receive docker events in the swarm, described here: https://github.com/moby/moby/issues/36834

Steps to reproduce the behavior

As this is our production swarm, and this issue may result in losing a data volume, I am very reluctant to reproduce it in this swarm. Until now, Cloudstor has behaved as expected in both our staging and production swarms.

I have used the workaround from this issue: https://github.com/docker/for-aws/issues/122 to take manual backup snapshots, which so far have avoided deletion, but I'd like to take some time to be certain that these snapshots will stick around and be restorable in case this issue happens to any more of our volumes.
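
To be clear about what I mean by a manual backup snapshot: these are ordinary EC2 snapshots taken and tagged by hand with the plain AWS CLI, nothing cloudstor-specific, roughly like this (the volume and snapshot IDs and the tag key are placeholders):

~ $ aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "manual backup of jenkins-data"
~ $ aws ec2 create-tags --resources snap-0123456789abcdef0 --tags Key=KeepBackup,Value=true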

Happy to provide more info as needed; I am not sure where to find the swarm system logs that would help deduce what exactly went wrong here.
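
If it is useful for whoever picks this up, these are the first places I intend to look for volume-driver activity (the volume name is a placeholder, and the journalctl line assumes the engine runs under systemd, which may not be the case on Docker for AWS hosts):

~ $ docker plugin ls
~ $ docker plugin inspect cloudstor:aws
~ $ docker volume inspect jenkins-data
~ $ journalctl -u docker | grep -i cloudstor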