docker-archive / for-aws


CloudFormation CIDR Block must change #17

Open hutchic opened 7 years ago

hutchic commented 7 years ago

moved from https://github.com/docker/docker/issues/31612

Description

Tried upgrading a Docker for AWS 1.13.0 CloudFormation stack to 17.03.0 CE Stable using https://editions-us-east-1.s3.amazonaws.com/aws/stable/Docker.tmpl and got the error message:

CIDR Block must change if Availability Zone is changed and VPC ID is not changed

when CloudFormation tries to update PubSubnetAz3. This is all in AWS us-east.
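
For anyone hitting the same thing, a quick way to see exactly which resources and errors CloudFormation reported during the failed update is to pull the stack events. A minimal boto3 sketch; the stack name is a placeholder for your Docker for AWS stack:

import boto3

cf = boto3.client("cloudformation", region_name="us-east-1")

# Print the failed resources and the reasons CloudFormation gives for them.
events = cf.describe_stack_events(StackName="docker-for-aws")["StackEvents"]
for event in events:
    if event["ResourceStatus"].endswith("FAILED"):
        print(event["LogicalResourceId"], event.get("ResourceStatusReason", ""))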

kencochrane commented 7 years ago

@hutchic which version of Docker for AWS were you using before you attempted the upgrade to the latest stable version?

hutchic commented 7 years ago

The swarm was created using Docker for AWS 1.13.0, so it is running 1.13.0:

➜  ~ docker info
Containers: 9
 Running: 9
 Paused: 0
 Stopped: 0
Images: 33
Server Version: 1.13.0
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: awslogs
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
Swarm: active
 NodeID: 76hqn8lf6trpoy60erdhtgku1
 Is Manager: true
 ClusterID: mmngj1f0qiwkjdfqfclq7f2tn
 Managers: 3
 Nodes: 8
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 172.31.44.228
 Manager Addresses:
  172.31.10.251:2377
  172.31.17.99:2377
  172.31.44.228:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 03e5862ec0d8d3b3f750e19fca3ee367e13c090e
runc version: 2f7393a47307a16f8cee44a37b262e8b81021e3e
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.4-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 31.38 GiB
Name: ip-172-31-44-228.ec2.internal
ID: INBQ:JVCX:4NQT:ALWV:3QKH:LMX5:NXWU:DFMM:TVHQ:R2QJ:TNIM:XUDU
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 413
 Goroutines: 788
 System Time: 2017-03-07T21:26:38.306442043Z
 EventsListeners: 4
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Locally I'm using Docker version 17.03.0-ce, build 60ccb22 (although I suspect that's not relevant).

kencochrane commented 7 years ago

@hutchic thank you. Were there any other error messages in CloudFormation, or was that the only one? My guess is that the order of AZs returned by CloudFormation changed, and this caused the problem. I'll have to see if there is a way to make them consistent.

hutchic commented 7 years ago

Same error message for a second subnet. The full sequence can be seen here.

kencochrane commented 7 years ago

@hutchic thanks for sending that along. I assume you are on us-east-1? You mentioned us-east earlier, but there are two us-east regions now with the Ohio region, so I just want to make sure I have the correct one.

If you look at the EC2 Dashboard for us-east-1 (https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#), what do you see for availability zones under 'Availability Zone Status:'?

Do you see us-east-1a, us-east-1b, us-east-1c, us-east-1d, and us-east-1e? If not, what do you see?

Also, which AZs is your current stack running in now? You can see this on the Instances page of the EC2 dashboard.
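
(The same information is available from the API if that is easier; a minimal boto3 sketch, assuming credentials for the account in question, with the VPC ID as a placeholder:)

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Availability Zones this account currently sees in us-east-1.
zones = ec2.describe_availability_zones()["AvailabilityZones"]
print([z["ZoneName"] for z in zones])

# AZ and CIDR of each subnet in the stack's VPC (vpc-xxxxxxxx is a placeholder).
subnets = ec2.describe_subnets(
    Filters=[{"Name": "vpc-id", "Values": ["vpc-xxxxxxxx"]}]
)["Subnets"]
print([(s["CidrBlock"], s["AvailabilityZone"]) for s in subnets])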

hutchic commented 7 years ago
Service Status:

US East (N. Virginia):
This service is operating normally
Availability Zone Status:
us-east-1a:
Availability zone is operating normally
us-east-1b:
Availability zone is operating normally
us-east-1c:
Availability zone is operating normally
us-east-1d:
Availability zone is operating normally
us-east-1e:
Availability zone is operating normally

Currently the stack is running in us-east-1a, us-east-1c, and us-east-1d.

kencochrane commented 7 years ago

OK, thanks, here is my guess for what happened.

When you originally created the stack, you had access to zones a, c, and d, so it created the nodes in those AZs:

Zone 1: a
Zone 2: c
Zone 3: d

Now that you have upgraded, it has access to zones a, b, c, d, and e, so it is trying to use zones a, b, and c. This conflicts with the zones a, c, and d that were used before:

Zone 1: a -> a (that is fine)
Zone 2: c -> b (that one is fine, since we didn't use b before)
Zone 3: d -> c (this causes the issue, because we used c before and the CIDR block for c is the same as the one for d, and they need to be different)

This brings up an issue with how we are building our subnets: if the availability zone list changes, it can cause problems during upgrade.
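
To make the failure mode concrete, here is a small sketch of the collision (not taken from the template itself: the CIDRs are made up, the subnet names follow the PubSubnetAz pattern from the error, and the indexes are the default AZ0/AZ1/AZ2 mapping values; the template presumably selects each subnet's AZ by index into the region's AZ list):

# Each public subnet keeps a fixed CIDR but gets its AZ by position in the
# region's AZ list, so a change in that list moves subnets between AZs.
SUBNET_CIDRS = {
    "PubSubnetAz1": "172.31.0.0/20",   # hypothetical CIDRs
    "PubSubnetAz2": "172.31.16.0/20",
    "PubSubnetAz3": "172.31.32.0/20",
}
AZ_INDEXES = [0, 1, 2]  # default AZ0/AZ1/AZ2 mapping values from the template

def placement(az_list):
    return {
        name: (az_list[idx], cidr)
        for (name, cidr), idx in zip(SUBNET_CIDRS.items(), AZ_INDEXES)
    }

before = placement(["us-east-1a", "us-east-1c", "us-east-1d"])   # at create time
after = placement(["us-east-1a", "us-east-1b", "us-east-1c",
                   "us-east-1d", "us-east-1e"])                  # at upgrade time

for name in SUBNET_CIDRS:
    if before[name][0] != after[name][0]:
        # CloudFormation refuses to move a subnet to a new AZ while it keeps
        # the same CIDR in the same VPC -- the error reported above.
        print(f"{name}: {before[name][0]} -> {after[name][0]}, CIDR {before[name][1]} unchanged")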

One way around this would be to allow a user to specify their own AZ list as a parameter; if something like this happens, they can use that parameter to pick the correct Availability Zones.

Open to other ideas as well.

Your fix

To get around your issue, we can try this.

Open up the stable template found here: https://editions-us-east-1.s3.amazonaws.com/aws/stable/Docker.tmpl and search for this block:

"us-east-1": {
                "AZ0": "0",
                "AZ1": "1",
                "AZ2": "2",
                "EFSSupport": "no",
                "Name": "N. Virgina",
                "NumAZs": "4"
            },

Change it to this:

"us-east-1": {
                "AZ0": "0",
                "AZ1": "2",
                "AZ2": "3",
                "EFSSupport": "no",
                "Name": "N. Virgina",
                "NumAZs": "4"
            },

Changing AZ1 to '2' and AZ2 to '3' tells the template to use Availability Zones a, c, and d, matching the ones you had originally.
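
Continuing the earlier sketch, the edited mapping just picks different positions out of the current five-zone list, which lands the subnets back on their original AZs:

# With the edited mapping (AZ0 = "0", AZ1 = "2", AZ2 = "3"), the indexes
# select a, c, and d out of the five zones the account now sees.
azs = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d", "us-east-1e"]
print([azs[int(i)] for i in ("0", "2", "3")])
# -> ['us-east-1a', 'us-east-1c', 'us-east-1d'], the original placement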

hutchic commented 7 years ago

Can confirm the fix worked. I think I'll leave this open until the underlying issue is resolved(?)

kencochrane commented 7 years ago

@hutchic good to know, thanks for confirming. Yes, let's keep it open until we figure out the best way forward.

kencochrane commented 7 years ago

Putting this here for future Ken before I forget: another idea I had was to store the list of AZs that we use on the first install, and then reference that list during upgrades, so we always use the same AZs each time. We could store it in DynamoDB and get the values via a Lambda function.
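
A rough sketch of that idea (the table name, key, and helper names are made up for illustration; a Lambda-backed custom resource would call something like this):

import boto3

# Hypothetical DynamoDB table keyed by stack ID.
table = boto3.resource("dynamodb").Table("docker-for-aws-az-list")

def save_azs(stack_id, az_list):
    # On first install: record which AZs the stack was created in.
    table.put_item(Item={"StackId": stack_id, "AZs": az_list})

def get_azs(stack_id):
    # On upgrade: return the recorded list so the same AZs are used again.
    item = table.get_item(Key={"StackId": stack_id}).get("Item")
    return item["AZs"] if item else None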

composer22 commented 7 years ago

#27 is a different issue. It references a fresh install over a given VPC; the subnet IP addresses are hard-coded in both the community and enterprise templates.

tobiasmcnulty commented 7 years ago

Hi all (fancy meeting you here @kencochrane!) -- I got hit by this today as well, though in a different template. Wondering if you found a workable solution?

kencochrane commented 7 years ago

@tobiasmcnulty hey, how have you been? I haven't found one yet, but I have a meeting with the CloudFormation team today; I'll bring it up to see if they have any suggestions.

tobiasmcnulty commented 7 years ago

Good, thanks! FWIW I ended up adding "primary" and "secondary" AZs as parameters; that seemed like the only way I could see to keep them static. Curious to hear if you find another route.
