buildkite / elastic-ci-stack-for-aws

An auto-scaling cluster of build agents running in your own AWS VPC
https://buildkite.com/docs/quickstart/elastic-ci-stack-aws
MIT License
418 stars 272 forks source link

Support a variable number of AvailabilityZones #700

Open jmchuster opened 4 years ago

jmchuster commented 4 years ago

If i try to set the value for AvailabilityZones to a single zone, say us-west-2a, it fails with

There was an error creating this change set
Template error: Fn::Select cannot select nonexistent value at index 1

I only need a single availability zone here, because the auto-scale group will try rebalancing instances across zones, killing my long-running jobs (24 hours +).

sj26 commented 4 years ago

Oh, weird, sorry! Sounds like a bug. Looks like we presume two availability zones in which to create subnets:

https://github.com/buildkite/elastic-ci-stack-for-aws/blob/a89a3f5c95dafd94030feaf93c9150987345a272/templates/aws-stack.yml#L546-L574

What happens if you specify us-west-2a,us-west-2a? Is that a reasonable workaround?

Another possible workaround, could you suspend the AZRebalance scaling process? This wouldn't stick around between updates, though.

Otherwise, sounds like a bug to be fixed in the CloudFormation, somehow.

lox commented 4 years ago

Otherwise, sounds like a bug to be fixed in the CloudFormation, somehow.

Unfortunately it can't 😢 There is no way to vary CloudFormation templates based on the length of input. It's extremely frustrating.

sj26 commented 4 years ago

@lox can we list a few and make them conditional on the presence of the az?

lox commented 4 years ago

This is how I approached it in an experiment @sj26:

https://github.com/buildkite/elastic-ci-stack-for-aws-ecs/blob/2cf72edd0ba9e40335f58b48320ea94d94e3c464/templates/vpc/template.yaml#L9 https://github.com/buildkite/elastic-ci-stack-for-aws-ecs/blob/2cf72edd0ba9e40335f58b48320ea94d94e3c464/templates/vpc/template.yaml#L56-L66

I've been petitioning AWS for Fn::Size for 3+ years 😢

sj26 commented 4 years ago

... oh my

sj26 commented 4 years ago

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/intrinsic-function-reference-select.html ... you should be certain that the index you choose is valid ... How?!

sj26 commented 4 years ago

I think I'm going too deep https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/intrinsic-function-reference-transform.html https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/template-macros.html#template-macros-lambda-interface

... I think we'll raise a ticket with AWS and ask them how to do it, which might at least add to "petitioning AWS for Fn::Size"

jmchuster commented 4 years ago

Tried setting the availability zones to us-west-2a,us-west-2a, the subnet update failed with

CIDR Block must change if Availability Zone is changed and VPC ID is not changed
lox commented 4 years ago

Sounds like the underlying issue is the rebalancing across AZs, correct? @sj26 i believe there was some internal discussion around using a lambda to update that setting on the ALB? It’s not exposed via Cloudformation.

sj26 commented 4 years ago

It sounds the answer is "use a lambda". AWS said:

Having said that, In order to iterate over a list of "AvailabilityZones", and produce a resource for each element in the list, You can use a Lambda function to create a macro to transform your template based on the list of "AvailabilityZones" provided by your customer (preferred), or build your template dynamically - there are several libraries that can help you do this, like troposphere and the AWS CDK.

Or use a lambda to do as @lox suggests, to suspend AZRebalance.

sj26 commented 4 years ago

@jmchuster ah, that sounds like an error from an update. Could you try destroying the stack entirely and re-creating it? I think that workaround is the easiest and most promising.

sj26 commented 4 years ago

Another alternative workaround would be to run a custom version of the stack, editing the YAML to remove the second subnet.

sj26 commented 4 years ago

If none of the workarounds works then someone industrious might like to create a cloudformation macro for our stack like this article:

https://medium.com/pablo-perez/dynamic-resource-generation-of-resources-using-cloudformation-macros-f6baba75d730

It could look at the list of AvailabilityZones and create an appropriate number of appropriately configured Subnet resources.

I'm hesitant to do this because of the added complexity and maintenance burden, but it might be our only option to do it "properly".

sj26 commented 4 years ago

Another suggested workaround from AWS:

As a workaround, you could write your template to support a maximum number of Availability zones (2 in this case), and then have each Availability zone passed in as a separate parameter. If a customer wants to use only 1 Availability zone, you could instruct the customer to leave the second one blank. Then, you could use conditional statements to make the second Availability zone optional, so that it's only created if a second Availability zone is passed in.

However, this would result in a pretty ugly template with redundant information. And does not allow for more Availability zones than you have set as the maximum, so if a customer wants to use 3 Availability zones and your template only allows for 2, the 3rd one would not be created.

I don't love it, but I don't hate it either.

lox commented 4 years ago

I think I prefer the approach I took in that linked CFN template once you mix in things like public vs private subnets and NAT Gateway redundancy.

sj26 commented 4 years ago

Yeah I think I do too, @lox.

Not sure when we'll get to this, but we should roll something like that in.

Meanwhile, @jmchuster, did you try re-creating your stack with the same az twice? Does it work?

jmchuster commented 4 years ago

So i ended up just modifying the template to have a param to choose 1 or 2 AvailabilityZones.

  AvailabilityZoneCount:
    Type: Number
    Description: Optional - Number of subnets that are created from AZs (chosen from !GetAZs or AvailabilityZones parameter if provided)
    Default: 2
    MinValue: 1
    MaxValue: 2

so added the conditions

    CreateSubnet0:
      !And
        - { Condition: CreateVpcResources }
        - !Or
          - !Equals [ 1, !Ref AvailabilityZoneCount ]
          - !Equals [ 2, !Ref AvailabilityZoneCount ]

    CreateSubnet1:
      !And
        - { Condition: CreateVpcResources }
        - !Equals [ 2, !Ref AvailabilityZoneCount ]

swapped them out as the Condition for Subnet[01] and Subnet[01]Routes

and then for the AgentAutoScaleGroup

      VPCZoneIdentifier: !If
        - CreateVpcResources
        - !If
          - CreateSubnet1
          - [ !Ref Subnet0, !Ref Subnet1 ]
          - [ !Ref Subnet0 ]
        - !Ref Subnets

So, presumably, you could add options for 3 or 4 AZs as well and keep on nesting !Ifs following the same pattern, up to AWS's practical limit of 5.

I had looked into maybe setting the SuspendProcesses based on a param, but it looks like AWS's intention is that it be used only as a temporary state, rather than as a property of the auto-scaling group, seeing as they only provide it under UpdatePolicy as something you set and then automatically unset as part of an update. I'm sure there's an argument to be made that this thinking makes sense for Terminate but less so for AZRebalance.

yob commented 3 years ago

Supporting different AZ (and public/private) configurations is still something we're keen to support at some point. In the interim, the new v5.0.0 release disables AZRebalancing which should address at least one of the reasons why single AZ deployments are sometimes desirable.

yob commented 3 years ago

Another note here: I believe it might be possible to run in a single AZ (or 3+) by creating the VPC and subnets manually (or via terraform, or your preferred AWS tools), then passing the subnet(s) IDs into the elastic stack via the Subnets stack parameter.

keithduncan commented 3 years ago

I found an AWS VPC template that takes the same approach as https://github.com/buildkite/elastic-ci-stack-for-aws/issues/700#issuecomment-679705989 and exposes a NumberOfAZs parameter along with a plethora of conditions.

keithduncan commented 3 years ago

Weird as that approach is, I’ve warmed to it as a pragmatic answer to the problem that doesn’t involve using Lambdas to rewrite the template live 🙈

I reckon support for 2, 3, and 4 AZs to bring parity with that Quick Start would be good.

A nice customer facing benefit in supporting more AZs is to broaden the number of spot pools that are available to the Auto Scaling group. More pools, better chance of spot allocation, less chance of spot interruption.