ThoughtWorksStudios / eb_deployer

AWS Elastic Beanstalk blue-green deployment automation from ThoughtWorks Mingle Team
MIT License

Red/Black deployment strategy #44

Closed mefellows closed 9 years ago

mefellows commented 9 years ago

This feature introduces a new strategy, which is really a fresh take on the Blue/Green approach. In short, in many cases it's ideal to remove the inactive stack post-deployment, generally to keep costs down.

This strategy follows the Blue/Green approach, but after a successful migration to the new 'red' environment it marks the old one as 'black' and tears it down.

It should be noted that I considered modifying the existing Blue-Green strategy, but felt it was best to keep them separate for the following reasons:

  1. Existing users should not be confused by the new strategy
  2. I didn't want to complicate the behaviour of the pattern further by introducing more knobs/levers
  3. We may want to modify the Red/Black pattern to do other things going forward, such as introducing a wait period/signal before terminating the inactive environment and so on.

What are your thoughts?
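To make the proposal concrete, here is a sketch of how the new strategy might be selected in an eb_deployer config file. This is purely hypothetical: `red-black` is the name proposed in this PR, not a shipped option, and the surrounding field names are illustrative placeholders.

```yaml
# Hypothetical sketch only: selecting the proposed Red/Black strategy,
# which would terminate the inactive ("black") environment after a
# successful cutover. 'myapp' is a placeholder application name.
application: myapp
environments:
  production:
    strategy: red-black   # proposed in this PR; not an existing eb_deployer option
```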

wpc commented 9 years ago

Hi @mefellows,

Many thanks for the contribution. There are broadly two flavours of blue-green implementation on AWS: switching routing at the ELB (e.g. Asgard) or at the DNS record (e.g. our eb_deployer). Both have pros and cons. The DNS-switching approach is safer for in-flight requests and less intrusive to the application's own infrastructure, but it has the downside that it is never theoretically safe to kill the inactive environment, because of DNS caching. So what I generally do in production is let the inactive environment scale down to one instance using the "inactive_settings" config. This both helps us cope with DNS caching and provides a safety net against a bad version.

Our production settings:

    option_settings:
      - namespace: aws:autoscaling:asg
        option_name: MinSize
        value: "5"
    inactive_settings:
      - namespace: aws:autoscaling:asg
        option_name: MinSize
        value: "1"

Similarly, if you are confident that traffic to the inactive environment will die out within a specific period of time, you can use the following settings to gradually scale idle instances down to zero to save cost.

    option_settings:
      # provide minimal redundancy
      - namespace: aws:autoscaling:asg
        option_name: MinSize
        value: "2"
      # make sure cooldown is reset back to default when the environment becomes active again
      - namespace: aws:autoscaling:asg
        option_name: Cooldown
        value: "360"
    inactive_settings:
      # reduce instance count to 0 to save cost
      - namespace: aws:autoscaling:asg
        option_name: MinSize
        value: "0"
      # make sure cooldown is big enough to cope with DNS caching
      - namespace: aws:autoscaling:asg
        option_name: Cooldown
        value: "900"

The above configuration waits at least 15 minutes (900 seconds) before killing the last instance in the inactive environment, which in most cases is safe. The scale-down also happens outside the deployment process, so deployment time will not increase even if you use a long Cooldown buffer.

You can also play with the aws:autoscaling:scheduledaction options to schedule a scale-down-to-zero action after a specific time. I haven't tried it, but it should be straightforward.
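The scheduled-action idea might look something like the following. This is untried (as noted above) and hedged: Elastic Beanstalk's aws:autoscaling:scheduledaction namespace requires a resource name identifying the action, and whether eb_deployer forwards a `resource_name` field through to the API unchanged is an assumption here; `ScaleToZero` is an arbitrary example name.

```yaml
# Hypothetical, untested sketch: schedule the inactive environment
# down to zero instances on a recurring schedule.
inactive_settings:
  - namespace: aws:autoscaling:scheduledaction
    resource_name: ScaleToZero   # action name; pass-through is an assumption
    option_name: MinSize
    value: "0"
  - namespace: aws:autoscaling:scheduledaction
    resource_name: ScaleToZero
    option_name: Recurrence
    value: "0 * * * *"           # example: hourly
```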

I will not merge this pull request, because there is already configuration to achieve the same goal. We could definitely do a better job of documenting "inactive_settings"; sorry about that. But please keep the contributions coming.

-- wpc

mefellows commented 9 years ago

Thanks @wpc. This makes a lot of sense; I was unaware it could be achieved with configuration alone. I knew about the issues with DNS and was planning to implement some form of draining/timeout, but this will do nicely.

I'll test this out today!

betarelease commented 9 years ago

@wpc

This is an awesome write-up. Can you please blog it? Call it 'Saving costs while being resilient on AWS' or something.


mefellows commented 9 years ago

Sadly, Beanstalk doesn't let you set the limit to 0 :( (see http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/command-options.html#command-options-general-autoscalingasg)

I'll toy with some other options, but otherwise perhaps we need to look at alternatives such as the above PR.

wpc commented 9 years ago

Hi, @mefellows:

Yes, I know the documentation says that, but you actually can set it to 0. Elastic Beanstalk used to have that constraint, but it changed recently (maybe they do listen to our feature requests :-)).

-- wpc

mefellows commented 9 years ago

Sorry for the delay in getting back to you @wpc, I have been traveling for the past 5 weeks.

Unfortunately, at least in the ap-southeast-2 region, I am unable to set that value to 0; when I do, I get a configuration error back from the AWS API.

i.e.

    inactive_settings:
      # reduce instance count to 0 to save cost
      - namespace: aws:autoscaling:asg
        option_name: MinSize
        value: "0"

This is not allowed, but a value of 1 or greater is. I will test this again today and provide more information. I will also contact our TAM to see if this feature request might be incorporated. In the likely event it won't be, would you support this or a similar PR for this type of deployment?

mefellows commented 9 years ago

Here is the error message returned from the API when running eb_deploy with the above suggestion (run with the --debug flag enabled). Confusingly, the error concerns batch size, even though that option is completely omitted from my configuration and so should be set to a default:

I, [2015-08-03T14:28:43.002834 #45291]  INFO -- : [Aws::ElasticBeanstalk::Client 400 1.881137 0 retries] update_environment(environment_id:"e-rjubczzsim",option_settings:[{"namespace"=>"aws:autoscaling:asg","option_name"=>"MinSize","value"=>"0"},{"namespace"=>"aws:autoscaling:asg","option_name"=>"Cooldown","value"=>"60"}]) Aws::ElasticBeanstalk::Errors::ConfigurationValidationException Configuration validation exception: Invalid option value: '0' (Namespace: 'aws:autoscaling:updatepolicy:rollingupdate', OptionName: 'MaxBatchSize'): Value is less than minimum allowed value: 1

/opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/aws-sdk-core-2.1.1/lib/seahorse/client/plugins/raise_response_errors.rb:15:in `call': Configuration validation exception: Invalid option value: '0' (Namespace: 'aws:autoscaling:updatepolicy:rollingupdate', OptionName: 'MaxBatchSize'): Value is less than minimum allowed value: 1 (Aws::ElasticBeanstalk::Errors::ConfigurationValidationException)
    from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/aws-sdk-core-2.1.1/lib/aws-sdk-core/plugins/param_converter.rb:21:in `call'
    from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/aws-sdk-core-2.1.1/lib/seahorse/client/plugins/response_target.rb:18:in `call'
    from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/aws-sdk-core-2.1.1/lib/seahorse/client/request.rb:70:in `send_request'
    from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/aws-sdk-core-2.1.1/lib/seahorse/client/base.rb:207:in `block (2 levels) in define_operation_methods'
    from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer/aws_driver/beanstalk.rb:25:in `update_environment_settings'
    from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer/throttling_handling.rb:13:in `block in method_missing'
    from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer/utils.rb:13:in `backoff'
    from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer/throttling_handling.rb:12:in `method_missing'
    from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer/eb_environment.rb:37:in `block in apply_settings'
    from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer/eb_environment.rb:120:in `with_polling_events'
    from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer/eb_environment.rb:36:in `apply_settings'
    from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer/deployment_strategy/blue_green.rb:29:in `deploy'
    from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer/default_component.rb:16:in `deploy'
    from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer/environment.rb:32:in `block in deploy'
    from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer/environment.rb:31:in `each'
    from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer/environment.rb:31:in `deploy'
    from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer.rb:214:in `deploy'
    from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer.rb:257:in `cli'
    from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/bin/eb_deploy:11:in `<top (required)>'
    from /opt/boxen/rbenv/versions/2.1.2/bin/eb_deploy:23:in `load'
    from /opt/boxen/rbenv/versions/2.1.2/bin/eb_deploy:23:in `<main>'

mefellows commented 9 years ago

Never mind the above error; I need to dig further into this, as it appears the rolling update settings can't be used in conjunction with this approach. That said, with this approach you are still left with a 'red' environment and a load balancer that you are paying for. It would still be preferable to remove the old environment altogether. What are your thoughts?
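The MaxBatchSize error above suggests the incompatibility could be caught client-side before calling the API. The following is a hypothetical pre-flight check, not part of eb_deployer; the namespace strings are taken from the error message, and the exact rule (MinSize 0 plus any rolling-update option) is an inference from this thread.

```ruby
# Hypothetical pre-flight check (not eb_deployer's code): rolling-update
# batch sizes must be >= 1, so requesting MinSize "0" together with
# rolling update settings is rejected by the API, as seen in the
# ConfigurationValidationException above.
def validate_scale_to_zero!(option_settings)
  min_size = option_settings.find do |s|
    s['namespace'] == 'aws:autoscaling:asg' && s['option_name'] == 'MinSize'
  end
  rolling = option_settings.any? do |s|
    s['namespace'].to_s.start_with?('aws:autoscaling:updatepolicy:rollingupdate')
  end
  if min_size && min_size['value'].to_i.zero? && rolling
    raise ArgumentError,
          'MinSize 0 cannot be combined with rolling update settings ' \
          '(MaxBatchSize must be >= 1)'
  end
  option_settings
end
```

MinSize 0 on its own passes the check, consistent with it being accepted in regions where the old constraint has been lifted.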

wpc commented 9 years ago

Hi, @mefellows:

I have only tried MinSize=0 in the us-east-1 region. I may try other regions to verify once I get some time.

I think the leftover ELB is the minimal cost we need to pay to keep blue-green deployment relatively safe. But I may be too conservative. So if you feel strongly about this, please change the pull request to make it a blue-green deployment configuration option instead of a separate deployment strategy, e.g. a top-level configuration field like "blue-green-terminate-inactive". Then I will be happy to merge it in.

mefellows commented 9 years ago

Thanks @wpc, I'll look at this today and submit an updated PR.

mefellows commented 9 years ago

Closing in favour of #52