Closed mefellows closed 9 years ago
hi @mefellows:
Many thanks for the contribution. There are mainly two flavors of blue-green deployment on AWS: switching routing at the ELB (e.g. Asgard) or at the DNS record (e.g. our eb_deployer). Both have pros and cons. The DNS-switching approach is safer for in-flight requests and less intrusive to the application's own infrastructure, but it has the downside that it is never theoretically safe to kill the inactive environment, because of DNS caching. So what I generally do in production is let the inactive environment scale down to 1 instance using the "inactive_settings" config. This helps us cope with DNS caching and also gives us a safety net against a bad version.
Our production settings:
option_settings:
  - namespace: aws:autoscaling:asg
    option_name: MinSize
    value: "5"
inactive_settings:
  - namespace: aws:autoscaling:asg
    option_name: MinSize
    value: "1"
Similarly, if you are confident that traffic to the inactive environment will die out within a specific period of time, you can use the following settings to gradually scale idle instances down to 0 to save cost.
option_settings:
  # provide minimal redundancy
  - namespace: aws:autoscaling:asg
    option_name: MinSize
    value: "2"
  # make sure cooldown is reset back to the default when the environment becomes active again
  - namespace: aws:autoscaling:asg
    option_name: Cooldown
    value: "360"
inactive_settings:
  # reduce instance count to 0 to save cost
  - namespace: aws:autoscaling:asg
    option_name: MinSize
    value: "0"
  # make sure cooldown is large enough to cope with DNS caching
  - namespace: aws:autoscaling:asg
    option_name: Cooldown
    value: "900"
The above configuration will wait at least 15 minutes (900 seconds) before killing the last instance in the inactive environment, which is safe in most cases. Also, the scale-down process is decoupled from the deployment process, so deployment time will not increase even if you have a long Cooldown buffer.
You can also play with the aws:autoscaling:scheduledaction options to schedule a scale-down-to-zero action at a specific time. I haven't tried it, but it should be straightforward.
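As a rough, untested sketch of that idea: the option names below come from the aws:autoscaling:scheduledaction namespace, but the resource_name key (which names the scheduled action) and whether eb_deployer passes it through to the API are assumptions, so treat this as a starting point rather than a known-working config.

```yaml
inactive_settings:
  # hypothetical: scheduled action that scales the inactive environment
  # to zero every night at 02:00; "ScaleDownWhenIdle" is an arbitrary name
  - namespace: aws:autoscaling:scheduledaction
    resource_name: ScaleDownWhenIdle
    option_name: MinSize
    value: "0"
  - namespace: aws:autoscaling:scheduledaction
    resource_name: ScaleDownWhenIdle
    option_name: MaxSize
    value: "0"
  - namespace: aws:autoscaling:scheduledaction
    resource_name: ScaleDownWhenIdle
    option_name: Recurrence
    value: "0 2 * * *"   # cron expression, UTC
```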
I will not merge this pull request, because there is already configuration to achieve the same goal. We could definitely do a better job documenting "inactive_settings" - sorry about that. But please keep contributing.
-- wpc
Thanks WPC. This makes a lot of sense, I was unaware it could be achieved with configuration alone. I was aware of the issues with DNS, and was planning to implement some form of draining/timeout implementation - but this will do nicely.
I'll test this out today!
PC
This is an awesome write-up. Can you please blog it? Call it 'Saving costs while being resilient on AWS' or something.
Sadly, Beanstalk doesn't let you set the limit to 0 :( (see http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/command-options.html#command-options-general-autoscalingasg).
I'll toy with some other options, but otherwise perhaps we might need to look at alternatives such as the above PR.
Hi, @mefellows:
Yes, I know the documentation says that, but you actually can set it to 0. Elastic Beanstalk used to have that constraint, but it changed recently (maybe they do listen to our feature requests :-)).
-- wpc
Sorry for the delay in getting back to you @wpc, I have been traveling for the past 5 weeks.
Unfortunately, at least in the ap-southeast-2 region, I am unable to set that value to 0 - when I do, I get a configuration error back from the AWS API.
i.e.
inactive_settings:
  # reduce instance count to 0 to save cost
  - namespace: aws:autoscaling:asg
    option_name: MinSize
    value: "0"
is not allowed, but a value of 1 or greater is. I will test this again today and provide more information. I will also contact our TAM to see whether this feature request might be incorporated. In the likely event that it won't be, would you support this or a similar PR for this type of deployment?
Here is the error message returned from the API when running eb_deploy with the above suggestion (run with the --debug flag enabled). Confusingly, the error seems to be about batch size, even though that option is completely omitted from my configuration and so should be set to a default:
I, [2015-08-03T14:28:43.002834 #45291] INFO -- : [Aws::ElasticBeanstalk::Client 400 1.881137 0 retries] update_environment(environment_id:"e-rjubczzsim",option_settings:[{"namespace"=>"aws:autoscaling:asg","option_name"=>"MinSize","value"=>"0"},{"namespace"=>"aws:autoscaling:asg","option_name"=>"Cooldown","value"=>"60"}]) Aws::ElasticBeanstalk::Errors::ConfigurationValidationException Configuration validation exception: Invalid option value: '0' (Namespace: 'aws:autoscaling:updatepolicy:rollingupdate', OptionName: 'MaxBatchSize'): Value is less than minimum allowed value: 1
/opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/aws-sdk-core-2.1.1/lib/seahorse/client/plugins/raise_response_errors.rb:15:in `call': Configuration validation exception: Invalid option value: '0' (Namespace: 'aws:autoscaling:updatepolicy:rollingupdate', OptionName: 'MaxBatchSize'): Value is less than minimum allowed value: 1 (Aws::ElasticBeanstalk::Errors::ConfigurationValidationException)
from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/aws-sdk-core-2.1.1/lib/aws-sdk-core/plugins/param_converter.rb:21:in `call'
from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/aws-sdk-core-2.1.1/lib/seahorse/client/plugins/response_target.rb:18:in `call'
from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/aws-sdk-core-2.1.1/lib/seahorse/client/request.rb:70:in `send_request'
from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/aws-sdk-core-2.1.1/lib/seahorse/client/base.rb:207:in `block (2 levels) in define_operation_methods'
from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer/aws_driver/beanstalk.rb:25:in `update_environment_settings'
from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer/throttling_handling.rb:13:in `block in method_missing'
from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer/utils.rb:13:in `backoff'
from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer/throttling_handling.rb:12:in `method_missing'
from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer/eb_environment.rb:37:in `block in apply_settings'
from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer/eb_environment.rb:120:in `with_polling_events'
from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer/eb_environment.rb:36:in `apply_settings'
from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer/deployment_strategy/blue_green.rb:29:in `deploy'
from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer/default_component.rb:16:in `deploy'
from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer/environment.rb:32:in `block in deploy'
from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer/environment.rb:31:in `each'
from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer/environment.rb:31:in `deploy'
from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer.rb:214:in `deploy'
from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/lib/eb_deployer.rb:257:in `cli'
from /opt/rubies/2.1.2/lib/ruby/gems/2.1.0/gems/eb_deployer-0.6.0.beta5/bin/eb_deploy:11:in `<top (required)>'
from /opt/boxen/rbenv/versions/2.1.2/bin/eb_deploy:23:in `load'
from /opt/boxen/rbenv/versions/2.1.2/bin/eb_deploy:23:in `<main>'
Never mind the above error - I need to dig further into this, as it appears the rolling update settings can't be used in conjunction with this approach. That said, with this approach you are still left with a 'RED' environment and a load balancer that you are paying for. It would still be preferable to remove the old environment altogether. What are your thoughts?
Hi, @mefellows:
I have only tried MinSize=0 in the us-east-1 region. I may try other regions to verify once I get some time.
I think the leftover ELB is the minimal cost we need to pay to keep blue-green deployment relatively safe. But I may be too conservative. So if you feel a strong need for this, please change the pull request to make it a blue-green deployment configuration option instead of a separate deployment strategy, e.g. a top-level configuration field like "blue-green-terminate-inactive". Then I will be happy to merge it.
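To make the suggestion concrete, such a flag might look like the sketch below in an eb_deployer config file. The blue_green_terminate_inactive field is purely hypothetical (it comes from the suggestion above, not from an existing eb_deployer option), and the application name is a placeholder.

```yaml
# hypothetical eb_deployer configuration sketch
application: myapp                      # placeholder application name
common:
  strategy: blue-green
  blue_green_terminate_inactive: true   # proposed flag: tear down the inactive
                                        # stack after a successful cutover
  option_settings:
    - namespace: aws:autoscaling:asg
      option_name: MinSize
      value: "2"
```

When the flag is false or absent, behaviour would fall back to today's blue-green strategy, where the inactive environment is kept and merely scaled down via inactive_settings.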
Thanks @wpc, I'll look at this today and submit an updated PR.
Closing in favour of #52
This feature introduces a new strategy, which is really a fresh take on the blue-green approach. In short, in many cases it's ideal to remove the inactive stack post-deployment, generally to keep costs down.
This strategy follows the blue-green approach, but tears down the stack after a successful migration to the new 'Red' environment, marking the old one as 'black' and terminating it.
It should be noted that I considered modifying the existing blue-green strategy, but felt it was best to keep this separate for the following reasons:
What are your thoughts?