eksctl-io / eksctl

The official CLI for Amazon EKS
https://eksctl.io

Add support for ASG Multi Instance Types & Purchase Options #320

Closed peerjako-aws closed 5 years ago

peerjako-aws commented 6 years ago

AWS recently released EC2 Auto Scaling Groups with multiple instance types & Purchase options: https://aws.amazon.com/blogs/aws/new-ec2-auto-scaling-groups-with-multiple-instance-types-purchase-options/

It would be great if these new options could be configured on the worker node autoscaling group through the eksctl tool.

This would make it easy to get started with Spot Instances on an EKS cluster.

errordeveloper commented 6 years ago

Thanks, this is interesting! Would this be only about spot instances, or are there other major use cases? If so, how would this be different from having multiple nodegroups with different instance types in each?

peerjako-aws commented 6 years ago

Multiple node groups were the "old" way of doing this; however, there was a lot of complexity in getting that to work well. This new release is a major simplification from a user perspective, so that's why having direct support for it would be great.

It's not just a spot-instance thing, but spot-instance scenarios are where you will get a huge benefit out of this new feature. I think it will greatly increase how many customers dare to use spot instances for their workloads, because it becomes so easy to set up in a highly reliable fashion.

errordeveloper commented 6 years ago

I see. So should we stop working on support for multiple nodegroups? Or would nodegroups still be needed to use separate subnets, SGs, different AMIs, disable/enable SSH, and separate tags?


peerjako-aws commented 6 years ago

You would still need multiple nodegroups for all those things.

errordeveloper commented 6 years ago

Thanks! Do you have any suggestions on what the flags might look like for 'eksctl create cluster'?


peerjako-aws commented 6 years ago

Maybe the aws cli can be used for inspiration with the "--mixed-instances-policy" flag: https://docs.aws.amazon.com/cli/latest/reference/autoscaling/create-auto-scaling-group.html

errordeveloper commented 6 years ago

Thanks! I will look into it once I get back to work, on holidays for another two weeks :)


mumoshu commented 6 years ago

I think this succeeds #199 and #198, right?

mumoshu commented 6 years ago

Also - I'd rather suggest allowing users to override the whole ASG definition, perhaps by passing something like --autoscaling-group-overrides JSON, so that you won't be overwhelmed by requests to expose every option possible in cfn.
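
To make the override idea concrete, here is a minimal sketch of how user-supplied JSON could be deep-merged over a generated ASG definition. It is only an illustration: `--autoscaling-group-overrides` is a suggestion from this thread rather than an existing eksctl flag, and `merge_overrides`, `default_asg` and the sample property values are hypothetical.

```python
import copy
import json


def merge_overrides(base: dict, overrides: dict) -> dict:
    """Return a copy of `base` with `overrides` applied recursively.

    Nested dicts are merged key by key; any other value replaces the
    original outright. Neither input is modified.
    """
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical defaults a tool might generate for the CloudFormation resource.
default_asg = {
    "Type": "AWS::AutoScaling::AutoScalingGroup",
    "Properties": {"MinSize": "1", "MaxSize": "3", "DesiredCapacity": "2"},
}

# Hypothetical JSON passed by the user on the command line.
user_overrides = json.loads('{"Properties": {"MaxSize": "10", "Cooldown": "300"}}')

asg = merge_overrides(default_asg, user_overrides)
print(json.dumps(asg, indent=2))
```

With a merge like this, options the tool never exposed (here `Cooldown`) pass straight through, while the generated defaults stay in place unless explicitly overridden.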

I did suffer as such while maintaining kube-aws.

From another POV, users like me would also be happy because they don't need to write go and build a binary themselves every time eksctl misses certain things customizable at cfn-level.

WDYT?

errordeveloper commented 5 years ago

With config file support we can enable this as an advanced feature that requires use of the config file.

schottsfired commented 5 years ago

For what it's worth, it's possible to make this work by using the "Copy to Launch Template" button in the AWS console against the eksctl-provided Launch Configuration. Once that's done, reference the new Launch Template in the eksctl-provided AutoScaling Group.

On another note, I spent far too much time trying to decode the AMI user data while doing this 😉

errordeveloper commented 5 years ago

@schottsfired user-data is gzip+base64 encoded; it's not meant for users to look at. You can, of course, decode it, as we cannot stop you from using AWS in the way you like... However, it's not possible for us to support clusters that have ad-hoc changes made outside CloudFormation.
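
For anyone who does want to inspect it, a minimal sketch of reversing a gzip+base64 encoding with the Python standard library follows. The `decode_user_data` helper and the sample payload are made up for illustration; real user data would be copied from the launch configuration.

```python
import base64
import gzip


def decode_user_data(encoded: str) -> str:
    """Undo base64 first, then gzip, and return the text inside."""
    compressed = base64.b64decode(encoded)
    return gzip.decompress(compressed).decode("utf-8")


# Build a made-up sample the same way (gzip, then base64) to show the round trip.
sample = base64.b64encode(gzip.compress(b"#!/bin/bash\necho hello\n")).decode("ascii")

print(decode_user_data(sample))
```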

errordeveloper commented 5 years ago

So it sounds like this feature is particularly useful for spot instances; I've been told it will make it somewhat easier to use them. More details would help us prioritize this, especially hearing about actual use-cases that people have.

yhvh commented 5 years ago

Actual use-case:

We would like to use spot instances for workloads that are not required to be HA, with the goal of lowering costs.

dcherman commented 5 years ago

FWIW, I believe the usage of multiple instance types w/ cluster-autoscaler is somewhat unsupported right now since CA expects all nodes in a given ASG to be of the same shape. If you include instance types that are notably either larger/smaller than one another, then it's likely that CA will not make correct/optimal scaling decisions (today).

https://github.com/kubernetes/autoscaler/issues/838 https://github.com/kubernetes/autoscaler/pull/1473 https://github.com/kubernetes/autoscaler/issues/1519

You can hack around it by providing different instance types that are very close in shape (someone mentioned using instances with similar properties like m5a.large, m5.large, m5d.large, m4.large). I'm not sure whether that's a better idea than using different nodegroups with a single instance type each and getting the least-cost expander in CA to work with AWS, since it doesn't look like that expander is implemented for AWS today:

https://github.com/kubernetes/autoscaler/blob/52e2cf4e46a2415307c6a12bf2ea878e38f552c2/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go#L103-L105
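
As a rough sketch of the "very close in shape" workaround, the check below compares candidate instance types by vCPU count and memory before putting them in one ASG. The `SHAPES` table is hard-coded for illustration and should be verified against current EC2 documentation; `same_shape` is a hypothetical helper, not part of eksctl or cluster-autoscaler.

```python
from typing import Dict, Iterable, Tuple

# (vCPU, memory in GiB) per instance type. Hard-coded illustrative values;
# verify against the EC2 documentation before relying on them.
SHAPES: Dict[str, Tuple[int, float]] = {
    "m5a.large": (2, 8.0),
    "m5.large": (2, 8.0),
    "m5d.large": (2, 8.0),
    "m4.large": (2, 8.0),
    "m5.xlarge": (4, 16.0),
}


def same_shape(instance_types: Iterable[str]) -> bool:
    """True when every listed type has identical vCPU and memory."""
    shapes = {SHAPES[t] for t in instance_types}
    return len(shapes) == 1


similar = ["m5a.large", "m5.large", "m5d.large", "m4.large"]
print(same_shape(similar))                  # all four share one shape
print(same_shape(similar + ["m5.xlarge"]))  # the xlarge breaks homogeneity
```

A check along these lines only covers CPU and memory; other differences between types (local disks, network performance) are ignored here.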

errordeveloper commented 5 years ago

Thanks a lot for the insight @dcherman, this is very helpful indeed!

ranshn commented 5 years ago

@errordeveloper

> So it sounds like this feature is particularly useful for spot instances, I've been told it will make it somewhat easier to use the spot instances. More details would be helpful for us to priorities this, especially hearing about actual use-cases that people have.

When using Spot Instances, it is best practice to use multiple capacity pools in a clustered or distributed workload. A capacity pool is a combination of instance type + availability zone. So if users create a new ASG today and specify three instance types (e.g. c4.large, m4.large, r4.large) in three availability zones, the ASG will fulfill capacity from these 9 capacity pools. If that ASG has Spot Instances in it, working with 9 capacity pools increases the chances of getting the desired Spot capacity. It also decreases the impact on the application when EC2 reclaims capacity in one of the pools (a Spot interruption), because only a smaller portion of the capacity is interrupted and the ASG replenishes it from the other pools.
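
The arithmetic above can be sketched in a few lines. The availability zone names are assumptions chosen for illustration, and the "share lost" figure assumes capacity is spread evenly across pools, which a real ASG does not guarantee.

```python
from itertools import product

instance_types = ["c4.large", "m4.large", "r4.large"]
# Assumed zone names, for illustration only.
availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

# A capacity pool is one (instance type, availability zone) pair.
pools = list(product(instance_types, availability_zones))
print(len(pools))

# If capacity were spread evenly, losing one pool would interrupt this share.
share_lost = 1 / len(pools)
print(f"{share_lost:.0%} of capacity per interrupted pool")
```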

re: use-cases - users run any stateless and fault-tolerant applications on Spot, many of them running 100% on Spot. For example, Skyscanner talked about putting their production workloads on Spot with Kubernetes in a meetup lately: https://www.meetup.com/skyscanner/events/258492358/ Zalando has also been talking about running Kubernetes on Spot Instances: http://aws-de-marketing.s3-eu-central-1.amazonaws.com/Field%20Marketing/Community-Days/AWS-Community-Day-2018_Container_Using%20Spot%20Instances%20with%20Kubernetes.pdf

Basically any stateless service in the cluster can be placed on Spot Instances, and if customers diversify their instance types properly, there's no reason not to run production applications with high SLA requirements on them. Customers have been doing this at very large scale outside of Kubernetes for years, and we're seeing increased adoption of K8s + Spot lately.

The caveat is that instance size diversification + cluster-autoscaler doesn't fit today, as already mentioned in this thread, because CA assumes that a node group is homogeneous. This solution will still benefit users with clusters that don't use CA, or who, as mentioned, just diversify between instance types with similar hardware characteristics. Customers have been doing this outside of containers with multiple ASGs for years, when they put their web workloads behind ELBs and needed to decrease performance variability between the different instance types, but still diversify in order to get and keep Spot capacity.

errordeveloper commented 5 years ago

@ranshn thanks for the info on capacity pools. I understand the use cases with regard to workloads, but what I am not exactly clear on is what would be a simple set of parameters to offer the user that would give a lot of value with minimum effort on their part. It sounds like there are a lot of ways to go about it; is there a simplified thing that we can enable for the most typical use-case? I am happy to expose advanced parameters, but I would like to understand whether basic parameters are still of great use. Let's say we were to add 2 or 3 new flags in 'eksctl create cluster', what would those be?

ranshn commented 5 years ago

@errordeveloper that's a hard call. It might be OK to just configure the Launch Template, instance types, number of Spot pools, and OD percentage (to actually configure 100% Spot, for example). Both OnDemandAllocationStrategy and SpotAllocationStrategy only have one possible value anyway.
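
As a sketch of what those four inputs map to, the helper below builds the `MixedInstancesPolicy` structure accepted by the Auto Scaling `CreateAutoScalingGroup` API. The function name, the template name `my-nodegroup-template`, and the choice of `$Latest` as the version are assumptions for illustration; this is not how eksctl implements it.

```python
import json
from typing import Dict, List


def mixed_instances_policy(
    launch_template_name: str,
    instance_types: List[str],
    spot_pools: int,
    on_demand_percentage: int,
) -> Dict:
    """Build a MixedInstancesPolicy dict from the four basic inputs."""
    if not 0 <= on_demand_percentage <= 100:
        raise ValueError("on_demand_percentage must be between 0 and 100")
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",  # assumed; a pinned version also works
            },
            # One override entry per instance type to diversify across.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            # 0 means everything above the base capacity runs on Spot.
            "OnDemandPercentageAboveBaseCapacity": on_demand_percentage,
            "SpotInstancePools": spot_pools,
        },
    }


policy = mixed_instances_policy(
    "my-nodegroup-template",  # hypothetical launch template name
    ["m5.large", "m5a.large", "m4.large"],
    spot_pools=3,
    on_demand_percentage=0,
)
print(json.dumps(policy, indent=2))
```

The resulting dict could be passed as the `MixedInstancesPolicy` argument of an SDK call or serialized to JSON for the AWS CLI flag mentioned earlier in the thread.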

Terraform, for example, did add support for all the parameters: https://www.terraform.io/docs/providers/aws/r/autoscaling_group.html#mixed_instances_policy-instances_distribution

kops seems to have added some support (not released yet), but I'm not sure what will actually be released: https://github.com/kubernetes/kops/blob/master/docs/instance_groups.md#creating-a-instance-group-of-mixed-instances-types-aws-only

I think @mumoshu and @peerjako-aws had a great suggestion in just letting the user provide the configuration file for the ASG. Here's a quick example of such a JSON file for the AWS CLI: https://github.com/awslabs/ec2-spot-workshops/blob/master/workshops/running-amazon-ec2-workloads-at-scale/asg.json