Fix instance type string list for node groups

dspace-group / simphera-reference-architecture-aws

In order to deploy SIMPHERA to AWS, various cloud resources, such as a Kubernetes cluster, PostgreSQL database server, etc., need to be created. This repository contains a reference architecture for these AWS resources. You can use this Terraform configuration as a starting point to create these resources in your own AWS account.

MIT License


Fix instance type string list for node groups #137

Closed lukabudak closed 5 months ago

lukabudak commented 5 months ago

The instance type list is now fixed to selected instance types that are compatible with k8s and the workload (e.g. CPU type, CPU and memory requests; 16 cores, 64 GiB). The list is ordered so that the most cost-efficient EC2 instance type is selected first.

lukabudak commented 5 months ago

> If the first type in the list is not available, does it use the next one, or does it throw an error during terraform apply?

Terraform raises an error (InvalidParameterException) when an instance type is unavailable, even though other instance types in the list are available. I will try to avoid this exception.

schwichti commented 5 months ago

> The instance type list is now fixed to selected instance types that are compatible with k8s and the workload (e.g. CPU type, CPU and memory requests; 16 cores, 64 GiB). The list is ordered so that the most cost-efficient EC2 instance type is selected first.

How do you know that a given instance type is "compatible with k8s and workload"? Did you test every single instance type or did you do this by specification? If you did this by inspecting the specification, what are the exact criteria for selection?

schwichti commented 5 months ago

Some background information: var.linuxExecutionNodeSize etc. are lists of instance types. They exist to fulfill the rule "Auto Scaling groups should use multiple instance types in multiple Availability Zones".

schwichti commented 5 months ago

Please consider the allocation strategies: https://docs.aws.amazon.com/autoscaling/ec2/userguide/create-mixed-instances-group-manual-instance-type-selection.html. I believe the default setting is "Prioritized", i.e., "Request On-Demand Instances based on the priority order of instance types that you set [...]". I remember once configuring the instance types [m5a.xlarge, m5a.2xlarge] and m5a.2xlarge was chosen, which I found counter-intuitive (and unnecessarily costly). I was not sure whether this was because the last item in the list gets the highest priority or because the AWS allocation strategy did some magic behind the scenes (e.g. m5a.xlarge was not available at the time, or m5a.2xlarge was the better fit for the workload).

lukabudak commented 5 months ago

> The instance type list is now fixed to selected instance types that are compatible with k8s and the workload (e.g. CPU type, CPU and memory requests; 16 cores, 64 GiB). The list is ordered so that the most cost-efficient EC2 instance type is selected first.

> How do you know that a given instance type is "compatible with k8s and workload"? Did you test every single instance type or did you do this by specification? If you did this by inspecting the specification, what are the exact criteria for selection?

I've tested each instance type by deploying the infrastructure and SIMPHERA. The EC2 instance type specification I was aiming for is:

64 GB RAM
16 vCPUs
at least 110 IPs
at least 2 availability zones.
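As a rough illustration of that specification check, the sketch below filters candidate instance types against the four criteria. The `InstanceSpec` record, the `meets_spec` helper, and the sample entries are hypothetical placeholders, not real AWS data. Reading the 64 GB and 16 vCPU figures as exact matches and the IP and availability zone figures as minimums is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceSpec:
    """Hypothetical record describing one candidate instance type."""
    name: str
    memory_gib: int
    vcpus: int
    max_ips: int
    availability_zones: int


def meets_spec(spec: InstanceSpec) -> bool:
    """Return True if the candidate satisfies all four criteria listed above."""
    return (
        spec.memory_gib == 64
        and spec.vcpus == 16
        and spec.max_ips >= 110
        and spec.availability_zones >= 2
    )


# Placeholder values for illustration only; not taken from AWS documentation.
candidates = [
    InstanceSpec("example-a", memory_gib=64, vcpus=16, max_ips=234, availability_zones=3),
    InstanceSpec("example-b", memory_gib=32, vcpus=8, max_ips=58, availability_zones=3),
    InstanceSpec("example-c", memory_gib=64, vcpus=16, max_ips=234, availability_zones=1),
]

# Keep only the names of candidates that pass every criterion.
compatible = [c.name for c in candidates if meets_spec(c)]
print(compatible)
```

A check like this only narrows the list by specification; the deployment test described above is still what confirms that a type actually works.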

lukabudak commented 5 months ago

> Please consider the allocation strategies: https://docs.aws.amazon.com/autoscaling/ec2/userguide/create-mixed-instances-group-manual-instance-type-selection.html. I believe the default setting is "Prioritized", i.e., "Request On-Demand Instances based on the priority order of instance types that you set [...]". I remember once configuring the instance types [m5a.xlarge, m5a.2xlarge] and m5a.2xlarge was chosen, which I found counter-intuitive (and unnecessarily costly). I was not sure whether this was because the last item in the list gets the highest priority or because the AWS allocation strategy did some magic behind the scenes (e.g. m5a.xlarge was not available at the time, or m5a.2xlarge was the better fit for the workload).

Yes, I agree that this is rather odd behaviour. In my tests, though, the first instance type in the list was always the one selected and deployed. I'll investigate AWS Auto Scaling further because I'm not entirely sure how it works.

lukabudak commented 5 months ago

I've checked terraform-aws-eks-blueprints and you were right, @schwichti. The managed node group capacity type is set to On-Demand, with the allocation strategy set to prioritized.

As a result, it will always choose the instance type based on the order of the list (from first to last element).
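The prioritized behaviour described here can be modelled very roughly as "take the first type in the list that currently has capacity". The sketch below is a simplified mental model, not the actual AWS allocation logic; `pick_prioritized` and the `available` set are hypothetical names introduced for illustration.

```python
from typing import Iterable, Optional, Set


def pick_prioritized(instance_types: Iterable[str], available: Set[str]) -> Optional[str]:
    """Return the first instance type, in list order, that is currently available.

    Simplified model of a "prioritized" strategy: earlier list entries win.
    Returns None when no listed type is available.
    """
    for instance_type in instance_types:
        if instance_type in available:
            return instance_type
    return None


ordered_types = ["m5a.xlarge", "m5a.2xlarge"]

# Both types have capacity: the first list entry is chosen.
first_choice = pick_prioritized(ordered_types, {"m5a.xlarge", "m5a.2xlarge"})

# The first entry has no capacity: the model falls back to the next one.
fallback_choice = pick_prioritized(ordered_types, {"m5a.2xlarge"})

print(first_choice, fallback_choice)
```

Under this model, getting m5a.2xlarge from the list [m5a.xlarge, m5a.2xlarge] would be consistent with m5a.xlarge having no capacity at that moment, which is one of the explanations raised earlier in the thread.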

schwichti commented 5 months ago

> I've checked terraform-aws-eks-blueprints and you were right, @schwichti. The managed node group capacity type is set to On-Demand, with the allocation strategy set to prioritized.

> So it will always pick the instance type to use based on the order of instance types in the list (from first to last element).

OK, if you can confirm that the first item in the array gets the highest priority, I am fine with it. I believe that when I tried it, I was the victim of some AWS magic that overrode my priorities.