eksctl-io / eksctl

The official CLI for Amazon EKS
https://eksctl.io

[Feature] Support instance types that are only available in a single zone #5933

Open dewjam opened 1 year ago

dewjam commented 1 year ago

What feature/behavior/change do you want?

If an instance type is only available in a single availability zone in a given region, then randomly select the remaining AZs to meet the minimum AZ count requirement for an EKS cluster.

Why do you want this feature?

Trn1 instance types are only available in a single zone in us-east-1 and us-west-2. As a result, when launching a cluster with a Trn1 node group via eksctl, an error is returned:

Error: getting availability zones: only 1 zones discovered [us-west-2d], at least 2 are required

In order to create a cluster successfully, you must discover the zone where Trn1 instance types are supported and then provide at least one additional zone (in the example below, I added us-west-2b). This results in a clusterConfig that looks like this:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: trainium
  region: us-west-2
availabilityZones:
  - us-west-2b
  - us-west-2d
nodeGroups:
  - name: trainium
    instanceType: trn1.2xlarge
    availabilityZones:
      - us-west-2d
    minSize: 1
    maxSize: 2
    desiredCapacity: 2
    volumeSize: 40

I would like eksctl to handle this case gracefully and just randomly select the remaining AZs (for the cluster only) until the minimum number of AZs is met.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been stalled for 5 days with no activity.

TiberiuGC commented 1 year ago

We need to reconsider how we do AZ selection on cluster / nodegroup creation in order to support the scenario above. At the moment we only select zones that support all required instance types, which is what triggers this error. Instead, we could require only that each required instance type is supported in at least one of the selected AZs.

TiberiuGC commented 1 year ago

Looks like this also affects our trainium integration test, in which we need to manually provide the AZs. Let's remove that bit as part of this ticket.

https://github.com/weaveworks/eksctl/blob/8d3e078aa65a55d608c99be37f8f9c3c83c32e24/integration/tests/trainium/trainium_test.go#L211

Himangini commented 1 year ago

This is an interesting feature. We'll do a spike first to understand how to best support this and what's involved in delivering this.

Spike:
Timebox: 1-2 days
Outcome: Come up with a proposal, documented here, for implementation.

TiberiuGC commented 1 year ago

Intro

The current AZ selection / validation behaviour is slightly inconsistent depending on whether the AZs are user defined or automatically selected by eksctl. In the former scenario, we validate the user-defined selection by checking that each required instance type is available in at least one of the AZs. However, when automatically discovering AZs, we only select those that support all required instance types. This latter approach is much more restrictive: for instance types that are supported in a single AZ, we cannot select at least 2 AZs and so we return an error. What we want instead is to select the one AZ that supports the instance type in question, together with any other AZs.
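The two behaviours can be contrasted with a minimal Go sketch. The function names and the data shape are hypothetical and are not taken from the eksctl codebase; the trn1.2xlarge placement follows the report above, while the m5.large placement is assumed purely for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// contains reports whether list includes item.
func contains(list []string, item string) bool {
	for _, v := range list {
		if v == item {
			return true
		}
	}
	return false
}

// eachTypeInSomeZone mirrors the check described for user-defined AZs:
// every required instance type must be offered in at least one of the zones.
func eachTypeInSomeZone(offerings map[string][]string, zones, required []string) bool {
	for _, it := range required {
		found := false
		for _, z := range zones {
			if contains(offerings[z], it) {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

// zonesWithAllTypes mirrors the stricter automatic discovery:
// only zones that offer every required instance type are kept.
func zonesWithAllTypes(offerings map[string][]string, required []string) []string {
	var zones []string
	for z, offered := range offerings {
		ok := true
		for _, it := range required {
			if !contains(offered, it) {
				ok = false
				break
			}
		}
		if ok {
			zones = append(zones, z)
		}
	}
	sort.Strings(zones) // map iteration order is random; sort for stable output
	return zones
}

func main() {
	// Assumed example data, not real AWS offerings.
	offerings := map[string][]string{
		"us-west-2a": {"m5.large"},
		"us-west-2b": {"m5.large"},
		"us-west-2d": {"m5.large", "trn1.2xlarge"},
	}
	required := []string{"m5.large", "trn1.2xlarge"}

	// Strict discovery leaves a single zone, which is below the minimum of 2.
	fmt.Println(zonesWithAllTypes(offerings, required))
	// The user-defined selection from the report passes the looser check.
	fmt.Println(eachTypeInSomeZone(offerings, []string{"us-west-2b", "us-west-2d"}, required))
}
```

With this data the strict filter keeps only us-west-2d, whereas the pair us-west-2b and us-west-2d satisfies the looser check, which matches the workaround in the original report.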


Goal of the new algorithm for automatic AZ selection


Taking the above into account, we need an algorithm that randomly selects a subset of the available AZs such that the union of the instance type offerings of those AZs includes the set of required instance types.

Algorithm proposal



Say that in the selected AWS region there are N availability zones, out of which we want to randomly select K for our cluster (as per current implementation K needs to be >= 2, ideally 3). For the sake of exemplifying the algorithm step by step, let’s assume that N=4 and K=3.

  1. For each AZ, determine the instance type offerings and store the results in a variable of the form map[AZ][]InstanceType e.g.

    instanceTypeOfferings = map[string][]string{
        "AZ1": {"instanceType1"},
        "AZ2": {"instanceType2", "instanceType3"},
        "AZ3": {"instanceType1", "instanceType4"},
        "AZ4": {"instanceType1", "instanceType2", "instanceType3"},
    }
  2. Generate all combinations of N choose K and store the results in a list, e.g.

    nCk = [][]string{
        {"AZ1", "AZ2", "AZ3"},
        {"AZ1", "AZ2", "AZ4"},
        {"AZ1", "AZ3", "AZ4"},
        {"AZ2", "AZ3", "AZ4"},
    }
  3. Randomly shuffle the list created at step 2. This ensures that the zones are picked randomly, e.g.

    nCkRandomized = [][]string{
        {"AZ1", "AZ2", "AZ4"},
        {"AZ1", "AZ3", "AZ4"},
        {"AZ1", "AZ2", "AZ3"},
        {"AZ2", "AZ3", "AZ4"},
    }
  4. Iterate through the randomly shuffled list and, for each element, generate the union of instance type offerings, e.g.

    offerings[{"AZ1", "AZ2", "AZ4"}] = {"instanceType1", "instanceType2", "instanceType3"}
  5. Check whether the union of offerings contains all required instance types and, if so, return the AZ selection. Otherwise continue searching. If no combination of AZs supports all instance types, return an error.
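The five steps above can be sketched end to end in Go. This is a minimal illustration under stated assumptions, not the eksctl implementation: `combinations`, `coversAll` and `selectZones` are hypothetical names, the offerings map is the example from step 1, and a seed parameter is used only to make the shuffle reproducible.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"sort"
)

// combinations returns every k-element subset of items (step 2).
func combinations(items []string, k int) [][]string {
	var result [][]string
	var build func(start int, current []string)
	build = func(start int, current []string) {
		if len(current) == k {
			// Copy, because current's backing array is reused by later branches.
			result = append(result, append([]string(nil), current...))
			return
		}
		for i := start; i < len(items); i++ {
			build(i+1, append(current, items[i]))
		}
	}
	build(0, nil)
	return result
}

// coversAll reports whether the union of offerings across zones
// includes every required instance type (steps 4 and 5).
func coversAll(offerings map[string][]string, zones, required []string) bool {
	union := map[string]bool{}
	for _, z := range zones {
		for _, it := range offerings[z] {
			union[it] = true
		}
	}
	for _, r := range required {
		if !union[r] {
			return false
		}
	}
	return true
}

// selectZones picks k zones at random such that together they offer
// all required instance types, or returns an error if no subset does.
func selectZones(offerings map[string][]string, required []string, k int, seed int64) ([]string, error) {
	zones := make([]string, 0, len(offerings))
	for z := range offerings {
		zones = append(zones, z)
	}
	sort.Strings(zones) // stable starting order before shuffling

	candidates := combinations(zones, k)
	rng := rand.New(rand.NewSource(seed))
	rng.Shuffle(len(candidates), func(i, j int) { // step 3
		candidates[i], candidates[j] = candidates[j], candidates[i]
	})
	for _, c := range candidates {
		if coversAll(offerings, c, required) {
			return c, nil
		}
	}
	return nil, errors.New("no combination of zones offers all required instance types")
}

func main() {
	instanceTypeOfferings := map[string][]string{ // step 1, example data
		"AZ1": {"instanceType1"},
		"AZ2": {"instanceType2", "instanceType3"},
		"AZ3": {"instanceType1", "instanceType4"},
		"AZ4": {"instanceType1", "instanceType2", "instanceType3"},
	}
	zones, err := selectZones(instanceTypeOfferings, []string{"instanceType2", "instanceType4"}, 3, 42)
	fmt.Println(zones, err)
}
```

Because instanceType4 is offered only in AZ3 in this example, any selection returned for a requirement that includes it must contain AZ3; the remaining zones vary with the seed.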

Time and space complexity

The asymptotic time and space complexity of this approach are dictated by generating nCk, which is O(n ^ min(k,n-k)). However, N and K are going to be small, single-digit numbers, so in practice this can be treated as constant time.
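To make the size of nCk concrete, the number of combinations can be computed directly. The helper below is a hypothetical illustration, not part of the proposal; it uses the standard multiplicative formula for the binomial coefficient.

```go
package main

import "fmt"

// binomial returns the number of k-element subsets of an n-element set.
func binomial(n, k int) int {
	if k < 0 || k > n {
		return 0
	}
	if k > n-k {
		k = n - k // symmetry: C(n, k) == C(n, n-k)
	}
	result := 1
	for i := 1; i <= k; i++ {
		// After each step result equals C(n-k+i, i), so the division is exact.
		result = result * (n - k + i) / i
	}
	return result
}

func main() {
	for _, n := range []int{4, 6, 9} {
		fmt.Printf("N=%d K=3 -> %d combinations\n", n, binomial(n, 3))
	}
}
```

For the worked example (N=4, K=3) this gives 4 combinations, matching the list in step 2, and with K=3 the count stays below 100 for any single-digit N.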

TiberiuGC commented 9 months ago

Combinations generator snippet https://gist.github.com/TiberiuGC/0d8e5035793e1cc7ec8d1068ecab99fc