Open dewjam opened 1 year ago
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.
We need to reconsider how we do AZ selection on cluster / nodegroup creation in order to support the scenario above. At the moment we only select zones that support all required instance types, which is what triggers this scenario. Instead, we could require only that each required instance type is supported in at least one of the selected AZs.
Looks like this also affects our trainium integration test, in which we need to manually provide the AZs. Let's remove that bit as part of this ticket.
This is an interesting feature. We'll do a spike first to understand how to best support this and what's involved in delivering this.
Spike:
Timebox: 1-2 days
Outcome: Come up with a proposal, documented here, for implementation.
The current AZ selection / validation behaviour is slightly inconsistent, depending on whether the AZs are user defined or automatically selected by eksctl. In the former scenario, we validate the user-defined selection by checking that each required instance type is available in at least one of the AZs. However, when automatically discovering AZs, we only select those that support all required instance types. This latter approach is much more restrictive: for instance types that are supported in a single AZ, we cannot select at least 2 AZs and hence return an error. What we want instead is to select the one AZ that supports the instance type in question, together with any other ones.
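To make the difference between the two behaviours concrete, here is a minimal Go sketch. It is not eksctl's actual code: the function names (`zonesSupportingAll`, `eachTypeInSomeZone`, `offers`), the zone names, and the instance type names are all hypothetical and only illustrate the "all types in every zone" check versus the "each type in at least one zone" check.

```go
package main

import (
	"fmt"
	"sort"
)

// offers reports whether the given zone offers the given instance type.
func offers(offerings map[string][]string, zone, instanceType string) bool {
	for _, t := range offerings[zone] {
		if t == instanceType {
			return true
		}
	}
	return false
}

// zonesSupportingAll mirrors the restrictive automatic selection:
// it keeps only the zones that offer every required instance type.
func zonesSupportingAll(offerings map[string][]string, required []string) []string {
	var zones []string
	for zone := range offerings {
		all := true
		for _, r := range required {
			if !offers(offerings, zone, r) {
				all = false
				break
			}
		}
		if all {
			zones = append(zones, zone)
		}
	}
	sort.Strings(zones) // map iteration order is random; sort for stable output
	return zones
}

// eachTypeInSomeZone mirrors the validation of user-defined zones:
// every required instance type must be offered by at least one zone.
func eachTypeInSomeZone(offerings map[string][]string, zones, required []string) bool {
	for _, r := range required {
		found := false
		for _, z := range zones {
			if offers(offerings, z, r) {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical offerings: typeA exists in a single zone only.
	offerings := map[string][]string{
		"AZ1": {"typeA", "typeB"},
		"AZ2": {"typeB"},
		"AZ3": {"typeB"},
	}
	required := []string{"typeA", "typeB"}

	fmt.Println(zonesSupportingAll(offerings, required))                          // [AZ1]
	fmt.Println(eachTypeInSomeZone(offerings, []string{"AZ1", "AZ2"}, required)) // true
}
```

With this data the restrictive check yields a single zone, which is below the minimum of 2, while the user-defined selection `AZ1, AZ2` passes validation.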
Taking the above into account, we need an algorithm that randomly selects a subset of the available AZs such that the union of the instance type offerings of those AZs includes the set of required instance types.
Say that in the selected AWS region there are N availability zones, out of which we want to randomly select K for our cluster (as per current implementation K needs to be >= 2, ideally 3). For the sake of exemplifying the algorithm step by step, let’s assume that N=4 and K=3.
For each AZ, determine the instance type offerings and store the results in a variable of the form map[AZ][]InstanceType
e.g.
instanceTypeOfferings = map[string][]string{
"AZ1": {"instanceType1"},
"AZ2": {"instanceType2", "instanceType3"},
"AZ3": {"instanceType1", "instanceType4"},
"AZ4": {"instanceType1", "instanceType2", "instanceType3"},
}
Generate all combinations of N choose K and store the results in a list, e.g.
nCk = [][]string{
{"AZ1", "AZ2", "AZ3"},
{"AZ1", "AZ2", "AZ4"},
{"AZ1", "AZ3", "AZ4"},
{"AZ2", "AZ3", "AZ4"},
}
Randomly sort the list created in the previous step. This ensures that the zones are picked randomly, e.g.
nCkRandomized = [][]string{
{"AZ1", "AZ2", "AZ4"},
{"AZ1", "AZ3", "AZ4"},
{"AZ1", "AZ2", "AZ3"},
{"AZ2", "AZ3", "AZ4"},
}
Iterate through the randomly sorted list, and for each element, generate the union of instance types offerings e.g.
offerings[{"AZ1", "AZ2", "AZ4"}] = {"instanceType1", "instanceType2", "instanceType3"}
Check if the union of offerings contains all required instance types, and if so, return the AZ selection. Otherwise continue searching. If no combination of AZs supports all instance types, return an error.
The asymptotic time and space complexity of this approach is dictated by generating nCk, which is O(n ^ min(k,n-k)). However, N and K are going to be small, single-digit numbers, so in practice this can be treated as constant time.
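As a quick sanity check on how small the search space is, the number of combinations can be computed directly. The sketch below uses a hypothetical `binomial` helper; the value N=6 is an assumed region size chosen for illustration, not a figure from this issue.

```go
package main

import "fmt"

// binomial returns n choose k using the multiplicative formula.
// Each intermediate value is itself a binomial coefficient, so the
// integer division is always exact.
func binomial(n, k int) int {
	if k < 0 || k > n {
		return 0
	}
	if k > n-k {
		k = n - k // symmetry: C(n, k) == C(n, n-k)
	}
	result := 1
	for i := 1; i <= k; i++ {
		result = result * (n - k + i) / i
	}
	return result
}

func main() {
	fmt.Println(binomial(4, 3)) // 4, the example used in this proposal
	fmt.Println(binomial(6, 3)) // 20, for an assumed region with 6 zones
}
```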
Combinations generator snippet https://gist.github.com/TiberiuGC/0d8e5035793e1cc7ec8d1068ecab99fc
What feature/behavior/change do you want?
If an instance type is only available in a single availability zone in a given region, then randomly select the remaining AZs to meet the minimum AZ count requirement for an EKS cluster.
Why do you want this feature?
Trn1 instance types are only available in a single zone in us-east-1 and us-west-2. As a result, when launching a cluster with a Trn1 node group via eksctl, an error is returned:
In order to create a cluster successfully, you must discover the zone where Trn1 instance types are supported and then provide at least one additional zone (in the example below, I added us-west-2b). This results in a clusterConfig that looks like this:
I would like eksctl to handle this case gracefully and just randomly select the remaining AZs (for the cluster only) until the minimum number of AZs is met.