Azure / karpenter-provider-azure

AKS Karpenter Provider
Apache License 2.0
308 stars 46 forks source link

fix: Updated instanceTypeZones logic to handle non-zonal SKUs in zonal regions #410

Closed rakechill closed 2 weeks ago

rakechill commented 2 weeks ago

Description I'm working with the Kaito team to integrate with Karpenter Azure Provider. They noticed that they were unable to use some GPU skus due to "insufficient capacity", but were able to confirm that the capacity was available and that this wasn't happening when provisioning nodes directly using AKS APIs.

It turns out that this was due to a small bug in retrieving availability zones for SKUs. We properly handle the case for non-zonal regions, but do not properly handle non-zonal offerings within zonal regions. I've updated the condition to handle this scenario.

Including Alex's comments wrt these non-zonal instance types here for future reference:

There are zonal and non-zonal VMs (even in zone-capable regions). This provider (unlike AKS RP) currently does not have good support for non-zonal VMs, we always provision zonal - except in non-zonal regions, where there is no choice. Longer term we may need to revisit this to see if overall support for non-zonal VMs should be improved (as well as maybe support for zone=any).

Apparently some VM SKUs/sizes are not available as zonal offerings, so have to be provisioned as non-zonal. This is what this PR is supposed to fix. (The proposed logic still falls back to non-zonal provisioning only if there are no zonal offerings.)

One still has to be careful about how this is surfaced in the resource API / skewer. The response for each VM size lists zones both under LocationInfo and RestrictionInfo (and some instances are weird, e.g. listing no zones in LocationInfo and all zones in RestrictionInfo). Skewer's AvailabilityZones appears to do the right thing, returning only zones that are "available and unrestricted". SKUs/sizes that are just not available in a location presumably would have location restriction (we filter them out elsewhere), or just won't show up. What is not clear to me, is whether non-zonal offering is always available (as long as SKU/size is in the list); would be good to check - especially since the proposed logic makes this assumption. This will also be important longer term, if we decide to expand support for non-zonal offerings alongside zonal, and not just as fallbacks.

How was this change tested?

Does this change impact docs?

Release Note

NONE
coveralls commented 2 weeks ago

Pull Request Test Coverage Report for Build 9576170731

Details


Totals Coverage Status
Change from base Build 9573708617: 0.0%
Covered Lines: 36287
Relevant Lines: 37116

💛 - Coveralls
coveralls commented 2 weeks ago

Pull Request Test Coverage Report for Build 9576589256

Details


Files with Coverage Reduction New Missed Lines %
pkg/providers/instancetype/instancetypes.go 1 88.37%
pkg/cache/unavailableofferings.go 2 95.45%
<!-- Total: 3 -->
Totals Coverage Status
Change from base Build 9573708617: -0.008%
Covered Lines: 36286
Relevant Lines: 37118

💛 - Coveralls
coveralls commented 2 weeks ago

Pull Request Test Coverage Report for Build 9584893711

Details


Files with Coverage Reduction New Missed Lines %
pkg/providers/instancetype/instancetypes.go 1 88.37%
pkg/cache/unavailableofferings.go 2 95.45%
<!-- Total: 3 -->
Totals Coverage Status
Change from base Build 9573708617: -0.008%
Covered Lines: 36286
Relevant Lines: 37118

💛 - Coveralls
coveralls commented 2 weeks ago

Pull Request Test Coverage Report for Build 9588996632

Details


Files with Coverage Reduction New Missed Lines %
pkg/providers/instancetype/instancetypes.go 1 88.37%
<!-- Total: 1 -->
Totals Coverage Status
Change from base Build 9587074007: -0.002%
Covered Lines: 36288
Relevant Lines: 37118

💛 - Coveralls
coveralls commented 2 weeks ago

Pull Request Test Coverage Report for Build 9589013756

Details


Files with Coverage Reduction New Missed Lines %
pkg/providers/instancetype/instancetypes.go 1 88.37%
<!-- Total: 1 -->
Totals Coverage Status
Change from base Build 9587074007: -0.002%
Covered Lines: 36288
Relevant Lines: 37118

💛 - Coveralls
coveralls commented 2 weeks ago

Pull Request Test Coverage Report for Build 9601903041

Details


Files with Coverage Reduction New Missed Lines %
pkg/providers/instancetype/instancetypes.go 1 88.3%
<!-- Total: 1 -->
Totals Coverage Status
Change from base Build 9587074007: -0.002%
Covered Lines: 36287
Relevant Lines: 37117

💛 - Coveralls