Azure / karpenter-provider-azure

AKS Karpenter Provider
Apache License 2.0
376 stars 57 forks source link

feat: supporting 1.30 bootstrap via enabling out of tree credential provider #370

Closed Bryce-Soghigian closed 4 months ago

Bryce-Soghigian commented 4 months ago

Fixes # NA

Description Bootstrap for 1.30 will fail if we have the --azure-container-registry-config flag specified.

azureuser@aks-general-purpose-78r6l:~$ journalctl -u kubelet.service
---
May 10 08:31:26 aks-general-purpose-78r6l kubelet[10604]: E0510 08:29:13.285779   10604 run.go:74]"command failed" err="failed to parse kubelet flag: unknown flag: --azure-container-registry-config"
May 10 08:31:26 aks-general-purpose-78r6l kubelet[13211]: E0510 08:31:26.037146   13211 run.go:74] "command failed" err="failed to parse kubelet flag: unknown flag: --azure-container-registry-config"
May 10 08:31:26 aks-general-purpose-78r6l systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
May 10 08:31:26 aks-general-purpose-78r6l systemd[1]: kubelet.service: Failed with result 'exit-code'.

By removing this flag we get nodes for k8s version 1.30 to bootstrap successfully.

How was this change tested?

Does this change impact docs?

Release Note

support for 1.30 node bootstrapping with karpenter
Bryce-Soghigian commented 4 months ago

Copying feedback on the PR with messy git history tallaxes left a comment • This particular fix looks good to me, with some suggestions (left comments), and the following caveats:

There are likely more changes related to the switch to out-of-tree providers (such as setting DisableKubeletCloudCredentialProviders feature gate depending on Kubernetes version), would make sense to package and test this all together Need to actually test it on 1.30 - including ACR image pull. (Ideally this would be covered by E2E) Need evidence of E2E tests passing

Bryce-Soghigian commented 4 months ago

There are likely more changes related to the switch to out-of-tree providers (such as setting DisableKubeletCloudCredentialProviders feature gate depending on Kubernetes version), would make sense to package and test this all together

We discussed offline, seems the feature gate is removed and defaulted correctly in the kubelet binaries so no need for change here.

Need to actually test it on 1.30 - including ACR image pull. (Ideally this would be covered by E2E) I ran the e2e test paired with setting the k8s version in the k8s version func in image family. IT fails with the following errors

Warning FailedCreatePodSandBox 2m42s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "<sandbox_id>": plugin type="cilium-cni" failed (add): failed to invoke delegated plugin ADD for IPAM: http request failed: Post "http://<ipam_service_address>": dial tcp <local_ip>:: connect: connection refused; failed to request IP address from CNS
Normal SandboxChanged 98s (x6 over 2m41s) kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulling 44s (x3 over 98s) kubelet Pulling image "<image_registry>/pause:3.6"
Warning Failed 44s (x3 over 88s) kubelet Failed to pull image "<image_registry>/pause:3.6": failed to pull and unpack image "<image_registry>/pause:3.6": failed to resolve reference "<image_registry>/pause:3.6": failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://<image_registry>/oauth2/token?scope=repository%3Apause%3Apull&service=<image_registry>: 401 Unauthorized
Warning Failed 44s (x3 over 88s) kubelet Error: ErrImagePull
Normal BackOff 20s (x4 over 87s) kubelet Back-off pulling image "<image_registry>/pause:3.6"
Warning Failed 20s (x4 over 87s) kubelet Error: ImagePullBackOff

I need to investigate why and if there is more needed to supporting the out of tree providers work.

Need evidence of E2E tests passing

I ran the e2e tests here https://github.com/Azure/karpenter-provider-azure/actions/runs/9228266167. Ideally we could run them against 1.30 but as its not out yet no real way to test on 1.30 for apiserver from our e2e tests. Best we can do is set the kubelet version to 1.30.

coveralls commented 4 months ago

Pull Request Test Coverage Report for Build 9324673471

Details


Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/providers/imagefamily/bootstrap/aksbootstrap.go 7 10 70.0%
<!-- Total: 10 13 76.92% -->
Files with Coverage Reduction New Missed Lines %
pkg/cache/unavailableofferings.go 2 95.45%
<!-- Total: 2 -->
Totals Coverage Status
Change from base Build 9074172591: -0.01%
Covered Lines: 36284
Relevant Lines: 37115

💛 - Coveralls
Bryce-Soghigian commented 4 months ago

Pasting this here for reference: https://github.com/Azure/AgentBaker/pull/4258/files @rakechill for Sov cloud support, we will need to pass in AKS_CUSTOM_CLOUD_CONTAINER_REGISTRY_DNS_SUFFIX="{{- if IsAKSCustomCloud}}{{AKSCustomCloudContainerRegistryDNSSuffix}}{{end}}" added in this pr. If we don't migrate to the contract in time for GA that is.

Bryce-Soghigian commented 4 months ago

I ran the end to end suite for ACR pull locally.

[endgroup]

< Exit [AfterEach] TOP-LEVEL - /Users/sillygoose/dev/focus/karpenter-provider-azure/test/suites/acr/suite_test.go:43 @ 05/30/24 14:39:58.037 (45ms) • [369.389 seconds]

Ran 1 of 1 Specs in 369.598 seconds SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 0 Skipped --- PASS: TestAcr (369.60s) PASS ok github.com/Azure/karpenter-provider-azure/test/suites/acr 370.152s

We pass with nodes that are set to 1.30 for their k8s version. We should be good to get this one in if CI passes here.