aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Tagged helm chart values in `values.yaml` don't match image tag #5415

Open jonathan-innis opened 9 months ago

jonathan-innis commented 9 months ago

Description

Observed Behavior:

Currently, our CD rollout for new karpenter-provider-aws images is triggered based off a git tag being pushed to the upstream repo. This push to the upstream repo results in the following operations happening:

  1. GitHub Release is created
  2. Images and Charts are packaged and pushed to the public ECR
  3. Website changes are generated and PR is created with the changes

The order of this release process means that the changes to the repo are merged after the repo has already been tagged. As a result, the chart stored in the repo at tag v0.32.1 differs from the pushed chart that ends up in the OCI artifact store in our public ECR.

In general, this shouldn't happen. The tag should be our source-of-truth, implying that whatever we tag in the repository has the same representation for the chart and the image with the same tag in the ECR. This is currently true for the image, but is not true for the chart.

I'd like to see us re-work the release process so that what we tag is also what we push. Practically, this means that we need changes to the chart to be created and merged-in first before the actual tag creation happens.

pedroapero commented 8 months ago

Hello! This is problematic because currently, listing chart tags in OCI registries is not supported by helm (https://github.com/helm/helm/issues/11000). Where is the reference list of tags for helm charts and images now that index.html is unmaintained and branch numbers don’t match?

edas-smith commented 8 months ago

Definitely something that would be good to get fixed. Can't upgrade from 0.32+ to 0.33+ :/

stevehipwell commented 8 months ago

I think this is related to https://github.com/aws/karpenter-provider-aws/issues/4248.

noamgreen commented 6 months ago

Why is it so hard to understand your repo?! Branch "release-v0.35.4" is not the same as tag v0.35.4.

Argo CD can't download the Helm chart because the index is missing or not configured:

helm pull --version release-v0.35.4 --repo https://github.com/aws/karpenter charts/karpenter
404 Not Found
helm pull --version v0.34.3 --repo https://github.com/aws/karpenter charts/karpenter
404 Not Found
Error: looks like "https://github.com/aws/karpenter" is not a valid chart repository or cannot be reached: failed to fetch https://github.com/aws/karpenter/index.yaml : 404 Not Found

Helm does not support OCI, and you push to OCI. Do you want us to be able to work with Karpenter?

Thanks

edas-smith commented 5 months ago

Has there been any update on this? It's making it very challenging to upgrade to the newer versions due to this mismatch.

stevehipwell commented 5 months ago

@noamgreen Karpenter is only published via OCI.

@edas-smith the Karpenter Helm chart now uses a correct SemVer.

helm pull oci://public.ecr.aws/karpenter/karpenter --version 0.36.0

paul-civitas commented 4 months ago

As I mentioned in my duplicate ticket, I was able to work around this issue by pointing to branch release-v0.36.2 rather than the tag v0.36.2

stevehipwell commented 3 months ago

@paul-civitas I think there might be some confusion between the Git tag (idiomatically the version with a v prefix) and the OCI tags for the image and Helm chart?

Git tags can't start with a number, thus the idiom of prefixing with a v, while the OCI tag for a Helm chart should be a SemVer version, which can't start with a v. As an OCI image tag is completely open you'll see it take both SemVer and v prefixed forms.

In the case of the Karpenter project all of the OCI tags are SemVer, for consistency, so if you have the Git tag you just need to remove the v prefix.
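
The mapping described above (drop the v prefix from the Git tag to get the OCI tag) can be sketched as a small helper. This is only an illustration: the function names are made up for this example, and the chart location is the one quoted elsewhere in this thread.

```python
import re

# Chart location as quoted in this thread; the helper names are illustrative.
KARPENTER_CHART = "oci://public.ecr.aws/karpenter/karpenter"

# A v-prefixed SemVer Git tag such as "v0.36.2" (optional pre-release/build suffix).
_SEMVER_GIT_TAG = re.compile(r"^v(\d+\.\d+\.\d+(?:[-+].+)?)$")


def git_tag_to_oci_tag(git_tag: str) -> str:
    """Return the OCI tag for a Git tag by stripping the leading 'v'."""
    match = _SEMVER_GIT_TAG.match(git_tag)
    if match is None:
        raise ValueError(f"not a v-prefixed SemVer Git tag: {git_tag!r}")
    return match.group(1)


def helm_pull_command(git_tag: str) -> list[str]:
    """Build the argv for pulling the chart version that matches a Git tag."""
    return ["helm", "pull", KARPENTER_CHART, "--version", git_tag_to_oci_tag(git_tag)]


if __name__ == "__main__":
    print(" ".join(helm_pull_command("v0.36.2")))
```

A branch name such as release-v0.36.2 is rejected on purpose, since only the v-prefixed tag form maps cleanly onto an OCI tag.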

paul-civitas commented 3 months ago

@stevehipwell well, the fundamental issue is that the OCI tag for the Helm chart version with the critical fix does not exist.

Or at least it was not at the time.

stevehipwell commented 3 months ago

> @stevehipwell well, the fundamental issue is that the OCI tag for the Helm chart version with the critical fix does not exist.
>
> Or at least it was not at the time.

@paul-civitas the OCI tags (0.36.2) were created as part of the GH release 20 days ago (see registries for karpenter/controller & karpenter/karpenter).

You can check that they're valid with the following commands. Please note the lack of a v prefix on the tag (as explained in my comment above).

docker pull public.ecr.aws/karpenter/controller:0.36.2
helm pull oci://public.ecr.aws/karpenter/karpenter --version 0.36.2

paul-civitas commented 3 months ago

@stevehipwell sorry I misspoke. The issue is that there was not a git tag that had a helm chart that had a value pointing to any image with this OCI tag.

paul-civitas commented 3 months ago

Here's a simple question, that might show what the problem is.

Question: What git tag do I point Argo CD to, such that I'm using a version of the Helm chart whose values.yaml points to OCI tag 0.36.2 of the OCI image?

paul-civitas commented 3 months ago

> Here's a simple question, that might show what the problem is.
>
> Question: What git tag do I point Argo CD to, such that I'm using a version of the Helm chart whose values.yaml points to OCI tag 0.36.2 of the OCI image?

Following up on this.

Git tag v0.36.2 is the wrong answer, as you can see here: https://github.com/aws/karpenter-provider-aws/blob/v0.36.2/charts/karpenter/values.yaml#L102. It points to OCI tag 0.35.4, which is older than the version I want.

Git tag v0.37.0 (the most recent tag) is the wrong answer, as you can see here: https://github.com/aws/karpenter-provider-aws/blob/v0.37.0/charts/karpenter/values.yaml#L104. It points to OCI tag 0.36.0, which is also older than the version I want.

So what is the right answer?
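
To make the mismatch concrete, here is a minimal sketch of the check I'm effectively doing by hand. Assumptions: the function names are made up, the values.yaml excerpt is a hypothetical stand-in shaped like the linked file rather than a copy of it, and the first `tag:` line is taken to be the controller image tag.

```python
import re
from typing import Optional

# First "tag:" key in the file, with optional quotes around the value.
_TAG_LINE = re.compile(r"""^[ \t]*tag:[ \t]*["']?([^"'\s#]+)""", re.MULTILINE)


def first_image_tag(values_yaml: str) -> Optional[str]:
    """Return the value of the first 'tag:' key, or None if there is none."""
    match = _TAG_LINE.search(values_yaml)
    return match.group(1) if match else None


def chart_matches_git_tag(values_yaml: str, git_tag: str) -> bool:
    """True when the image tag in values.yaml equals the Git tag minus its 'v'."""
    expected = git_tag[1:] if git_tag.startswith("v") else git_tag
    return first_image_tag(values_yaml) == expected


# Hypothetical excerpt; the 0.35.4 value is what the v0.36.2 Git tag pointed to.
SAMPLE_VALUES = """\
controller:
  image:
    repository: public.ecr.aws/karpenter/controller
    tag: 0.35.4
"""

if __name__ == "__main__":
    print(chart_matches_git_tag(SAMPLE_VALUES, "v0.36.2"))
```

A regex is used here only to keep the sketch dependency-free; a real check would parse the YAML properly and read the controller image tag by key.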

stevehipwell commented 3 months ago

@paul-civitas the Git tag just isn't relevant here (it'll be tracking the previous version of the Helm chart as it triggers the release process).

I'm not an Argo CD expert but it sounds like you've not configured it correctly for OCI if you're trying to use a Git tag. Could you share your config?

paul-civitas commented 3 months ago

> the Git tag just isn't relevant here

The git tag is what this issue is about. The git tag tracking the previous version of the Helm chart, rather than the version corresponding to the tag, is unusual behavior that is throwing us off, and it is contrary to how other open source projects work. That is why issues are being created.

> I'm not an Argo CD expert but it sounds like you've not configured it correctly

It's pretty straightforward. You point Argo CD to a git reference and it deploys the Helm chart from git. My expectation was that by pointing Argo CD to the git reference v0.36.2 it would deploy version 0.36.2 of the code. However, it actually deployed version 0.35.4, which was contrary to expectations.

I was able to resolve this issue by pointing ArgoCD to git reference release-v0.36.2 which does deploy version 0.36.2 of the code.

stevehipwell commented 3 months ago

> the Git tag just isn't relevant here
>
> The git tag is what this issue is about. The git tag tracking the previous version of the Helm chart, rather than the version corresponding to the tag, is unusual behavior that is throwing us off, and it is contrary to how other open source projects work. That is why issues are being created.

@paul-civitas the Git tag triggers the Helm chart OCI release, which results in an additional commit to save the current state of the chart after the release; having the chart content be part of the release tag commit isn't expected behaviour (this is idiomatic where the chart and binary share a repo, due to a chicken/egg problem). Furthermore, the Helm chart releases are only provided via the OCI registry (as documented); the HTTP Helm registry was deprecated (I think in the `0.17.0` release), and any behaviour linking a Helm release to a Git release tag was an incidental implementation detail (likely due to non-automated releases being cut on a developer's machine).

> It's pretty straightforward. You point Argo CD to a git reference and it deploys the Helm chart from git. My expectation was that by pointing Argo CD to the git reference v0.36.2 it would deploy version 0.36.2 of the code. However, it actually deployed version 0.35.4, which was contrary to expectations.

I'm going to have to say that it isn't that simple, as you're using it incorrectly in this case. The two documented Helm targets are HTTP registry (first example) and OCI (second example). You can use a Git reference as you're attempting to do, but it's only going to work correctly where supported (and it isn't supported for Karpenter).

> I was able to resolve this issue by pointing ArgoCD to git reference release-v0.36.2 which does deploy version 0.36.2 of the code.

That might work, but it's an implementation detail which could be changed at any point; please configure Argo CD to use the Karpenter OCI registry.

paul-civitas commented 3 months ago

> Please configure Argo CD to use the Karpenter OCI registry.

I know Amazon is trying to grow adoption of their OCI registry, but it currently has serious usability challenges.

This would be a practical suggestion if you were to instead move to a more stable and widely used public OCI registry.

stevehipwell commented 3 months ago

> I know Amazon is trying to grow adoption of their OCI registry, but it currently has serious usability challenges.
>
> This would be a practical suggestion if you were to instead move to a more stable and widely used public OCI registry.

@paul-civitas could you expand on this? I'm not aware of any current issues with ECR public (ECR private has issues with figuring out the correct regional AWS account, but this is more UX than anything else and not what Karpenter uses). Based on the number of EKS Kubernetes clusters in use (I remember a number like 70% being passed around a couple of years ago), I'd say that ECR is probably one of the most widely used public OCI registries.

FYI I don't work for Amazon.

paul-civitas commented 3 months ago

It might be fine but last time we tried to use it they required authentication for public access, and it expired after 24 hours. Which didn't play nice with argoCD, as argo wanted either anonymous access or access with fixed credentials for OCI chart registries.

https://github.com/argoproj/argo-cd/issues/8097

In order to manage this people had to set up a cronjob to rotate a secret every day, and plug that secret into argo. There was discussion on making this less painful but it was closed as "not planned." So it's way way simpler to just point things at git.

stevehipwell commented 3 months ago

@paul-civitas you don't need any auth for public ECR, the Karpenter docs (the ones I linked above) make this clear. FYI the issue link above is for private ECR registries (with the <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com pattern) which AWS use for some of their OSS components but not Karpenter.
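
As a rough illustration of the distinction, the two kinds of ECR host can be told apart from the reference alone. This is only a sketch: the function name and return labels are made up, and the private pattern is just the `<ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com` form mentioned above, assuming a 12-digit account ID.

```python
import re

PUBLIC_ECR_HOST = "public.ecr.aws"

# <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com, with a 12-digit account ID.
_PRIVATE_ECR_HOST = re.compile(r"^\d{12}\.dkr\.ecr\.[a-z0-9-]+\.amazonaws\.com$")


def registry_kind(reference: str) -> str:
    """Classify an image or chart reference by its registry host.

    Returns "public-ecr", "private-ecr", or "other". The labels are
    illustrative and not part of any AWS or Helm API.
    """
    # Drop an optional oci:// scheme, then keep only the host part.
    host = reference.removeprefix("oci://").split("/", 1)[0]
    if host == PUBLIC_ECR_HOST:
        return "public-ecr"
    if _PRIVATE_ECR_HOST.match(host):
        return "private-ecr"
    return "other"


if __name__ == "__main__":
    print(registry_kind("oci://public.ecr.aws/karpenter/karpenter"))
```

Only the "private-ecr" case corresponds to the expiring-credentials problem in the linked Argo CD issue; the Karpenter chart and image fall under "public-ecr".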

paul-civitas commented 3 months ago

Ok thanks. Public ECR was only launched in 2020 so some of my experience is probably out of date. I'll give using the registry another shot.

soupdiver commented 2 weeks ago

Just ended up here after some good amount of debugging.... I can confirm that having tags 0.x.x and release-0.x.x NOT pointing to the same thing in the end is 💩 Very technically speaking yea, you might have a point but reality wise this is hard to understand