elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.
Other
23 stars 145 forks source link

[Discuss] Elastic Agent Helm Chart release process #5486

Open ycombinator opened 2 months ago

ycombinator commented 2 months ago

In https://github.com/elastic/elastic-agent/pull/5331, we added a Helm Chart for Elastic Agent. At the time, we deferred the discussion on how this chart should be released. Let's continue the discussion in this issue and try to arrive at a decision.

ycombinator commented 2 months ago

cc: @pkoutsovasilis @swiatekm @blakerouse @cmacknz

swiatekm commented 2 months ago

As I see it, we have two basic choices here:

  1. Release a Helm Chart version every time we release an agent version. The default agent image version is set to the Chart version.
  2. Release on a different schedule. There's either no default agent image (user needs to set it themself), or there's a default, but it may not be the latest agent image.

The important questions that should guide this choice are:

  1. How often do we expect to have semantic changes to the Chart, not related to the agent version?
  2. Are we ok with the user having to set the agent image? a) If we want a default, are we ok with the latest Chart version not having the latest agent version as the default? b) Are we confident in users' ability to upgrade the agent image independently of the Helm Chart version?
  3. Is it important for users to be able to use the Chart with agent versions older than the first Chart release?
  4. Is it a pain to release the Chart in lockstep with agent?

I don't feel like I know enough about the domain to have informed opinions about these myself. I think I'd default to option 1 just because it's simple, but it does sacrifice some flexibility.

pkoutsovasilis commented 2 months ago

I like in general the first choice that @swiatekm proposes above, but I would like to add a twist to it

Before elaborating more I will try to provide some answer to the raised questions

  1. How often do we expect to have semantic changes to the Chart, not related to the agent version?

Changes to the Helm chart can be driven by the following:

  1. Are we ok with the user having to set the agent image?

I would prefer to have as default one the one that the helm chart was tested against

  1. Is it important for users to be able to use the Chart with agent versions older than the first Chart release?

imo the helm chart should support the user specifying different agent versions/images but provide guarantees of nominal operation only with the default image that comes with it. Hence, following this guarantee, there will be chart versions that will support only a specific agent version and either the user needs to stay in an older version if they want to remain under an agent version or change with their own responsibility

  1. Is it a pain to release the Chart in lockstep with agent?

I wouldn't expect this to be more painless that a lockfree release with agent (last famous words of me)

My approach would be that the Helm chart releases can happen only from the main branch. Specifically we can fabricate a versions.json file inside the helm chart that captures all the required release info, namely chart_version, agent_version, etc. Then we can add the following CI steps:

The above is just a draft of an my initial thinking more than happy to drill down the details if it sounds reasonable to all the interested parties

swiatekm commented 2 months ago

@pkoutsovasilis do you mean literally just main, or also the release branches? If we don't do the latter, we won't have Helm Chart releases for agent patch versions.

pkoutsovasilis commented 2 months ago

@swiatekm I was thinking main but sure this can be adjusted to any branch if we deem such a wiser choice. The way I am thinking of it

imo the helm chart should support the user specifying different agent versions/images but provide guarantees of nominal operation only with the default image that comes with it

the helm chart will support agent versions incrementally thus when there is a patch version to the latest agent release I see no issue creating a PR that points the aforementioned version.json to the patch version and triggering all the necessary CI steps. But maybe I am missing something, so please do elaborate πŸ˜ƒ

cmacknz commented 2 months ago

Changes to the built-in kubernetes integration as it mostly reflects this package

How do we account for and include changes to the package? Automation that creates an appropriate PR in this repository that eventually leads to a new chart release?

I like the idea of just releasing from main as it is simpler, but I worry we'd need a way to indicate which agent versions new features are compatible with eventually like the stack version constraint in integrations packages.

It may be simpler to have a release of the helm chart for each active minor to eliminate this problem. If we want to account for package changes, we'd need a versioning scheme that can move faster than the stack release. We couldn't just have 8.16.1, 8.16.2, etc.

pkoutsovasilis commented 2 months ago

How do we account for and include changes to the package? Automation that creates an appropriate PR in this repository that eventually leads to a new chart release?

at the moment this is a manual process so the accounting happens manually. That said, if we deem that this becomes "unbearable" we could try and automate it; ps: transitioning from handlebars to helm templates is non-trivial πŸ˜„ One of the reasons this Helm chart started with only the kubernetes integration built-in is this one

I like the idea of just releasing from main as it is simpler, but I worry we'd need a way to indicate which agent versions new features are compatible with eventually like the stack version constraint in integrations packages. It may be simpler to have a release of the helm chart for each active minor to eliminate this problem. If we want to account for package changes, we'd need a versioning scheme that can move faster than the stack release. We couldn't just have 8.16.1, 8.16.2, etc.

In my thinking the helm chart provides guarantees only for the combination of agent_version and kubernetes_integration_version it comes with when it is released. In simpler terms, if the user wants to mix match different agent versions and the kubernetes integration, it is up to them to do so and they should follow the route of custom integration and not rely in the built-in one. Because the latter is guaranteed to work correctly only with the agent version that the released chart comes with. However, if you are not in favour of this, then I guess we need to alter this approach.

cmacknz commented 2 months ago

In my thinking the helm chart provides guarantees only for the combination of agent_version and kubernetes_integration_version it comes with when it is released.

I think requiring an agent upgrade to get a guarantee for a new kubernetes integration version goes against the way we want package releases to work where you don't have to upgrade agent outside of this scenario.

Often users cannot rapidly upgrade their Elastic Agent (because they have to upgrade ES, they have rigid compliance or QA procedures, etc) but they can upgrade their agent configuration or policy quickly.

Whatever we decide to do has to support upgrading the k8s integration without upgrading the agent version, I think that will be quite a common ask.

Users of this chart will frequently be k8s and Helm users but not k8s and Helm (or Elastic Agent) experts, we need to optimize for helping them get fixes with minimal toil or the need to escalate to engineering via support cases.

pkoutsovasilis commented 2 months ago

I think requiring an agent upgrade to get a guarantee for a new kubernetes integration version goes against the way we want package releases to work where you don't have to upgrade agent outside of this scenario.

Often users cannot rapidly upgrade their Elastic Agent (because they have to upgrade ES, they have rigid compliance or QA procedures, etc) but they can upgrade their agent configuration or policy quickly.

Whatever we decide to do has to support upgrading the k8s integration without upgrading the agent version, I think that will be quite a common ask.

Users of this chart will frequently be k8s and Helm users but not k8s and Helm (or Elastic Agent) experts, we need to optimize for helping them get fixes with minimal toil or the need to escalate to engineering via support cases.

ok I hear you πŸ™‚ Then let's do a minor alternation to the approach; helm chart releases happen from 8.x, 9.x branches, still following the same versions.json paradigm with CI steps, I mentioned above, but each one of these branches target for release a different minor version of the helm chart. As an example 8.16 branch of the agent would be the 0.1.1 helm chart version. Following this pattern, the 9.0 agent branch would be the 0.2.1 helm chart, etc. (speaking random semvers here πŸ˜…). I think this approach covers the cases you mention @cmacknz , right?

cmacknz commented 2 months ago

Yes that would work and let us backport fixes where applicable.

pkoutsovasilis commented 2 months ago

ok then my next step will be to draw this "flow" in a diagram and thus help all the interested parties to have a better understanding how this is gonna look like. Then if there are no strong objections I will proceed to coding it

pkoutsovasilis commented 4 weeks ago

I've experimented with a local MinIO as a Helm registry. Here are my findings:

Based on the above observations, this my proposed flow; each branch, namely 8.16, 8.x, and main, will feature a Helm chart with different chart and app versions in the respective Chart.yaml. The release process for each one will look like the following:

flowchart TD
    classDef default line-height:1.5,text-align:center;

    B[Manual: Bump chart version and set app version<br>in <a href='https://github.com/elastic/elastic-agent/blob/main/deploy/helm/elastic-agent/Chart.yaml'>Chart.yaml</a>]
    D[CI: Verify that agent <a href='https://github.com/elastic/elastic-agent/blob/main/version/version.go'>version</a> matches<br>the major.minor parts of the chart version<br>and fully the app version in Chart.yaml]
    E["CI: Ensure that chart version<br>is pre-release (-SNAPSHOT) if no git tag exists"]
    F[CI: Ensure that a chart with the same version<br>is not released]
    H[CI: Lint and package Helm chart]
    I[CI: Download index.yaml from<br>the elastic Helm registry and<br>add the new chart entry]
    L[CI: Push Helm chart package<br>to repository]
    K[CI: Push updated index.yaml<br>to repository]

    B -.->|Pull Request with Chart.yaml changes| D
    D --> E
    E --> F
    F --> H
    H -.->|"''release'' comment under the PR (maybe PR merged?!)"| I
    I --> L
    L --> K

If the above proposal gets accepted, we need to figure out the following:

Things to consider:

cc @cmacknz @swiatekm @ycombinator

swiatekm commented 3 weeks ago

@pkoutsovasilis in your example, you have:

appVersion: 8.16.0 # elastic-agent version
version: 8.16.10 # chart version

Does that mean that you want to always align the minor version of the Chart and elastic-agent, but not necessarily the patch version? If so, do you plan to release Chart patch versions without bumping the agent version?

pkoutsovasilis commented 3 weeks ago

@pkoutsovasilis in your example, you have:

appVersion: 8.16.0 # elastic-agent version version: 8.16.10 # chart version Does that mean that you want to always align the minor version of the Chart and elastic-agent, but not necessarily the patch version? If so, do you plan to release Chart patch versions without bumping the agent version?

Yes pretty much; I want to keep the major and minor parts always aligned and the patch unlocked πŸ™‚ One of the requirements for the Helm chart release is to able to backport fixes if required (here) which translates to bumping the chart version but not the agent one. To highlight such a case, imagine that there is an update to the built-in kubernetes integration that needs to get out but the agent version must be the same because the actual agent isn't updated. In the case there is indeed a new agent minor version is released, then again we bump the chart version and we update the app version to the former.

swiatekm commented 3 weeks ago

Looking at your proposed process, I'm wondering if there is even a need to publish Chart snapshot versions. It sounds like for each Chart release, we'll have to update Chart.yaml twice:

That just sounds like a lot of updating for not much gain, especially if we're going to be releasing the Chart more often than agent itself. Could we get away with just keeping the normal version and only bumping on releases?

The other question I have is how often do we actually expect to do patch releases of the Chart. The Otel Helm Chart simply bump the version and release on every change. Would that work for us?

pkoutsovasilis commented 3 weeks ago

Looking at your proposed process, I'm wondering if there is even a need to publish Chart snapshot versions. It sounds like for each Chart release, we'll have to update Chart.yaml twice:

Once to set the version from snapshot to release, say 8.16.3-SNAPSHOT to 8.16.3 And then once to set it back to snapshot: 8.16.3 to 8.16.4-SNAPSHOT That just sounds like a lot of updating for not much gain, especially if we're going to be releasing the Chart more often than agent itself. Could we get away with just keeping the normal version and only bumping on releases?

I see what you mean about the number of updates, but the gain for me is that we can deploy unreleased agent versions from each branch. Thus, we want to release also SNAPSHOT charts, which based on our current branches would be:

The other question I have is how often do we actually expect to do patch releases of the Chart. The Otel Helm Chart simply bump the version and release on every change. Would that work for us?

So if I understand this correctly for each commit on each branch, even when no chart-related files have changed, you propose to bump the Chart version and release it?

cmacknz commented 3 weeks ago

I think we need to consider how most users are going to consume these charts and expect them to work. Having a Helm chart version 8.16.3 when the most recent Elastic stack patch release is 8.16.2 is going to create quite a lot of confusion. I don't think we would be allowed do this because of this confusion and this is main reason why the independent agent release versions never run ahead of the stack version and instead use the +buildYYYYMMDDHHMM qualifier.

I like the idea of having the Helm chart able to release separate from the stack releases, but we need a less misleading version scheme than running ahead of the stack release like this.

Let's say we used major.minor.YYYYMMDDHHMM with YYYYMMDDHHMM being the timestamp we published the chart at to get sortable versions (major.minor.$gitsha could also work but isn't sortable). We need to confirm with our PMs on what a reasonable versioning scheme is before committing to one.

Then the simplest release process I can think of would be something like:

  1. For each active release branch, publish a new chart version on any commit, ideally any time just the chart changes or whenever the agent version changes. Use the most recently released agent version in this chart. So a commit to the 8.15 branch around the time I wrote this comment would produce 8.15.202411052056 with agent version 8.15.3. I think a lot of people would still ask us for an 8.15.3 chart version for people who want to be aligned with the stack for whatever reason, but we can probably automate that too.
  2. For main, any prerelease branches, and active release branches publish a new chart version with the -SNAPSHOT suffix using the matching -SNAPSHOT agent version. This can happen on any commit.

Using major.minor.YYYYMMDDHHMM avoids inventing a new versioning scheme entirely since we already use something like this for independent agent releases.

cmacknz commented 3 weeks ago

The "I don't want to think about this anymore" option that everyone would accept is just hooking this up to the stack release process and following that, with the major con being we need stack releases to push fixes out.

pkoutsovasilis commented 3 weeks ago

For each active release branch, publish a new chart version on any commit, ideally any time just the chart changes or whenever the agent version changes. Use the most recently released agent version in this chart. So a commit to the 8.15 branch around the time I wrote this comment would produce 8.15.202411052056 with agent version 8.15.3. I think a lot of people would still ask us for an 8.15.3 chart version for people who want to be aligned with the stack for whatever reason, but we can probably automate that too.

Unless I am missing something, this is almost the same with my proposal with the difference that instead of a patch version bump we will go with a timestamp. I have no issue, I thought that keeping the patch version to simple integers would cause less confusion but if you say otherwise no objection on my end πŸ™‚

For main, any prerelease branches, and active release branches publish a new chart version with the -SNAPSHOT suffix using the matching -SNAPSHOT agent version. This can happen on any commit.

My 2 cents here, since there will be a pipeline that filters per chart-related or agent version changes (for normal releases), there is no overhead of utilising the same here but release a chart with -SNAPSHOT

The "I don't want to think about this anymore" option that everyone would accept is just hooking this up to the stack release process and following that, with the major con being we need stack releases to push fixes out.

ha! πŸ˜„ I am not there yet but if there a need to go with that, sure let's go?!

PS: who should we tag from Product to help decide about the most appropriate version scheme?

cmacknz commented 3 weeks ago

I think we're aligned then, I wrote the process out again to make sure I understood.

How would we handle release notes?

PS: who should we tag from Product to help decide about the most appropriate version scheme?

@strawgate @nimarezainia see https://github.com/elastic/elastic-agent/issues/5486#issuecomment-2458146953

pkoutsovasilis commented 3 weeks ago

How would we handle release notes?

I considered using the existing fragments approach with component: elastic-agent-helm-chart, but I haven’t fully explored it yet, so I’m open to other suggestions πŸ™‚

cmacknz commented 3 weeks ago

That works, but we need to define how, when, and where they get published.

Our existing release notes are in https://github.com/elastic/ingest-docs/tree/main/docs/en/ingest-management/release-notes but they are structured around the stack release.

The current path for the Helm chart would have releases happen outside of that so we'd need to handle this in some way.

nimarezainia commented 2 weeks ago

It may be simpler to have a release of the helm chart for each active minor to eliminate this problem. If we want to account for package changes, we'd need a versioning scheme that can move faster than the stack release. We couldn't just have 8.16.1, 8.16.2, etc.

I understand that this is difficult given that we have stack releases and potential package releases. Can I propose that we version and release based on each minor and patch release (aligned with stack) and not complicate it by having daily or snapshot releases or releases based on changes to the package.

I think our patch releases are frequent enough to accommodate (or shorten the window) package releases that may happen between each patch release. As @strawgate also mentioned, if there's a need we can always release a modified helm as an emergency.

pkoutsovasilis commented 2 weeks ago

@nimarezainia just to check if I my understanding is correct, you propose the Helm chart releases to happen alongside the stack ones because the frequency of the patch releases is frequent enough and thus we can keep the both version locked and fully synced?

pkoutsovasilis commented 1 week ago

Given that the frequency of stack patch version releases is frequent enough, keeping the Helm chart and stack versions fully synced indeed makes sense, and I believe it simplifies the overall process. To implement this, I think we should extend the existing release process to include or invoke an additional step for the Helm chart release. We may need to reach out to the release team under platform engineering productivity (?!) to coordinate this effort. (cc @nimarezainia @ycombinator)

Regarding documentation, I propose utilising the existing fragments mechanism for the changelog, with the parent folder located at deploy/helm/elastic-agent/changelog. As for the public URL path where the changelog will be hosted, it’s not entirely clear yet. However, as far as I know, an effort will start soon to create more structured documentation for the Helm chart, and I suspect this aspect will be addressed within that initiative. (cc @eedugon @kilfoyle @karenzone)

Any thoughts on this approach?

ycombinator commented 1 week ago

@pkoutsovasilis Your approach sounds reasonable to me.

kilfoyle commented 1 week ago

This sounds great to me too. I gather in the docs we'd just have a link to the changelog file, which will minimize any maintenance effort.