elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

[Discuss] Elastic Agent Helm Chart release process #5486

Open ycombinator opened 1 week ago

ycombinator commented 1 week ago

In https://github.com/elastic/elastic-agent/pull/5331, we added a Helm Chart for Elastic Agent. At the time, we deferred the discussion on how this chart should be released. Let's continue the discussion in this issue and try to arrive at a decision.

ycombinator commented 1 week ago

cc: @pkoutsovasilis @swiatekm @blakerouse @cmacknz

swiatekm commented 6 days ago

As I see it, we have two basic choices here:

  1. Release a Helm Chart version every time we release an agent version. The default agent image version is set to the Chart version.
  2. Release on a different schedule. There's either no default agent image (the user needs to set it themselves), or there's a default, but it may not be the latest agent image.

The important questions that should guide this choice are:

  1. How often do we expect to have semantic changes to the Chart, not related to the agent version?
  2. Are we ok with the user having to set the agent image?
     a) If we want a default, are we ok with the latest Chart version not having the latest agent version as the default?
     b) Are we confident in users' ability to upgrade the agent image independently of the Helm Chart version?
  3. Is it important for users to be able to use the Chart with agent versions older than the first Chart release?
  4. Is it a pain to release the Chart in lockstep with agent?

I don't feel like I know enough about the domain to have informed opinions about these myself. I think I'd default to option 1 just because it's simple, but it does sacrifice some flexibility.

pkoutsovasilis commented 6 days ago

In general I like the first choice that @swiatekm proposes above, but I would like to add a twist to it.

Before elaborating, I will try to answer the questions raised:

  1. How often do we expect to have semantic changes to the Chart, not related to the agent version?

Changes to the Helm chart can be driven by the following:

  2. Are we ok with the user having to set the agent image?

I would prefer the default image to be the one that the Helm chart was tested against.

  3. Is it important for users to be able to use the Chart with agent versions older than the first Chart release?

imo the helm chart should support the user specifying different agent versions/images but provide guarantees of nominal operation only with the default image that comes with it. Under this guarantee, some chart versions will support only a specific agent version. Users who want to remain on an older agent version either need to stay on an older chart version or change the image at their own risk.
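The guarantee described above (any image may be set, but only the chart's default is covered) could be sketched roughly as follows. This is only an illustration: `ResolvedImage` and `resolve_agent_image` are hypothetical names, not part of the chart or of Helm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResolvedImage:
    tag: str
    tested: bool  # True only when the tag is the one the chart was released with


def resolve_agent_image(default_tag: str, override_tag: Optional[str] = None) -> ResolvedImage:
    """Pick the agent image tag; a user override is allowed but carries no guarantee."""
    if override_tag is None or override_tag == default_tag:
        return ResolvedImage(tag=default_tag, tested=True)
    # Hypothetical policy: any other tag is accepted, but flagged as untested.
    return ResolvedImage(tag=override_tag, tested=False)
```

A chart could use a flag like `tested` to print a warning at install time instead of rejecting the override outright.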

  4. Is it a pain to release the Chart in lockstep with agent?

I wouldn't expect this to be more painful than a release that is not in lockstep with agent (famous last words).

My approach would be that Helm chart releases can happen only from the main branch. Specifically, we can create a versions.json file inside the helm chart that captures all the required release info, namely chart_version, agent_version, etc. Then we can add the following CI steps:

The above is just a draft of my initial thinking. I am more than happy to drill down into the details if it sounds reasonable to all the interested parties.
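As a rough illustration of the versions.json idea, a CI step could load the file and refuse to continue if the release info is malformed. Only the `chart_version` and `agent_version` keys come from the comment above; the file layout, the example values, and the `load_release_info` helper are assumptions made for this sketch.

```python
import json
import re

# Plain MAJOR.MINOR.PATCH; pre-release suffixes are deliberately out of scope here.
SEMVER = re.compile(r"\d+\.\d+\.\d+")

# Hypothetical file content; the real schema has not been decided in this thread.
EXAMPLE_VERSIONS_JSON = '{"chart_version": "0.1.0", "agent_version": "8.16.0"}'


def load_release_info(text: str) -> dict:
    """Parse versions.json content and check the two version fields."""
    info = json.loads(text)
    for key in ("chart_version", "agent_version"):
        value = info.get(key)
        if not isinstance(value, str) or not SEMVER.fullmatch(value):
            raise ValueError(f"{key} must be a MAJOR.MINOR.PATCH string, got {value!r}")
    return info
```

Calling `load_release_info(EXAMPLE_VERSIONS_JSON)` returns the parsed dictionary, and a missing or malformed field raises `ValueError` so the pipeline fails early.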

swiatekm commented 6 days ago

@pkoutsovasilis do you mean literally just main, or also the release branches? If we don't do the latter, we won't have Helm Chart releases for agent patch versions.

pkoutsovasilis commented 6 days ago

@swiatekm I was thinking main, but sure, this can be adjusted to any branch if we deem that a wiser choice. The way I am thinking of it:

imo the helm chart should support the user specifying different agent versions/images but provide guarantees of nominal operation only with the default image that comes with it

the helm chart will support agent versions incrementally. So when there is a patch version of the latest agent release, I see no issue with creating a PR that points the aforementioned versions.json to the patch version and triggers all the necessary CI steps. But maybe I am missing something, so please do elaborate 😃

cmacknz commented 6 days ago

Changes to the built-in kubernetes integration as it mostly reflects this package

How do we account for and include changes to the package? Automation that creates an appropriate PR in this repository that eventually leads to a new chart release?

I like the idea of just releasing from main as it is simpler, but I worry we'd need a way to indicate which agent versions new features are compatible with eventually like the stack version constraint in integrations packages.

It may be simpler to have a release of the helm chart for each active minor to eliminate this problem. If we want to account for package changes, we'd need a versioning scheme that can move faster than the stack release. We couldn't just have 8.16.1, 8.16.2, etc.
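To make the compatibility concern concrete, here is a minimal sketch of a caret-style version constraint check, loosely in the spirit of the stack version constraint mentioned above. The `satisfies_caret` helper and the constraint strings are hypothetical; nothing like this exists in the chart today, and the sketch assumes plain MAJOR.MINOR.PATCH versions with a major version of 1 or higher.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a MAJOR.MINOR.PATCH string into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def satisfies_caret(agent_version: str, constraint: str) -> bool:
    """Return True if agent_version satisfies a caret constraint such as '^8.16.0'.

    Caret semantics for major >= 1: same major version, and not lower than the floor.
    """
    if not constraint.startswith("^"):
        raise ValueError("only caret constraints are handled in this sketch")
    floor = parse_version(constraint[1:])
    candidate = parse_version(agent_version)
    return candidate[0] == floor[0] and candidate >= floor
```

With this, a chart feature tagged `^8.16.0` would be accepted for agent 8.17.2 but rejected for 8.15.9 and for 9.0.0.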

pkoutsovasilis commented 6 days ago

How do we account for and include changes to the package? Automation that creates an appropriate PR in this repository that eventually leads to a new chart release?

At the moment this is a manual process, so the accounting happens manually. That said, if we find that this becomes "unbearable" we could try to automate it. PS: transitioning from handlebars to helm templates is non-trivial 😄 This is one of the reasons this Helm chart started with only the kubernetes integration built in.

I like the idea of just releasing from main as it is simpler, but I worry we'd need a way to indicate which agent versions new features are compatible with eventually like the stack version constraint in integrations packages. It may be simpler to have a release of the helm chart for each active minor to eliminate this problem. If we want to account for package changes, we'd need a versioning scheme that can move faster than the stack release. We couldn't just have 8.16.1, 8.16.2, etc.

In my thinking the helm chart provides guarantees only for the combination of agent_version and kubernetes_integration_version it comes with when it is released. In simpler terms, if the user wants to mix and match different agent versions and the kubernetes integration, it is up to them to do so, and they should follow the route of a custom integration and not rely on the built-in one, because the latter is guaranteed to work correctly only with the agent version that the released chart comes with. However, if you are not in favour of this, then I guess we need to alter this approach.

cmacknz commented 6 days ago

In my thinking the helm chart provides guarantees only for the combination of agent_version and kubernetes_integration_version it comes with when it is released.

I think requiring an agent upgrade to get a guarantee for a new kubernetes integration version goes against the way we want package releases to work where you don't have to upgrade agent outside of this scenario.

Often users cannot rapidly upgrade their Elastic Agent (because they have to upgrade ES, they have rigid compliance or QA procedures, etc) but they can upgrade their agent configuration or policy quickly.

Whatever we decide to do has to support upgrading the k8s integration without upgrading the agent version, I think that will be quite a common ask.

Users of this chart will frequently be k8s and Helm users but not k8s and Helm (or Elastic Agent) experts, we need to optimize for helping them get fixes with minimal toil or the need to escalate to engineering via support cases.

pkoutsovasilis commented 6 days ago

I think requiring an agent upgrade to get a guarantee for a new kubernetes integration version goes against the way we want package releases to work where you don't have to upgrade agent outside of this scenario.

Often users cannot rapidly upgrade their Elastic Agent (because they have to upgrade ES, they have rigid compliance or QA procedures, etc) but they can upgrade their agent configuration or policy quickly.

Whatever we decide to do has to support upgrading the k8s integration without upgrading the agent version, I think that will be quite a common ask.

Users of this chart will frequently be k8s and Helm users but not k8s and Helm (or Elastic Agent) experts, we need to optimize for helping them get fixes with minimal toil or the need to escalate to engineering via support cases.

ok I hear you 🙂 Then let's make a minor alteration to the approach: helm chart releases happen from the 8.x and 9.x branches, still following the same versions.json paradigm with the CI steps I mentioned above, but each of these branches targets a different minor version of the helm chart for release. As an example, the 8.16 branch of the agent would be the 0.1.1 helm chart version. Following this pattern, the 9.0 agent branch would be the 0.2.1 helm chart, etc. (speaking random semvers here 😅). I think this approach covers the cases you mention @cmacknz, right?
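The branch-per-chart-minor idea could be sketched like this. The mapping reuses the "random semvers" from the comment above purely for illustration, and `BRANCH_TO_CHART_MINOR` and `next_chart_version` are hypothetical names, not an agreed scheme.

```python
from typing import Dict, List

# Illustrative mapping only: agent release branch -> chart MAJOR.MINOR line.
BRANCH_TO_CHART_MINOR: Dict[str, str] = {"8.16": "0.1", "9.0": "0.2"}


def next_chart_version(agent_branch: str, released: List[str]) -> str:
    """Return the next chart patch version for the chart line tied to an agent branch."""
    chart_line = BRANCH_TO_CHART_MINOR[agent_branch]
    # Collect patch numbers already released on this chart line.
    patches = [
        int(version.rsplit(".", 1)[1])
        for version in released
        if version.rsplit(".", 1)[0] == chart_line
    ]
    next_patch = max(patches) + 1 if patches else 0
    return f"{chart_line}.{next_patch}"
```

Under these assumptions a fix backported to the 8.16 branch bumps only the 0.1.x chart line, so the chart can move faster than the agent patch releases without touching the 0.2.x line.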

cmacknz commented 6 days ago

Yes that would work and let us backport fixes where applicable.

pkoutsovasilis commented 6 days ago

ok, then my next step will be to draw this "flow" in a diagram to help all the interested parties better understand how this is going to look. Then, if there are no strong objections, I will proceed to coding it.