[Elastic Agent] Allow the binary upgrade action to also rollback the agent version to the previous working version/state

nimarezainia commented 2 years ago

Relates https://github.com/elastic/elastic-agent/issues/5543

Describe the enhancement: Today we have the option to upgrade the agents to a version greater than what the agent is at (as long as it's still below the stack version). However we have never developed/tested downgrading the agent to the previous working state.

This capability is important for many enterprises, as they will always require a way to mitigate failures.

Describe a specific use case for the enhancement or feature:

Consider the following:

Stack is upgraded from 8.11 to 8.13.2
The operator then starts upgrading the agents from 8.11 to 8.13.2
Once upgraded the agent misbehaves (due to a bug) and renders itself useless
Operator has no choice but to uninstall and re-install back to 8.11

In this case we want the operator to have the ability to "downgrade" the agent to the previous known (good) state via fleet UI.

[ ] https://github.com/elastic/elastic-agent/issues/4606

Now in some catastrophic error situation where the agent has lost connectivity we may not be able to perform this action.

Design

https://github.com/elastic/platform-ux-team/issues/299 (outdated design)

nimarezainia commented 1 year ago

https://github.com/elastic/kibana/issues/131801

pierrehilbert commented 11 months ago

Hey @nimarezainia, As discussed earlier today, according to me the work needs mostly to be done on Fleet side as we can already perform an upgrade to a previous version on the Agent side. cc: @cmacknz to keep me honest here.

cmacknz commented 11 months ago

As discussed earlier today, according to me the work needs mostly to be done on Fleet side as we can already perform an upgrade to a previous version on the Agent side.

The agent will "upgrade" to a lower version using the same logic it would use to upgrade to a newer version. The agent doesn't look at the version at all to determine what to do. So for the most part it works. I do think we need to add tests specifically for downgrades.

The new concern this introduces for the agent is the need for API compatibility when downgrading. We have made and will continue to make changes to the upgrade process, and until now have only been testing the forwards direction.

It would help if we could limit which versions we support downgrading to, for example only within the same minor or to the previous minor. For example 8.11.1 can downgrade to 8.11.0 or 8.10.x. That would give us a guaranteed path to phase out broken or obsolete upgrade functionality.
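
A minimal sketch of that rule, assuming plain `major.minor.patch` version strings. The names `parse_version` and `is_supported_downgrade` are hypothetical and are not part of Elastic Agent or Fleet.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split 'major.minor.patch' into integers (pre-release suffixes are not handled)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_supported_downgrade(current: str, target: str) -> bool:
    """Hypothetical policy: allow only the same minor or the previous minor."""
    cur = parse_version(current)
    tgt = parse_version(target)
    if tgt >= cur:
        return False  # not a downgrade at all
    if tgt[0] != cur[0]:
        return False  # crossing a major (e.g. 8.x to 7.17.x) is out of scope in this sketch
    return cur[1] - tgt[1] in (0, 1)
```

With this rule, 8.11.1 to 8.11.0 or 8.10.4 is accepted, while 8.11.2 to 8.3.0 is rejected.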

Likely it is desirable to support downgrading from 8.x.x to 7.17.x, but 7.17.x has diverged so far from the current agent architecture that it might be hard to do.

I don't want us to support arbitrary downgrades. I don't want us to have to guarantee that downgrading from 8.11.x to 8.3.x will work. Primarily because there are no tests guaranteeing that works today, and it means we have to keep any downgrade compatibility logic around indefinitely.

nimarezainia commented 11 months ago

Transferring this to Kibana. fyi @kpollich @jen-huang

Definitely the downgrade should not be a free "to any version" downgrade. So we will document that downgrade can only happen to a certain past range of releases.

However here we are talking about "rollback", which means going back to the version the agent was at before the upgrade, because the user knows that is a stable version in their specific deployment environment. There is also less chance of issues compared to downgrading to some random lower version.

So I do think that the list of possible downgrades should include the previous version. This becomes harder to cater for since it's really a variable when looking at a group of agents being upgraded/downgraded.

Is it possible for the action to be "downgrade to the last version" and the agent decides what that would be? Can the last version of the agent be stored somewhere locally and referenced in this case?
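
One way to picture the "stored locally" idea is a small marker file written at upgrade time and read back when a rollback is requested. This is only a sketch under stated assumptions: the file name `previous-version.json`, its fields, and both function names are hypothetical and do not describe how Elastic Agent actually records upgrades.

```python
import json
import tempfile
from pathlib import Path

MARKER_NAME = "previous-version.json"  # hypothetical file name


def record_upgrade(state_dir: Path, previous: str, new: str) -> None:
    """Remember which version was running before the upgrade."""
    marker = {"previous_version": previous, "current_version": new}
    (state_dir / MARKER_NAME).write_text(json.dumps(marker))


def rollback_target(state_dir: Path):
    """Return the version to roll back to, or None if no upgrade was recorded."""
    path = state_dir / MARKER_NAME
    if not path.exists():
        return None
    return json.loads(path.read_text())["previous_version"]


# Usage: each agent resolves "the last version" on its own.
with tempfile.TemporaryDirectory() as tmp:
    state_dir = Path(tmp)
    before = rollback_target(state_dir)   # nothing recorded yet
    record_upgrade(state_dir, previous="8.11.0", new="8.13.2")
    after = rollback_target(state_dir)    # the version the agent came from
```

Because each agent looks up its own previous version, a single "downgrade to the last version" action could cover a group of agents that started from different versions.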

jlind23 commented 11 months ago

@nimarezainia and I chatted this morning and this is not ready to be worked on. A couple of questions need to be answered:

How many versions back can a user downgrade? Should it be to the previous patch or the previous minor?
Should we allow him to downgrade only to the release that was installed before? If yes, then we need to store it somewhere.
What happened if Elastic Agent introduces a "breaking" change in how it is installed? Different folders, permissions, ..?

Do we really want to support such case? My feeling is that we should always look forward and the early agent release would be my preferred option.

As stated by @pierrehilbert above, this is already supported for Standalone Agents.

pierrehilbert commented 11 months ago

It's also working for managed agent if we are triggering this from the CLI and not Fleet UI.

cmacknz commented 11 months ago

Do we really want to support such case? My feeling is that we should always look forward and the early agent release would be my preferred option.

I agree this is simpler but the difference with always moving forward is there is always a delay to wait for a new release to be published. With a downgrade the user can fix the problem immediately themselves.

Should we allow him to downgrade only to the release that was installed before? If yes, then we need to store it somewhere.

Allowing downgrade to whatever was installed before requires us to support downgrades to arbitrary versions. We have many released versions today with no restrictions on what version they can upgrade to. A user upgrading a mixed version Fleet for example 7.17.x, 8.3.0, and 8.6.0 to the latest 8.11.2 would need to be able to downgrade from 8.11.2 to any of those previous versions.

That said I can't think of a reason why this wouldn't work, but we don't actively test this today.

What happened if Elastic Agent introduces a "breaking" change in how it is installed? Different folders, permissions, ..?

We can't make breaking changes is the answer to this. We must only make changes that allow us to successfully downgrade.

cmacknz commented 11 months ago

That said I can't think of a reason why this wouldn't work, but we don't actively test this today.

Thinking a bit more, I think we may only be able to say we support this for versions that include https://github.com/elastic/elastic-agent/issues/2873 which went into 8.10.0. https://github.com/elastic/elastic-agent/commit/b2b67bc8861e2133bea16349e200cabae6747080

The API contract between the agent and the upgrade watcher is the most important one, and as of 8.10.0 we invoke the watch command from the agent version we are upgrading or downgrading to. This means it knows how to talk to the agent because it is from the same version.

elasticmachine commented 11 months ago

Pinging @elastic/fleet (Team:Fleet)

nimarezainia commented 11 months ago

@jlind23 @pierrehilbert, @cmacknz and I spoke about this issue and we would like to split this into two separate efforts.

Path 1) We just faced a catastrophic issue after users had upgraded to the latest version available. They were hamstrung and stuck there until Elastic could get a fix out. Path 1 is to address this type of emergency and enable our users to get off the latest version they are on. This means downgrading to any version they can and waiting until Elastic releases a fixed version. This may just need Fleet UI changes and testing of the downgrade.

Path 2) Rollback of the Agent binary. Basically what the enterprise users need and that is, for them to get back to the version that they were on. This will be a bit more challenging to achieve.

I'd say based on recent experience Path 1 has more of an urgency behind it. Let me know if you agree, if so we can use this issue for the downgrade option and pursue a new one for the rollback.

(fyi @kpollich & @jen-huang )

lucabelluccini commented 11 months ago

This may just need Fleet UI changes and testing of the downgrade.

The critical point is Fleet needs a compatibility check, plus we need to make sure integrations being used in the policies in use do not require specific Elastic Agent versions (at the moment the manifest declares the required Kibana version, not the Elastic Agent version). Plus the Fleet protocol must be also compatible. The testing would also be quite extensive as we cannot just test Elastic Agent downgrade, but also ensure integrations can downgrade too.

If we're opening the downgrade from Fleet UI without all the safety nets above, we should restrict its use only as "last resort", for situations like hitting a bug just after upgrading to a new release and switching back to a "closer release". There are risks like having other side-effect problems which might generate more harm than benefits. Examples:

changes of protocol
changes of registry format

jlind23 commented 11 months ago

The testing would also be quite extensive as we cannot just test Elastic Agent downgrade, but also ensure integrations can downgrade too.

And also that every other component can downgrade, including endpoint.

If we're opening the downgrade from Fleet UI without all the safety nets above, we should restrict its use only as "last resort", for situations like hitting a bug just after upgrading to a new release and switching back to a "closer release". There are risks like having other side-effect problems which might generate more harm than benefits.

I feel like this should always be the last resort as we can't reasonably check all the test cases here even in an automated fashion.

cmacknz commented 11 months ago

What I mentioned speaking with Nima would be that we should have an explicit and limited compatibility guarantee for downgrades. For example you can always go from version N to version N-1. You cannot downgrade to any arbitrary version.

To use our recent example of the memory leak on Windows. It took us several days to track down the source of the memory leak. While being able to release agent on demand would have helped get the fix out faster, there is still a several day period where users are trapped with a fatal bug.

All we needed to do in this case was provide a path back to 8.10.4 and that would have immediately fixed the problem.

Guaranteeing the ability to go back a single minor version is something we can build tests around and more deterministically guarantee will work. We may find some bugs in this process that need to be fixed, so we may only be able to make it possible starting from specific versions; for example, I strongly suspect this would only work starting from 8.10.x.

In theory if we provide this a user can go back an arbitrary number of versions through repeated downgrades if it came to it.
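
As a rough illustration of repeated single-minor downgrades, the sketch below lists the hops needed to get from one minor to an older one. The function name `downgrade_steps` is hypothetical, and the sketch assumes every hop stays within one major version.

```python
def downgrade_steps(major: int, current_minor: int, target_minor: int) -> list:
    """List (from, to) hops when stepping back one minor at a time."""
    if target_minor > current_minor:
        raise ValueError("target minor must not be newer than the current minor")
    return [
        (f"{major}.{minor}", f"{major}.{minor - 1}")
        for minor in range(current_minor, target_minor, -1)
    ]


# Going from 8.13 back to 8.11 takes two supported hops.
steps = downgrade_steps(8, 13, 11)
```

Each hop is an N to N-1 downgrade, so only that single step needs a compatibility guarantee.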

What I don't want to have to do is allow downgrading from any version to any other arbitrary version, for example from 8.11.2 back to 8.3. There have been too many changes, and forcing us to be eternally backwards compatible limits the improvements we can make to the upgrade process and the agent in general.

jlind23 commented 11 months ago

@cmacknz agree with this statement. We can probably start by testing 8.12.X rollback to 8.11.X and then claim it as supported only from 8.12 onwards. One problem though is that nothing will prevent them from doing 8.13 to 8.12 then 8.12 to 8.11 etc.. So I guess the path forward would be:

Add an integration test / a manual test that rolls back an installed Elastic Agent to the previous minor.patch available
Fix any problems that arise along the way
Give this ability in fleet

For the fleet part, I am unsure whether or not we want to provide a bulk rollback option though.

nimarezainia commented 10 months ago

@jlind23 @cmacknz Would you be able to summarize what the integration test would do? I am assuming that testing of integrations packages downgrading is not a part of that.

In the recent memory leak issue, N-1 wouldn't have helped as we had two versions that were affected. This is exacerbated if it takes longer for the problem to actually surface, which was the case here.

jlind23 commented 10 months ago

I am wondering if we should move our project creation to use that tool, or if we just should add a similar step in our current project creation to do an api call to github to retrieve that version.

Install a managed agent in version X and check that system integration is working properly
Upgrade it to version X+1 and check that system integration is working properly
Downgrade it to version X and check that system integration is working properly

As a first step, the downgrade can be executed via CLI and not via the UI.

cmacknz commented 10 months ago

In the recent memory leak issue, N-1 wouldn't have helped as we had two versions that were affected. This is exacerbated if it takes longer for the problem to actually surface, which was the case here.

The N in N-1 is the minor version, not the patch version. So my intent would be to allow going from any 8.11.x back to any 8.10.x, which would have helped. You are correct that going from 8.11.1 to 8.11.0 would not have helped.

Install a managed agent in version X and check that system integration is working properly
Upgrade it to version X+1 and check that system integration is working properly
Downgrade it to version X and check that system integration is working properly

Yes essentially this but the definition of "working properly" is important. We would need to check to see if there is log data loss or duplication specifically.
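
A sketch of what that check could look like, assuming the test writes log events with unique IDs and can list the IDs that ended up indexed afterwards. The function name `compare_ingestion` and the ID-based approach are assumptions for illustration, not an existing test helper.

```python
from collections import Counter


def compare_ingestion(expected_ids, ingested_ids):
    """Return (lost, duplicated) event IDs given what was written and what was indexed."""
    counts = Counter(ingested_ids)
    lost = sorted(set(expected_ids) - set(counts))
    duplicated = sorted(event_id for event_id, n in counts.items() if n > 1)
    return lost, duplicated


# Example: event "c" never arrived and event "b" was indexed twice.
lost, duplicated = compare_ingestion(["a", "b", "c", "d"], ["a", "b", "b", "d"])
```

A downgrade test would then count as "working properly" only when both lists are empty after each of the three steps.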

cmacknz commented 10 months ago

We also need to confirm that the other teams that develop for agent support downgrading, and if they don't rather than not doing this we can make supporting it conditional on the running integrations. Endpoint is the biggest question.

@ferullo @nfritts does endpoint support downgrading from one version to another? We are looking to add support for downgrading agent, but want to constrain it so that we only allow going from the current version to the previous minor version instead of to any arbitrary version. So downgrading from 8.11.x to 8.10.x is allowed, but not 8.11.x to 8.9.x. This limits the scope of what we need to test and serves as the backwards compatibility contract.

As discussed earlier, even with independent agent releases users still need to wait for us to find and fix the problem. Allowing a downgrade to the previous version allows an immediate work around.

nfritts commented 10 months ago

@cmacknz I'd want to do some actual testing with it, but overall, we designed the endpoint install so that downgrades would be supported. I don't think we've changed anything that would make it a problem since then.

nimarezainia commented 10 months ago

Install a managed agent in version X and check that system integration is working properly
Upgrade it to version X+1 and check that system integration is working properly
Downgrade it to version X and check that system integration is working properly

Yes essentially this but the definition of "working properly" is important. We would need to check to see if there is log data loss or duplication specifically.

There's another variable here which is the version of the System Integration (or any integration). Which version of the package would be tested in these scenarios? This test would probably need to run over multiple package versions. It's ok to assume that the version of the integration doesn't change when testing X and X+1

I would also say that we would need to test with more than just System Integration and have some of the other top integrations currently in use, such as Defend and NGINX, to increase the level of confidence.

cmacknz commented 10 months ago

There is nothing forcing alignment between package version and agent version right now, any integration package version can be used with any agent version. The constraint is on the Kibana version only. So I don't expect the integration version to have much effect.

The important thing we need to test are inputs that store local state, because we need to make sure those inputs can handle going backwards properly without losing or re-ingesting data. Filebeat log ingestion is in this category.

We'll need to reach out to each team that develops inputs for agent to get confirmation on whether or not they can support this.

jlind23 commented 10 months ago

I probably know the answer but asking anyway - @nimarezainia is data duplication an issue in that case or is this a known limitation we can document?

nimarezainia commented 10 months ago

I don't know that it's a known issue but we could certainly document it. Since this would be a rare event (the downgrade), I would say duplication may be tolerated. It will also depend on the scale of the duplication (which I am not certain we can determine accurately).

cmacknz commented 10 months ago

Issue on the agent we'd want to implement before rolling this out https://github.com/elastic/elastic-agent/issues/4072

This follows from some changes to the directory structure we are making for the independent releases project.

nimarezainia commented 6 months ago

All the pre-requisites for this issue have now been resolved. Changing the priority so that we have this capability in the Fleet UI.

cmacknz commented 6 months ago

Added https://github.com/elastic/elastic-agent/issues/4606 on the agent side so we can guarantee this will continue to work.

Initially I think we should only support going to the previous minor as that is easy to test and guarantee.

nimarezainia commented 6 months ago

+1 to limiting the range. So if the user upgrades to 8.13.2 and needs to downgrade, are we saying they can only downgrade to 8.12.0? or all binaries available in 8.12.x

Obviously this would be supported from whenever this issue is merged (say only supported from 8.15 onwards) with the associated testing to the lower version. As we move forward, the supported downgrade version also moves forward. As mentioned the user can do multiple step downgrades.

cmacknz commented 6 months ago

are we saying they can only downgrade to 8.12.0? or all binaries available in 8.12.x

Any 8.12.x would be valid. There would not be structural changes to the agent internals that prevent upgrades anywhere within the same minor.

"Any version can downgrade to any other agent version" is not currently a requirement for Elastic Agent. We could introduce this and enforce it starting from a specific point in time, but nobody was testing this back in 7.17.x or 8.3.x or whatever. If we support arbitrary backwards compatibility we are also limited in the changes we can make to the agent architecture.

The agent has an automatic rollback built into its upgrade process but it depends on keeping both versions of the agent on disk at the same time, with two copies of all of the local state so we can just revert back.

Allowing downgrades outside of the existing upgrade process means the local copies of the previous version's internal state are gone and the old version needs to know how to reconstruct them from the newer version, which for many old versions won't work at all and there's nothing we can do to change that now.

nimarezainia commented 6 months ago

Modified the description for this feature. keeping it in sp30.

kpollich commented 5 months ago

Moving this to blocked until the integration tests are in place. I don't think the lift on the UI side will be heavy, but I don't want to take on the work until we have test coverage on the agent side.

Blocked by https://github.com/elastic/elastic-agent/issues/4606

strawgate commented 4 months ago

If we allow users to configure how long the old version remains on disk, can we just have the downgrade be to the archived agent install and state?

So you wouldn't be downgrading to a specific version, you'd just be downgrading to the previous version and state available on disk?

Perhaps a setting when the user schedules the upgrade called like, "Downgrade window", or "Previous Agent snapshot retention" with a description which lets the user know that reverting may cause ingestion of the data in the window again?

Would this allow us to avoid the "can only downgrade starting with 8.15"?
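
To make the "downgrade window" idea concrete, here is a minimal sketch assuming the previous install and its state stay on disk for a configurable period after the upgrade. The function name `rollback_available` and its parameters are hypothetical and do not refer to an existing Fleet or agent setting.

```python
from datetime import datetime, timedelta


def rollback_available(upgraded_at: datetime, retention: timedelta, now: datetime) -> bool:
    """True while the previous install and its state are assumed to still be on disk."""
    return now - upgraded_at <= retention


upgraded_at = datetime(2024, 1, 1)
window = timedelta(days=7)
inside = rollback_available(upgraded_at, window, datetime(2024, 1, 5))
outside = rollback_available(upgraded_at, window, datetime(2024, 1, 9))
```

The same window also bounds how much data could be ingested again after reverting.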

cmacknz commented 4 months ago

Would this allow us to avoid the "can only downgrade starting with 8.15"?

We'd still only be able to support the longer agent snapshot retention from the release where we introduced that configuration, but it would work for any agent version combination.

One caveat to the approach of allowing the old agent to be retained for longer and then returning to it is that users would also return their data collection state to that point in time. For example, if someone had accumulated one week's worth of log files on disk before rolling back like this, they would re-ingest that one week's worth of data.
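
The re-ingestion effect can be shown with a toy model of an input that tracks a read offset per file. This is a simplified simulation, not Filebeat's real registry: the class `FileInputState` and its behavior are assumptions made for illustration.

```python
import copy


class FileInputState:
    """Toy model of an input that remembers how many lines it has read per file."""

    def __init__(self):
        self.offsets = {}

    def collect(self, files):
        """Return lines not yet read and advance the stored offsets."""
        new_lines = []
        for path, lines in files.items():
            start = self.offsets.get(path, 0)
            new_lines.extend(lines[start:])
            self.offsets[path] = len(lines)
        return new_lines


files = {"app.log": ["day1", "day2"]}
state = FileInputState()
first = state.collect(files)           # the old version reads day1 and day2
snapshot = copy.deepcopy(state)        # state kept on disk with the old version

files["app.log"] += ["day3", "day4"]   # more data arrives after the upgrade
second = state.collect(files)          # the new version reads day3 and day4

rolled_back = snapshot                 # rollback restores the older offsets
again = rolled_back.collect(files)     # day3 and day4 are read a second time
```

Everything collected between the snapshot and the rollback is read again, which is the duplication described above.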

That approach does eliminate any pain around data format migrations though, assuming there are no integrations storing state outside of the agent directory.

nimarezainia commented 4 months ago

I think we can caveat the data duplication concern here. The benefit outweighs that concern IMO. Ideally the user first and foremost would want to restore the old working state. Since this will only be invoked on the rarest of occasions, data duplication may not be that big a deal.

cmacknz commented 4 months ago

I started writing an issue for the agent side implementation of this, but it's complicated enough that it needs to start as a small RFC. I will send that out soon.

nimarezainia commented 1 month ago

Modified the description to reflect the fact that we would be essentially reverting the upgrade to the original version/state.

markniemeijer commented 4 days ago

+100 if this goes live :)

elastic / kibana

[Elastic Agent] Allow the binary upgrade action to also rollback the agent version to the previous working version/state #172745