elastic / fleet-server

The Fleet server allows managing a fleet of Elastic Agents.

Agent `upgraded_at` field keeps updating to current time #3263

Closed juliaElastic closed 8 months ago

juliaElastic commented 8 months ago

Stack version 8.12.1 and possibly others.

There seems to be an issue where agents' `upgraded_at` field keeps being updated to the current time. As a result, the Fleet UI is not showing "Upgrade available" when it should, and the Upgrade agent action is disabled, because the Fleet UI doesn't consider an agent upgradeable if it was upgraded in the last 10 minutes.

It's not clear yet if the issue is on fleet-server or agent side.

Reproduced on a fresh 8.12.1 cluster by enrolling an 8.11.4 agent, upgrading it to 8.12.0, and waiting 10 minutes. The agent is still not allowed to be upgraded again to 8.12.1, and the `upgraded_at` field looks recent, even though the last upgrade happened more than 10 minutes ago.
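
For context, a minimal sketch of the gating behavior described above. The real check lives in the Fleet UI in Kibana; the function and helper names below are illustrative assumptions, not the actual implementation.

```go
package main

import (
	"fmt"
	"time"
)

// versionLessThan is a naive stand-in for a real semver comparison;
// it exists only to make this sketch self-contained.
func versionLessThan(a, b string) bool { return a < b }

// isUpgradeable mirrors the gating described above: an agent is only
// offered an upgrade if a newer version exists AND its last upgrade
// finished more than 10 minutes ago. A constantly refreshed upgraded_at
// therefore keeps the action disabled indefinitely.
func isUpgradeable(agentVersion, latestVersion string, upgradedAt time.Time) bool {
	return versionLessThan(agentVersion, latestVersion) &&
		time.Since(upgradedAt) > 10*time.Minute
}

func main() {
	// If upgraded_at is rewritten on every checkin, this stays false forever.
	fmt.Println(isUpgradeable("8.12.0", "8.12.1", time.Now())) // false
}
```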

Workaround:

juliaElastic commented 8 months ago

I found the bug: `upgrade_details` is set to JSON `null` when the upgrade is complete, and the logic looks at `len(agent.UpgradeDetails)` to decide whether the previous agent doc had upgrade details. That check evaluates to true for `null` (the raw bytes have length 4), so `upgraded_at` is set to now at every checkin.
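
A minimal, self-contained sketch of that failure mode, assuming `UpgradeDetails` is held as raw JSON on the agent document (the struct here is simplified, not fleet-server's actual model):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

type Agent struct {
	UpgradeDetails json.RawMessage `json:"upgrade_details,omitempty"`
}

func main() {
	// When the upgrade completes, the agent reports upgrade_details as JSON null.
	var agent Agent
	_ = json.Unmarshal([]byte(`{"upgrade_details": null}`), &agent)

	// Buggy check: the literal null is kept as the 4 raw bytes "null",
	// so len != 0 and the agent is treated as if it still had upgrade
	// details, resetting upgraded_at on every checkin.
	fmt.Println(len(agent.UpgradeDetails)) // 4

	// Safer check: treat both empty and JSON null as "no details".
	hasDetails := len(agent.UpgradeDetails) > 0 &&
		!bytes.Equal(agent.UpgradeDetails, []byte("null"))
	fmt.Println(hasDetails) // false
}
```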

jlind23 commented 8 months ago

@juliaElastic does it mean a newly installed agent on 8.12.0 would be successfully upgraded to 8.12.1? @amolnater-qasource do you have a scenario where an agent on the previous minor, 8.11, is upgraded to the next minor and then to all its patches, i.e. 8.12.0 then 8.12.1? @kpollich @juliaElastic is there any way for us to automatically test this?

kpollich commented 8 months ago

does it mean a newly installed agent on 8.12.0 would be successfully upgraded to 8.12.1?

An agent on 8.12.0 cannot be upgraded to 8.12.1 via Fleet UI currently without the workaround Julia drafted here: https://github.com/elastic/fleet-server/pull/3264#issuecomment-1936216485.

is there any way for us to automatically test this?

We need an automated test on all release branches where an agent on the latest available patch release for that branch is upgraded to the build for the current HEAD of that release branch. E.g. on the 8.12 branch, we'd run an upgrade for an agent running the released 8.12.0 agent binary to the current 8.12.0-SNAPSHOT build built off the release branch.

Additionally, we could have a daily run that does the same using the daily snapshot build instead of a PR build.
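
A hedged sketch of what such a test could look like in fleet-server's Go integration suite. Every helper here (`latestReleasedPatch`, `currentBranchSnapshot`, `enrollAgent`, `upgradeAgent`, `waitForUpgradeComplete`, `upgradedAt`) is a hypothetical placeholder for illustration, not an existing harness API.

```go
//go:build integration

package integration

import (
	"testing"
	"time"
)

// TestUpgradeFromLatestPatchToBranchHead sketches the proposed coverage:
// enroll an agent on the branch's latest released patch, upgrade it to the
// branch's current snapshot build, then verify upgraded_at stops moving,
// which is the regression described in this issue.
func TestUpgradeFromLatestPatchToBranchHead(t *testing.T) {
	from := latestReleasedPatch(t)  // e.g. "8.12.0" (hypothetical helper)
	to := currentBranchSnapshot(t)  // e.g. "8.12.0-SNAPSHOT" (hypothetical helper)

	agent := enrollAgent(t, from)
	upgradeAgent(t, agent, to)
	waitForUpgradeComplete(t, agent, 10*time.Minute)

	// upgraded_at must stay fixed once the upgrade completes.
	first := upgradedAt(t, agent)
	time.Sleep(2 * time.Minute) // allow several checkins to happen
	if !upgradedAt(t, agent).Equal(first) {
		t.Fatal("upgraded_at kept moving after the upgrade completed")
	}
}
```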

jlind23 commented 8 months ago

@pierrehilbert @cmacknz regarding Kyle's comment above, isn't this something we already test in the Elastic Agent testing framework?

amolnater-qasource commented 8 months ago

Hi @jlind23

@amolnater-qasource do you have a scenario where an agent on the previous minor, 8.11, is upgraded to the next minor and then to all its patches, i.e. 8.12.0 then 8.12.1?

We don't have a documented testcase for this scenario and cover it as part of exploratory testing. Please let us know if we should create a testcase for this.

Testing details: While testing on a BC build (example: 8.12.1), if we have to upgrade an 8.12.0 agent, we have to trigger the upgrade using the API from Dev Tools (as of now), because the UI doesn't show "Upgrade available" until 8.12.1 is released.

We weren't able to upgrade directly from the previous version (8.12.0) to 8.12.1 BC1 using the Fleet UI.

Thanks!

jlind23 commented 8 months ago

We weren't able to upgrade directly from the previous version (8.12.0) to 8.12.1 BC1 using the Fleet UI.

@amolnater-qasource but once you were able to test 8.12.0 to 8.12.1 it worked right?

amolnater-qasource commented 8 months ago

@jlind23 We have revalidated on the released 8.12.1 and observed that we are not able to upgrade from the UI.

Screenshots/Recordings:

https://github.com/elastic/fleet-server/assets/77374876/e366ef2f-2008-42d5-8285-7badd4e170cf

Please let us know if anything else is required from our end. Thanks!!

kpollich commented 8 months ago

While testing on a BC build (example: 8.12.1), if we have to upgrade an 8.12.0 agent, we have to trigger the upgrade using the API from Dev Tools (as of now), because the UI doesn't show "Upgrade available" until 8.12.1 is released.

Thanks @amolnater-qasource, this makes sense, as the next patch release isn't published during the BC phase and thus won't be shown in the Fleet UI (maybe something we can file an enhancement for). To clarify: was upgrading from 8.12.0 -> 8.12.1 via the API successful during the BC test? If so, can you share a summary of the test steps used as well? My TestRail access has lapsed as I don't log in frequently 🙃, otherwise I would check myself. Many thanks.

cmacknz commented 8 months ago

@pierrehilbert @cmacknz regarding Kyle's comment above, isn't this something we already test in the Elastic Agent testing framework?

We test a single upgrade; that is, we install an agent built from the head of the current branch and upgrade it to the latest snapshot in that branch. This would be 8.13.0-SNAPSHOT or 8.12.0-SNAPSHOT for main and 8.12 respectively.

We don't test two consecutive upgrades because, from the agent's perspective, there is no reason to: once the agent completes the upgrade state machine reported in the upgrade details, it can upgrade again. There is no other state in the agent that can prevent this.

jlind23 commented 8 months ago

@amolnater-qasource We looked at testrail with @kpollich and it looks like the test case below does not exist:

We are not able to upgrade from 8.12.0 -> 8.12.1 via bulk actions either.

Can we make sure this is added please?

kpollich commented 8 months ago

We test a single upgrade; that is, we install an agent built from the head of the current branch and upgrade it to the latest snapshot in that branch. This would be 8.13.0-SNAPSHOT or 8.12.0-SNAPSHOT for main and 8.12 respectively.

We don't test two consecutive upgrades because, from the agent's perspective, there is no reason to: once the agent completes the upgrade state machine reported in the upgrade details, it can upgrade again. There is no other state in the agent that can prevent this.

Thanks, Craig. I think the agent test coverage is sufficient here, and consecutive upgrades aren't something we should pursue adding. The coverage gaps lie elsewhere in Fleet.

amolnater-qasource commented 8 months ago

To clarify: was upgrading from 8.12.0 -> 8.12.1 via the API successful during the BC test?

@kpollich yes, the direct 8.12.0 -> 8.12.1 BC1 upgrade was successful; it's part of our testcases, using the below API under Dev Tools:

POST kbn:/api/fleet/agents/<agent-id>/upgrade
{
  "version": "8.12.1"
}
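
For reference, a minimal sketch of issuing the same request outside Dev Tools, here in Go. The Kibana URL, credentials, and agent ID are placeholders; the `/api/fleet/agents/{id}/upgrade` endpoint and the `kbn-xsrf` header are standard Kibana API conventions.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Placeholders: substitute a real Kibana URL, credentials, and agent ID.
	kibanaURL := "https://localhost:5601"
	agentID := "<agent-id>"

	body := strings.NewReader(`{"version": "8.12.1"}`)
	req, err := http.NewRequest(http.MethodPost,
		kibanaURL+"/api/fleet/agents/"+agentID+"/upgrade", body)
	if err != nil {
		panic(err)
	}
	req.SetBasicAuth("elastic", "<password>")
	req.Header.Set("Content-Type", "application/json")
	// Kibana rejects mutating API calls that lack this header.
	req.Header.Set("kbn-xsrf", "true")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status) // expect 200 OK when the upgrade is accepted
}
```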

Further, even on a released 8.12.1 Kibana environment, we are able to successfully upgrade 8.12.0 -> 8.12.1 from the Fleet UI.

Screen Recording:

https://github.com/elastic/fleet-server/assets/77374876/39c87f49-46bd-41fa-915d-52dc645cbfc5

If so, can you share a summary of the test steps used as well?

We do not have any testcase for upgrading agents twice, e.g. 8.11.4 -> 8.12.0 -> 8.12.1.

However, we have testcases for upgrading from one version lower on all OSes:

Please let us know if anything else is required from our end.

cc: @jlind23 Thanks!

jlind23 commented 8 months ago

@kpollich @juliaElastic according to https://github.com/elastic/fleet-server/issues/3263#issuecomment-1938845698 it means that a fresh install on 8.12.0 can be upgraded to 8.12.1, is that expected?

kpollich commented 8 months ago

Thanks @amolnater-qasource - this is extremely helpful in understanding our existing test coverage here.

according to https://github.com/elastic/fleet-server/issues/3263#issuecomment-1938845698 it means that a fresh install on 8.12.0 can be upgraded to 8.12.1, is that expected?

I can confirm this is working as expected. I created a fresh 8.12.1 cloud instance (which naturally deploys a Fleet Server on 8.12.1 as well), then enrolled an agent running 8.12.0. My observations are below:

  1. Upgrade available filter behaves as expected immediately following enrollment:

image

  2. "Upgrade" action is available, and triggering an upgrade immediately applies to the agent as expected

image

  3. Granular upgrade state is reported as expected

image

  4. Upgrade completes as expected

image


So, if I'm understanding the smoke tests properly, we wouldn't have caught this issue in smoke tests. In our smoke tests, we create a cloud instance on the latest release, then enroll an agent on the previous release, then attempt to upgrade it. To confirm this, I performed the same steps as above, but initially enrolled an agent on 8.11.4, then upgraded that agent to 8.12.0 instead of starting with a fresh 8.12.0 agent. Observations below:

  1. Agent enrolls successfully, shows Upgrade available badge as expected when on 8.11.4

image

  2. Agent is upgradeable to 8.12.0 and 8.12.1, upgrades to 8.12.0 as expected

image

image

  3. Agent is not upgradeable to 8.12.1, even after waiting the requisite 10 minutes for the upgrade rate limit - this is the bug described in this issue

image


So, in order to catch this bug in the QAS smoke tests, we would've needed to test a sequential upgrade from 8.11.4 -> 8.12.0 -> 8.12.1, or in generic terms: previous minor's latest patch -> current minor -> current minor's latest patch. Codifying this into a regression test seems like a good idea, but it's hard to decide what the test case should be. When we go to test 8.12.2, should the sequential upgrade test case be 8.11.4 -> 8.12.0 -> 8.12.2 or 8.11.4 -> 8.12.1 -> 8.12.2?

cmacknz commented 8 months ago

When we go to test 8.12.2, should the sequential upgrade test case be 8.11.4 -> 8.12.0 -> 8.12.2 or 8.11.4 -> 8.12.1 -> 8.12.2?

I don't think it matters; either sequence would reproduce the bug, wouldn't it? Probably best to always use the latest versions with the most bug fixes to minimize other issues.

This regression test using real agents is a good idea, but it also feels like you could write an automated test for the upgrade state directly in Fleet Server. The simplest version of this would use mock agents (similar to horde), and you could query the resulting changes out of Elasticsearch directly, although a better test would probably use the Fleet API in Kibana.

The agent test framework we use can provide the guarantee that the agent half of the upgrade works as expected, so you don't need to reverify that.

Using mock agents would also allow you to have them do adversarial things like make requests with incorrect and out-of-order upgrade details. While Fleet shouldn't have to verify the agent's part of the contract, it also shouldn't assume the agent will never have a bug in how it talks to Fleet, and it should defend itself against that.
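
A hedged sketch of what such a mock-agent test could look like in fleet-server. The checkin helpers and the payload shapes are simplified stand-ins for illustration, not the real test harness; all helpers (`enrollMockAgent`, `checkin`, `fetchUpgradedAt`) are hypothetical.

```go
package integration

import (
	"testing"
)

// TestUpgradedAtStableAfterNullDetails sketches the mock-agent idea:
// drive the checkin API directly and assert that upgraded_at stops
// moving once upgrade_details is cleared with JSON null.
func TestUpgradedAtStableAfterNullDetails(t *testing.T) {
	agent := enrollMockAgent(t)

	// Checkin reporting an in-progress upgrade.
	checkin(t, agent, `{"upgrade_details": {"state": "UPG_DOWNLOADING"}}`)

	// Checkin clearing the details with JSON null, as a real agent does
	// when the upgrade completes; this is the payload that triggered
	// the bug in this issue.
	checkin(t, agent, `{"upgrade_details": null}`)
	first := fetchUpgradedAt(t, agent) // read back from Elasticsearch

	// Subsequent plain checkins must not move upgraded_at.
	checkin(t, agent, `{}`)
	checkin(t, agent, `{}`)
	if !fetchUpgradedAt(t, agent).Equal(first) {
		t.Fatal("upgraded_at changed on checkin after the upgrade completed")
	}
}
```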

kpollich commented 8 months ago

I don't think it matters; either sequence would reproduce the bug, wouldn't it? Probably best to always use the latest versions with the most bug fixes to minimize other issues.

Yes, I should clarify: either scenario would reproduce this issue, but I meant to codify this process for future test runs. Using the latest versions sounds good to me. We'd codify this in TestRail as follows, to be run on patch releases:

Previous minor's latest release -> Current release - 1 patch -> Current release

e.g. 8.11.4 -> 8.12.1 -> 8.12.2

For minors, we'd stick with

Previous minor's latest release -> Current release

e.g. 8.11.4 -> 8.12.0

This regression test using real agents is a good idea, but it also feels like you could write an automated test for the upgrade state directly in Fleet Server. The simplest version of this would use mock agents (similar to horde), and you could query the resulting changes out of Elasticsearch directly, although a better test would probably use the Fleet API in Kibana.

I agree that ultimately this case should be covered in Fleet Server tests. There are substantial barriers to handling this in Kibana CI (we need to spawn "real" agents off of snapshot builds, for example) that don't exist in Fleet Server.

Spawning a live Kibana server in Fleet Server CI is a good idea, but I don't know that we do that today. I know that's how the agent tests we're talking about work, so we could do the same in Fleet Server for better test fidelity.

I'm working on capturing all of this in an RCA doc that I'll send out later today, then we'll meet tomorrow as a group to make sure we're aligned on next steps.

amolnater-qasource commented 8 months ago

Hi Team,

We have created one testcase for this scenario under our Fleet test suite at the link:

Please let us know if anything else is required from our end. Thanks!

juliaElastic commented 8 months ago

Tested locally with fleet-server (8.12.0) and agent (8.11.4) enrolled (using multipass VMs).

image

image

image

amolnater-qasource commented 8 months ago

Hi Team,

We have revalidated this issue on an 8.12.2 BC1 Kibana cloud environment and had the below observations:

Observations:

Logs: elastic-agent-diagnostics-2024-02-21T09-38-41Z-00.zip

Build details:
VERSION: 8.12.2
BUILD: 70281
COMMIT: f5bd489c5ff9c676c4f861c42da6ea99ae350832

Hence, we are marking this issue as QA:Validated.

Please let us know if we are missing anything here. Thanks!