I found the bug: upgrade_details is set to null when the upgrade is complete, and the logic looks at len(agent.UpgradeDetails) to decide if the previous agent doc had upgrade_details, which evaluates to true for null (len is 4), and so sets upgraded_at to now at every checkin.
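To illustrate the failure mode, here is a minimal, self-contained sketch (the Agent struct and helper names are placeholders that assume upgrade_details is carried as raw JSON bytes; this is not the actual fleet-server code):

```go
package main

import (
	"bytes"
	"fmt"
)

// Agent is a stand-in for the relevant part of the agent document; in this
// sketch UpgradeDetails holds the raw JSON exactly as stored.
type Agent struct {
	UpgradeDetails []byte
}

// hadUpgradeDetailsBuggy mirrors the flawed length check: the raw JSON
// literal `null` has len == 4, so this still returns true after the field
// was cleared at the end of an upgrade.
func hadUpgradeDetailsBuggy(a Agent) bool {
	return len(a.UpgradeDetails) != 0
}

// hadUpgradeDetails treats both an absent field and an explicit JSON null
// as "no upgrade details".
func hadUpgradeDetails(a Agent) bool {
	trimmed := bytes.TrimSpace(a.UpgradeDetails)
	return len(trimmed) != 0 && !bytes.Equal(trimmed, []byte("null"))
}

func main() {
	cleared := Agent{UpgradeDetails: []byte("null")} // upgrade completed, field set to null
	fmt.Println(hadUpgradeDetailsBuggy(cleared))     // true  -> upgraded_at reset on every checkin
	fmt.Println(hadUpgradeDetails(cleared))          // false -> upgraded_at left alone
}
```

With the buggy check, a completed upgrade (upgrade_details cleared to null) still looks like it had upgrade details, so upgraded_at keeps being refreshed on every checkin, which is exactly what blocks the next upgrade in the Fleet UI.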
@juliaElastic does it mean a newly installed agent on 8.12.0 would have successfully been upgraded to 8.12.1?
@amolnater-qasource do you have a scenario where an agent on the previous minor (8.11) is upgraded to the next minor and then all the patches, i.e. 8.12.0 then 8.12.1?
@kpollich @juliaElastic is there any way for us to automatically test this?
does it mean a newly installed agent on 8.12.0 would have successfully been upgraded to 8.12.1?
An agent on 8.12.0 cannot be upgraded to 8.12.1 via Fleet UI currently without the workaround Julia drafted here: https://github.com/elastic/fleet-server/pull/3264#issuecomment-1936216485.
is there any way for us to automatically test this?
We need an automated test on all release branches where an agent on the latest available patch release for that branch is upgraded to the build for the current HEAD of that release branch. e.g. on the 8.12 branch, we'd run an upgrade for an agent running the released 8.12.0 agent binary to the current 8.12.0-SNAPSHOT build built off the release branch.
Additionally, we could have a daily run that does the same using the daily snapshot build instead of a PR build.
@pierrehilbert @cmacknz regarding Kyle's comment above, isn't this something we already test in the elastic agent testing framework?
Hi @jlind23
@amolnater-qasource do you have a scenario where an agent on the previous minor (8.11) is upgraded to the next minor and then all the patches, i.e. 8.12.0 then 8.12.1?
We don't have a documented test case for this scenario; we cover it as part of exploratory testing. Please let us know if we should create a test case for this.
Testing details: While testing on a BC build (for example 8.12.1), if we have to upgrade an 8.12.0 agent, we have to trigger the upgrade using the API (as of now) from Dev Tools, because the upgrade doesn't show as available until 8.12.1 is actually released.
We weren't able to upgrade directly to 8.12.1 BC1 from the previous version (8.12.0) using the Fleet UI.
Thanks!
We weren't able to upgrade directly to 8.12.1 BC1 from the previous version (8.12.0) using the Fleet UI.
@amolnater-qasource but once you were able to test 8.12.0 to 8.12.1, it worked, right?
@jlind23 We have revalidated on the released 8.12.1 and observed that we are not able to upgrade from the UI.
Screenshots/Recordings:
https://github.com/elastic/fleet-server/assets/77374876/e366ef2f-2008-42d5-8285-7badd4e170cf
Please let us know if anything else is required from our end. Thanks!!
While testing on a BC build (for example 8.12.1), if we have to upgrade an 8.12.0 agent, we have to trigger the upgrade using the API (as of now) from Dev Tools, because the upgrade doesn't show as available until 8.12.1 is actually released.
Thanks @amolnater-qasource, this makes sense as the next patch release isn't published during the BC phase and thus won't be shown in Fleet UI (maybe something we can file an enhancement for). To clarify: was upgrading from 8.12.0 -> 8.12.1 via the API successful during the BC test? If so, can you share a summary of the test steps used as well? My TestRail access has lapsed as I don't log in frequently 🙃, otherwise I would check myself. Many thanks.
@pierrehilbert @cmacknz regarding Kyle's comment above, isn't this something we already test in the elastic agent testing framework?
We test a single upgrade; that is, we install an agent build from the head of the current branch and upgrade it to the latest snapshot in that branch. This would be 8.13.0-SNAPSHOT or 8.12.0-SNAPSHOT for main and 8.12 respectively.
We don't test two consecutive upgrades because, from the agent's perspective, there is no reason to: once the agent completes the upgrade state machine reported in the upgrade details, it can upgrade again. There is no other state in the agent that can prevent this.
@amolnater-qasource We looked at testrail with @kpollich and it looks like the test case below does not exist:
We are not able to upgrade from 8.12.0 > 8.12.1 via bulk actions either.
Can we make sure this is added please?
We test a single upgrade; that is, we install an agent build from the head of the current branch and upgrade it to the latest snapshot in that branch. This would be 8.13.0-SNAPSHOT or 8.12.0-SNAPSHOT for main and 8.12 respectively.
We don't test two consecutive upgrades because, from the agent's perspective, there is no reason to: once the agent completes the upgrade state machine reported in the upgrade details, it can upgrade again. There is no other state in the agent that can prevent this.
Thanks, Craig. I think the agent test coverage is sufficient here and consecutive updates aren't something we should pursue adding. The coverage gaps lie elsewhere in Fleet.
To clarify: was upgrading from 8.12.0 -> 8.12.1 via the API successful during the BC test?
@kpollich yes, the direct 8.12.0 > 8.12.1 BC1 upgrade was successful; this is part of our test cases, using the below API under Dev Tools:
POST kbn:/api/fleet/agents/<agent-id>/upgrade
{
"version": "8.12.1"
}
Further, even on a released 8.12.1 Kibana environment, we are successfully able to upgrade 8.12.0 > 8.12.1 from the Fleet UI.
Screen Recording:
https://github.com/elastic/fleet-server/assets/77374876/39c87f49-46bd-41fa-915d-52dc645cbfc5
If so, can you share a summary of the test steps used as well?
We do not have any test case for upgrading agents twice, like 8.11.4 > 8.12.0 > 8.12.1.
However, we have test cases for upgrading from one version lower on all OSes.
Please let us know if anything else is required from our end.
cc: @jlind23 Thanks!
@kpollich @juliaElastic according to https://github.com/elastic/fleet-server/issues/3263#issuecomment-1938845698 it means that a fresh install on 8.12.0 can be upgraded to 8.12.1, is that expected?
Thanks @amolnater-qasource - this is extremely helpful in understanding our existing test coverage here.
according to https://github.com/elastic/fleet-server/issues/3263#issuecomment-1938845698 it means that a fresh install on 8.12.0 can be upgraded to 8.12.1, is that expected?
I can confirm this is working as expected. I created a fresh 8.12.1 cloud instance (which naturally deploys a Fleet Server on 8.12.1 as well), then enrolled an agent running 8.12.0. Find my observations below:
- The Upgrade available filter behaves as expected immediately following enrollment.
So, if I'm understanding the smoke tests properly, we wouldn't have caught this issue in smoke tests. In our smoke tests, we create a cloud instance on the latest release, then enroll an agent on the previous release, then attempt to upgrade it. To confirm this, I performed the same steps as above, but initially enrolled an agent on 8.11.4, then upgraded that agent to 8.12.0 instead of starting with a fresh 8.12.0 agent. Observations below:
- The Upgrade available badge appears as expected when on 8.11.4, with 8.12.0 and 8.12.1 available, and the agent updates to 8.12.0 as expected.
So, in order to catch this bug in the QAS smoke tests, we would've needed to test a sequential upgrade from 8.11.4 -> 8.12.0 -> 8.12.1, or in generic terms: previous minor's latest patch -> current minor -> current minor's latest patch. Codifying this into a regression test seems like a good idea, but it's hard to decide what the test case should be. When we go to test 8.12.2, should the sequential upgrade test case be 8.11.4 -> 8.12.0 -> 8.12.2 or 8.11.4 -> 8.12.1 -> 8.12.2?
When we go to test 8.12.2 should the sequential upgrade test case be 8.11.4 -> 8.12.0 -> 8.12.2 or 8.11.4 -> 8.12.1 -> 8.12.2?
I don't think it matters; either sequence would reproduce the bug, wouldn't it? Probably best to always use the latest versions with the most bug fixes to minimize other issues.
This regression test using real agents is a good idea, but it also feels like you could write an automated test for the upgrade state directly in Fleet Server. The simplest version of this would use mock agents (similar to horde), and you could query the resulting changes out of Elasticsearch directly, although a better test would probably use the Fleet API in Kibana.
The agent test framework we use can provide the guarantee that the agent half of the upgrade works as expected, so you don't need to reverify that.
Using mock agents would also allow you to have them do adversarial things like make requests with incorrect and out-of-order upgrade details. While Fleet shouldn't have to verify the agent part of the contract, it also shouldn't assume the agent will never have a bug in how it talks to Fleet, and it should defend itself against that.
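As a rough sketch of that idea (not an existing test), a mock-agent regression test against a locally running fleet-server and Elasticsearch could look like the following; the checkin endpoint path, request body shape, credentials, and direct index access are simplified assumptions rather than the exact production contract:

```go
package fleetserver_test

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"testing"
	"time"
)

// fetchUpgradedAt reads upgraded_at straight out of the agent document in the
// .fleet-agents index (assumes a local, unsecured Elasticsearch for brevity).
func fetchUpgradedAt(t *testing.T, esURL, agentID string) string {
	t.Helper()
	resp, err := http.Get(fmt.Sprintf("%s/.fleet-agents/_doc/%s", esURL, agentID))
	if err != nil {
		t.Fatal(err)
	}
	defer resp.Body.Close()
	var doc struct {
		Source struct {
			UpgradedAt string `json:"upgraded_at"`
		} `json:"_source"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&doc); err != nil {
		t.Fatal(err)
	}
	return doc.Source.UpgradedAt
}

// TestCheckinDoesNotResetUpgradedAt sends a fake agent checkin with
// upgrade_details already cleared to null (as a real agent does after an
// upgrade completes) and asserts that fleet-server leaves upgraded_at alone.
func TestCheckinDoesNotResetUpgradedAt(t *testing.T) {
	const (
		fleetURL = "http://localhost:8220" // assumed local fleet-server
		esURL    = "http://localhost:9200" // assumed local Elasticsearch
		agentID  = "mock-agent-id"         // enrolled earlier in test setup (not shown)
		apiKey   = "mock-access-api-key"   // the mock agent's access API key
	)

	before := fetchUpgradedAt(t, esURL, agentID)

	body := []byte(`{"status":"online","message":"checkin","upgrade_details":null}`)
	req, err := http.NewRequest(http.MethodPost,
		fmt.Sprintf("%s/api/fleet/agents/%s/checkin", fleetURL, agentID),
		bytes.NewReader(body))
	if err != nil {
		t.Fatal(err)
	}
	req.Header.Set("Authorization", "ApiKey "+apiKey)
	req.Header.Set("Content-Type", "application/json")

	// The checkin endpoint long-polls, so fire it with a short timeout and
	// ignore the response; we only care about the side effect on the doc.
	client := &http.Client{Timeout: 5 * time.Second}
	if resp, err := client.Do(req); err == nil {
		resp.Body.Close()
	}

	time.Sleep(2 * time.Second) // allow fleet-server time to persist any change
	if after := fetchUpgradedAt(t, esURL, agentID); after != before {
		t.Fatalf("upgraded_at changed on checkin: %s -> %s", before, after)
	}
}
```

The same harness could then be extended with deliberately malformed or out-of-order upgrade_details payloads to cover the adversarial cases mentioned above.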
I don't think it matters; either sequence would reproduce the bug, wouldn't it? Probably best to always use the latest versions with the most bug fixes to minimize other issues.
Yes, I should clarify: either scenario would reproduce this issue, but I meant to codify this process for future test runs. Using the latest versions sounds good to me. We'd codify this in TestRail as follows, to be run on patch releases:
Previous minor's latest release -> Current release - 1 patch -> Current release
e.g. 8.11.4 -> 8.12.1 -> 8.12.2
For minors, we'd stick with:
Previous minor's latest release -> Current release
e.g. 8.11.4 -> 8.12.0
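To make that scheme concrete, here's a small hypothetical helper (not existing tooling) that derives the chain of versions to test for a given target release; the previous minor's latest release has to be passed in, since it can't be computed from the target version alone:

```go
package main

import "fmt"

// upgradeChain returns the sequence of agent versions to run through when
// testing an upgrade to the target release, following the scheme above.
// prevMinorLatest (e.g. "8.11.4") has to be looked up separately.
func upgradeChain(prevMinorLatest string, major, minor, patch int) []string {
	target := fmt.Sprintf("%d.%d.%d", major, minor, patch)
	if patch == 0 {
		// Minor release: previous minor's latest release -> current release.
		return []string{prevMinorLatest, target}
	}
	// Patch release: previous minor's latest release -> current release - 1 patch -> current release.
	previousPatch := fmt.Sprintf("%d.%d.%d", major, minor, patch-1)
	return []string{prevMinorLatest, previousPatch, target}
}

func main() {
	fmt.Println(upgradeChain("8.11.4", 8, 12, 2)) // [8.11.4 8.12.1 8.12.2]
	fmt.Println(upgradeChain("8.11.4", 8, 12, 0)) // [8.11.4 8.12.0]
}
```

The resulting version pairs or triples could then feed whatever actually performs the sequential upgrades, whether that's TestRail steps or an automated job.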
This regression test using real agents is a good idea, but it also feels like you could write an automated test for the upgrade state directly in Fleet Server. The simplest version of this would use mock agents (similar to horde), and you could query the resulting changes out of Elasticsearch directly, although a better test would probably use the Fleet API in Kibana.
I agree that ultimately this case should be covered in Fleet Server tests. There are substantial barriers to handling this in Kibana CI (we need to spawn "real" agents off of snapshot builds, for example) that don't exist in Fleet Server.
Spawning a live Kibana server in Fleet Server CI is a good idea, but I don't know that we do that today. I know that's how the agent tests we're talking about work, so we could also do this in Fleet Server for better test fidelity.
I'm working on capturing all of this in a RCA doc that I'll send out later today, then we'll meet tomorrow as a group to make sure we're aligned on next steps.
Hi Team,
We have created 01 test case for this scenario under our Fleet test suite at the following link:
Please let us know if anything else is required from our end. Thanks!
Tested locally with fleet-server (8.12.0) and agent (8.11.4) enrolled (using multipass VMs).
- Used the 8.12.2-SNAPSHOT build from https://snapshots.elastic.co/8.12.2-5f8ffc93/downloads, upgrading by manually typing in the version 8.12.2-SNAPSHOT.
- The Upgrade agent action is enabled for both fleet-server and agent.
Hi Team,
We have revalidated this issue on the 8.12.2 BC1 Kibana cloud environment and had the below observations:
Logs: elastic-agent-diagnostics-2024-02-21T09-38-41Z-00.zip
Build details: VERSION: 8.12.2 BUILD: 70281 COMMIT: f5bd489c5ff9c676c4f861c42da6ea99ae350832
Hence, we are marking this issue as QA:Validated.
Please let us know if we are missing anything here. Thanks!
Stack version 8.12.1 and possibly others.
There seems to be an issue where agents' upgraded_at field keeps being updated to the current time, and this results in Fleet UI not showing Upgrade available when it should, and the Upgrade agent action being disabled, because Fleet UI doesn't consider an agent upgradeable if the agent was updated in the last 10 minutes. It's not clear yet if the issue is on the fleet-server or agent side.
Reproduced on a fresh 8.12.1 cluster by enrolling an 8.11.4 agent, upgrading to 8.12.0, and waiting 10 minutes. The agent is still not allowed to be upgraded again to 8.12.1, and the upgraded_at field looks recent, even though the last upgrade happened more than 10 minutes ago.
Workaround: upgrade via the API with the "force": true flag, or remove the upgrade_details: null value from agent docs.
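For reference, here is a minimal sketch of the first workaround as a direct API call from Go (Kibana host, credentials, and agent ID are placeholders):

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	kibanaURL := "https://localhost:5601" // placeholder Kibana host
	agentID := "<agent-id>"               // placeholder agent ID

	// Same call as the Dev Tools example above, with the force flag added.
	body := []byte(`{"version": "8.12.1", "force": true}`)
	req, err := http.NewRequest(http.MethodPost,
		fmt.Sprintf("%s/api/fleet/agents/%s/upgrade", kibanaURL, agentID),
		bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("kbn-xsrf", "true") // required header for Kibana APIs
	req.Header.Set("Content-Type", "application/json")
	req.SetBasicAuth("elastic", "changeme") // placeholder credentials

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```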