Open harshitgupta-qasource opened 4 months ago
Pinging @elastic/fleet (Team:Fleet)
@amolnater-qasource Kindly review
Secondary review for this ticket is Done.
@jillguyonnet - Could you weigh in on this? AFAIU, the "review errors" badge should appear when the polling request detects an error in this case, right?
@kpollich That's correct, with a caveat that the polling request only queries the last 35 seconds (this comment details the logic). It would be good to clarify a few details in order to understand this scenario.
"status":"FAILED"
and associated errors, while the agents on Multipass got an action status item with "status":"COMPLETE"
. Consequently, the "Review errors" badge only showed up for the horde agent. (As a side note, I would like to clarify where this difference is coming from, I'm not sure whether it's expected.)Action status after failed upgrade for horde agent:
{"actionId":"1345158b-e460-462c-b480-48f691147bce","nbAgentsActionCreated":1,"nbAgentsAck":0,"version":"8.11.22","type":"UPGRADE","nbAgentsActioned":1,"status":"FAILED","expiration":"2024-06-13T16:39:46.846Z","creationTime":"2024-05-14T16:39:46.846Z","nbAgentsFailed":1,"hasRolloutPeriod":false,"completionTime":"0001-01-01T00:00:00.000Z","latestErrors":[{"agentId":"3d485f27-db35-41fb-af80-f4b122a254cc","error":"HTTP Fail","timestamp":"0001-01-01T00:00:00Z","hostname":"eh-Snakerowan-5Nbx"}]}
Action status after failed upgrade for 2 agents on Multipass:
{"actionId":"4bc043d8-026a-4e86-8907-8b4beb9f329a","nbAgentsActionCreated":2,"nbAgentsAck":2,"version":"8.12.9","startTime":"2024-05-14T16:24:16.988Z","type":"UPGRADE","nbAgentsActioned":2,"status":"COMPLETE","expiration":"2024-06-13T16:24:16.988Z","creationTime":"2024-05-14T16:24:30.324Z","nbAgentsFailed":0,"hasRolloutPeriod":false,"completionTime":"2024-05-14T16:39:24.952Z","latestErrors":[]}
In the example below, I made failed upgrades by manually entering an invalid version; the horde agent failed upgrade resulted in an action status item with "status":"FAILED" and associated errors, while the agents on Multipass got an action status item with "status":"COMPLETE". Consequently, the "Review errors" badge only showed up for the horde agent. (As a side note, I would like to clarify where this difference is coming from, I'm not sure whether it's expected.)
The horde implementation has diverged from the agent somehow, but it's not clear just reading this what it might be.
What version did you use when you tested this? Depending on the exact format it might hit different parts of the agent code. For example if it looked valid but didn't exist I'd have expected the agent to attempt to download it and report recurring failures doing that.
The horde implementation has diverged from the agent somehow, but it's not clear just reading this what it might be.
What version did you use when you tested this? Depending on the exact format it might hit different parts of the agent code. For example if it looked valid but didn't exist I'd have expected the agent to attempt to download it and report recurring failures doing that.
I agree it's not clear from this testing. The version difference is a good point, so I redid a quick test with the following 3 agents. The TL;DR is horde agents fail fast with a failed request error, probably because they are trying to fetch a nonexistent resource. In contrast, the agent I enrolled manually on a VM did try the upgrade.
UPG_DOWNLOADING
, there was an error message as expected in the upgrade details metadata. The agent stayed in that state for a few minutes before the upgrade became stuck in failed state.Shortly after starting the upgrade:
Agent details page:
After a few minutes, Fleet status is back to healthy:
After a few more minutes, the upgrade stops retrying and a warning message is shown:
Agent details page:
Agent JSON:
HTTP Fail
. I did not see the agent going to Updating status.Immediately after trying to upgrade to 8.6.9:
Agent details page:
Agent activity with error:
For the real agent that is what I expected to see. It will retry the download until the download timeout expires, by default this is two hours. After that it should report the upgrade as failed.
@cmacknz Can we configure the download timeout? It would make testing this a lot easier.
I think that agent didn't respect it when sent from the Fleet override API, but it's been a while since I tested this: https://github.com/elastic/elastic-agent/issues/4580
Kibana Build details:
Preconditions:
Steps to reproduce:
Expected Result: On agent upgrade failure for first time, review error badge should display.
Screen Shot: