elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Fleet]: On agent upgrade failure for first time, review error badge is not displayed #183243

Open harshitgupta-qasource opened 4 months ago

harshitgupta-qasource commented 4 months ago

Kibana Build details:

VERSION: 8.14.0 BC4
BUILD: 73836
COMMIT: 23ed1207772b3ae958cb05bc4cdbe39b83507707

Preconditions:

  1. An 8.14.0-BC4 Kibana cloud environment should be available.
  2. An 8.13.4 agent should be deployed.
  3. A wrong agent binary should be added (so that the upgrade fails).

Steps to reproduce:

  1. Navigate to the Fleet > Agents tab.
  2. Select 2-3 agents by clicking their checkboxes.
  3. Click the actions button and select the upgrade agents action.
  4. Enter the latest agent version and perform the upgrade.
  5. Wait for 10-20 minutes.
  6. Observe that when the agent upgrade fails for the first time, the review error badge is not displayed.

Expected Result: When the agent upgrade fails for the first time, the review error badge should be displayed.

Screen Shot: (attached image)
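For reference, the bulk upgrade in step 4 can also be triggered directly against the Fleet API. A minimal sketch in TypeScript; the Kibana URL, API key, and agent IDs are placeholders, not values from this issue:

```ts
// Minimal sketch: trigger the same bulk upgrade as step 4 via the Fleet API.
// KIBANA_URL, API_KEY, and the agent IDs are placeholders.
const KIBANA_URL = "https://my-deployment.kb.example.com";
const API_KEY = "<api-key>";

async function bulkUpgradeAgents(agentIds: string[], version: string): Promise<void> {
  const res = await fetch(`${KIBANA_URL}/api/fleet/agents/bulk_upgrade`, {
    method: "POST",
    headers: {
      "kbn-xsrf": "true",
      "Content-Type": "application/json",
      Authorization: `ApiKey ${API_KEY}`,
    },
    body: JSON.stringify({ agents: agentIds, version }),
  });
  if (!res.ok) {
    throw new Error(`Bulk upgrade request failed: ${res.status} ${await res.text()}`);
  }
}

// Example: upgrade two agents to 8.14.0.
// await bulkUpgradeAgents(["<agent-id-1>", "<agent-id-2>"], "8.14.0");
```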

elasticmachine commented 4 months ago

Pinging @elastic/fleet (Team:Fleet)

harshitgupta-qasource commented 4 months ago

@amolnater-qasource Kindly review

amolnater-qasource commented 4 months ago

Secondary review for this ticket is Done.

kpollich commented 4 months ago

@jillguyonnet - Could you weigh in on this? AFAIU, the "review errors" badge should appear when the polling request detects an error in this case, right?

jillguyonnet commented 4 months ago

@kpollich That's correct, with the caveat that the polling request only queries the last 35 seconds (this comment details the logic). It would be good to clarify a few details in order to understand this scenario.

  1. The first thing to check should be whether there is an actual error in the agent activity flyout. While I was testing this, I noticed that there wasn't one in every scenario. In the example below, I made failed upgrades by manually entering an invalid version; the horde agent's failed upgrade resulted in an action status item with "status":"FAILED" and associated errors, while the agents on Multipass got an action status item with "status":"COMPLETE". Consequently, the "Review errors" badge only showed up for the horde agent. (As a side note, I would like to clarify where this difference is coming from; I'm not sure whether it's expected.)

Action status after failed upgrade for horde agent:

{"actionId":"1345158b-e460-462c-b480-48f691147bce","nbAgentsActionCreated":1,"nbAgentsAck":0,"version":"8.11.22","type":"UPGRADE","nbAgentsActioned":1,"status":"FAILED","expiration":"2024-06-13T16:39:46.846Z","creationTime":"2024-05-14T16:39:46.846Z","nbAgentsFailed":1,"hasRolloutPeriod":false,"completionTime":"0001-01-01T00:00:00.000Z","latestErrors":[{"agentId":"3d485f27-db35-41fb-af80-f4b122a254cc","error":"HTTP Fail","timestamp":"0001-01-01T00:00:00Z","hostname":"eh-Snakerowan-5Nbx"}]}

Action status after failed upgrade for 2 agents on Multipass:

{"actionId":"4bc043d8-026a-4e86-8907-8b4beb9f329a","nbAgentsActionCreated":2,"nbAgentsAck":2,"version":"8.12.9","startTime":"2024-05-14T16:24:16.988Z","type":"UPGRADE","nbAgentsActioned":2,"status":"COMPLETE","expiration":"2024-06-13T16:24:16.988Z","creationTime":"2024-05-14T16:24:30.324Z","nbAgentsFailed":0,"hasRolloutPeriod":false,"completionTime":"2024-05-14T16:39:24.952Z","latestErrors":[]}
Screenshot 2024-05-14 at 18 50 02
  2. If there is an actual error, the next thing to investigate is whether the polling request actually catches it (which would cause the badge to render). As noted above, the polling request fetches the most recent actions from the last 35 seconds; in theory, if the Agents page stays open and is not refreshed, the badge should render at some point. It would be great to confirm that it doesn't render and then disappear (which could unfortunately be tedious to verify). Otherwise, if the badge never renders, I'm wondering if this might be a case of the action being "older" (i.e. created before the upgrade failed) and later updated to failed status, which would cause the polling to never catch it (see the sketch after this list). If the latter scenario is confirmed, then it's definitely a bug.
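To make the suspected gap concrete, here is a rough sketch of a polling check that only looks at actions created within the last 35 seconds. This is not the actual Kibana implementation; the field names mirror the action status JSON above, while the endpoint and exact logic are assumptions for illustration:

```ts
// Rough sketch of the "Review errors" polling logic discussed above (not Kibana's code).
// It only considers actions *created* in the last 35 seconds.
interface ActionStatus {
  actionId: string;
  status: string; // e.g. "FAILED" | "COMPLETE" | "IN_PROGRESS"
  creationTime: string;
  latestErrors?: Array<{ agentId: string; error: string; timestamp: string }>;
}

const POLL_WINDOW_MS = 35_000; // the 35-second window mentioned above

async function shouldShowReviewErrorsBadge(kibanaUrl: string, apiKey: string): Promise<boolean> {
  const res = await fetch(`${kibanaUrl}/api/fleet/agents/action_status`, {
    headers: { "kbn-xsrf": "true", Authorization: `ApiKey ${apiKey}` },
  });
  const { items }: { items: ActionStatus[] } = await res.json();
  const cutoff = Date.now() - POLL_WINDOW_MS;
  return items.some(
    (action) =>
      new Date(action.creationTime).getTime() >= cutoff &&
      (action.status === "FAILED" || (action.latestErrors ?? []).length > 0)
  );
}
```

With a check like this, an action created before the failure and only later updated to "FAILED" would fall outside the window and never trigger the badge.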
cmacknz commented 4 months ago

> In the example below, I made failed upgrades by manually entering an invalid version; the horde agent's failed upgrade resulted in an action status item with "status":"FAILED" and associated errors, while the agents on Multipass got an action status item with "status":"COMPLETE". Consequently, the "Review errors" badge only showed up for the horde agent. (As a side note, I would like to clarify where this difference is coming from; I'm not sure whether it's expected.)

The horde implementation has diverged from the agent somehow, but it's not clear just reading this what it might be.

What version did you use when you tested this? Depending on the exact format it might hit different parts of the agent code. For example, if it looked valid but didn't exist, I'd have expected the agent to attempt to download it and report recurring failures doing that.

jillguyonnet commented 4 months ago

> The horde implementation has diverged from the agent somehow, but it's not clear just reading this what it might be.
>
> What version did you use when you tested this? Depending on the exact format it might hit different parts of the agent code. For example, if it looked valid but didn't exist, I'd have expected the agent to attempt to download it and report recurring failures doing that.

I agree it's not clear from this testing. The version difference is a good point, so I redid a quick test with the following 3 agents. The TL;DR is that horde agents fail fast with a failed request error, probably because they are trying to fetch a nonexistent resource. In contrast, the agent I enrolled manually on a VM did try the upgrade.

  1. An agent on a Multipass VM enrolled on version 8.12.0. I tried an upgrade to 8.12.9: fairly quickly, while the agent's upgrade details had status UPG_DOWNLOADING, there was an error message as expected in the upgrade details metadata. The agent stayed in that state for a few minutes before the upgrade became stuck in a failed state (a sketch of the upgrade_details shape from the agent JSON below follows this list).

Shortly after starting the upgrade:

Screenshot 2024-05-15 at 10 11 38

Agent details page:

Screenshot 2024-05-15 at 10 11 54

After a few minutes, Fleet status is back to healthy:

Screenshot 2024-05-15 at 10 25 50

After a few more minutes, the upgrade stops retrying and a warning message is shown:

Screenshot 2024-05-15 at 10 31 12

Agent details page:

Screenshot 2024-05-15 at 10 31 24

Agent JSON:

Click to expand ```json { "id": "d4666fb3-eb14-43ac-bd23-689b06ae0b60", "type": "PERMANENT", "active": true, "enrolled_at": "2024-05-15T08:08:37Z", "upgraded_at": "2024-05-15T08:25:26Z", "upgrade_started_at": null, "upgrade_details": { "metadata": { "retry_error_msg": "unable to download package: 2 errors occurred:\n\t* package '/opt/Elastic/Agent/data/elastic-agent-5cbf2e/downloads/elastic-agent-8.12.9-linux-arm64.tar.gz' not found: open /opt/Elastic/Agent/data/elastic-agent-5cbf2e/downloads/elastic-agent-8.12.9-linux-arm64.tar.gz: no such file or directory\n\t* call to 'https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.12.9-linux-arm64.tar.gz' returned unsuccessful status code: 404\n\n", "retry_until": "2024-05-15T12:10:48.014713177+02:00", "error_msg": "failed download of agent binary: unable to download package: 2 errors occurred:\n\t* package '/opt/Elastic/Agent/data/elastic-agent-5cbf2e/downloads/elastic-agent-8.12.9-linux-arm64.tar.gz' not found: open /opt/Elastic/Agent/data/elastic-agent-5cbf2e/downloads/elastic-agent-8.12.9-linux-arm64.tar.gz: no such file or directory\n\t* call to 'https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.12.9-linux-arm64.tar.gz' returned unsuccessful status code: 404\n\n", "failed_state": "UPG_DOWNLOADING" }, "action_id": "6fd4c70e-0fe7-40e5-995c-66b89877b5bd", "state": "UPG_FAILED", "target_version": "8.12.9" }, "access_api_key_id": "znRLe48BFQaBON2J0pQx", "policy_id": "e9b7752e-2527-4759-a11d-01220a89fcec", "last_checkin": "2024-05-15T08:38:58Z", "last_checkin_status": "online", "last_checkin_message": "Running", "policy_revision": 1, "packages": [], "sort": [ 1715760517000 ], "outputs": { "default": { "api_key_id": "0HRLe48BFQaBON2J3ZSu", "type": "elasticsearch" } }, "components": [ { "id": "log-default", "type": "log", "status": "HEALTHY", "message": "Healthy: communicating with pid '2232'", "units": [ { "id": "log-default-logfile-system-723bd4a9-11af-4eef-bb5e-06d03c84f17b", "type": "input", "status": "HEALTHY", "message": "Healthy" }, { "id": "log-default", "type": "output", "status": "HEALTHY", "message": "Healthy" } ] }, { "id": "system/metrics-default", "type": "system/metrics", "status": "HEALTHY", "message": "Healthy: communicating with pid '2237'", "units": [ { "id": "system/metrics-default-system/metrics-system-723bd4a9-11af-4eef-bb5e-06d03c84f17b", "type": "input", "status": "HEALTHY", "message": "Healthy" }, { "id": "system/metrics-default", "type": "output", "status": "HEALTHY", "message": "Healthy" } ] }, { "id": "filestream-monitoring", "type": "filestream", "status": "HEALTHY", "message": "Healthy: communicating with pid '2242'", "units": [ { "id": "filestream-monitoring-filestream-monitoring-agent", "type": "input", "status": "HEALTHY", "message": "Healthy" }, { "id": "filestream-monitoring", "type": "output", "status": "HEALTHY", "message": "Healthy" } ] }, { "id": "beat/metrics-monitoring", "type": "beat/metrics", "status": "HEALTHY", "message": "Healthy: communicating with pid '2249'", "units": [ { "id": "beat/metrics-monitoring-metrics-monitoring-beats", "type": "input", "status": "HEALTHY", "message": "Healthy" }, { "id": "beat/metrics-monitoring", "type": "output", "status": "HEALTHY", "message": "Healthy" } ] }, { "id": "http/metrics-monitoring", "type": "http/metrics", "status": "HEALTHY", "message": "Healthy: communicating with pid '2256'", "units": [ { "id": "http/metrics-monitoring-metrics-monitoring-agent", "type": "input", "status": "HEALTHY", "message": "Healthy" }, { 
"id": "http/metrics-monitoring", "type": "output", "status": "HEALTHY", "message": "Healthy" } ] } ], "agent": { "id": "d4666fb3-eb14-43ac-bd23-689b06ae0b60", "version": "8.12.0" }, "local_metadata": { "elastic": { "agent": { "build.original": "8.12.0 (build: 5cbf2e403c761f91d11eca6b9cb5385e0f07f2ce at 2024-01-11 13:25:49 +0000 UTC)", "complete": false, "id": "d4666fb3-eb14-43ac-bd23-689b06ae0b60", "log_level": "info", "snapshot": false, "upgradeable": true, "version": "8.12.0" } }, "host": { "architecture": "aarch64", "hostname": "agent1", "id": "1b252dddb2544378813a2756173ad9ab", "ip": [ "127.0.0.1/8", "::1/128", "192.168.82.10/24", "fdf3:d299:5a7d:9ea6:5054:ff:fe1a:b5b9/64", "fe80::5054:ff:fe1a:b5b9/64" ], "mac": [ "52:54:00:1a:b5:b9" ], "name": "agent1" }, "os": { "family": "debian", "full": "Ubuntu noble(24.04 LTS (Noble Numbat))", "kernel": "6.8.0-31-generic", "name": "Ubuntu", "platform": "ubuntu", "version": "24.04 LTS (Noble Numbat)" } }, "unhealthy_reason": null, "status": "online", "metrics": { "cpu_avg": 0.01083, "memory_size_byte_avg": 139992856 } } ```
  2. A horde agent enrolled on version 8.6.0 (default), as in my previous test. The upgrade to a nonexistent version quickly failed with HTTP Fail. I did not see the agent go to Updating status.

Immediately after trying to upgrade to 8.6.9:

Screenshot 2024-05-15 at 10 12 48

Agent details page:

Screenshot 2024-05-15 at 10 13 00

Agent activity with error:

Screenshot 2024-05-15 at 10 13 35
  3. Another horde agent enrolled on version 8.12.0. The upgrade to a nonexistent version failed in the same way as for the 8.6.0 horde agent.
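For reference, here is a sketch of the upgrade_details object seen in the agent JSON for the first agent above, with an illustrative helper showing how the states could be read. The type mirrors the JSON fields from this thread; the helper itself is an assumption for illustration, not Kibana's actual implementation:

```ts
// Shape of upgrade_details as seen in the agent JSON above (fields taken from that JSON).
interface UpgradeDetails {
  state: string; // e.g. "UPG_DOWNLOADING" | "UPG_FAILED"
  action_id: string;
  target_version: string;
  metadata?: {
    retry_error_msg?: string; // last error seen while retrying the download
    retry_until?: string; // deadline after which retries stop
    error_msg?: string; // final error once the upgrade is marked failed
    failed_state?: string; // state the upgrade was in when it failed, e.g. "UPG_DOWNLOADING"
  };
}

// Illustrative helper: summarize the upgrade state the way the UI roughly presents it.
function describeUpgrade(details: UpgradeDetails): string {
  if (details.state === "UPG_FAILED") {
    const failedState = details.metadata?.failed_state ?? "unknown state";
    const errorMsg = details.metadata?.error_msg ?? "no error message";
    return `Upgrade to ${details.target_version} failed during ${failedState}: ${errorMsg}`;
  }
  if (details.metadata?.retry_error_msg) {
    return `Upgrade to ${details.target_version} retrying until ${details.metadata.retry_until}: ${details.metadata.retry_error_msg}`;
  }
  return `Upgrade to ${details.target_version} in progress (${details.state})`;
}
```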
cmacknz commented 4 months ago

For the real agent, that is what I expected to see. It will retry the download until the download timeout expires, which defaults to two hours. After that it should report the upgrade as failed.

jillguyonnet commented 4 months ago

@cmacknz Can we configure the download timeout? It would make testing this a lot easier.

cmacknz commented 4 months ago

I think the agent didn't respect it when it was sent from the Fleet override API, but it's been a while since I tested this: https://github.com/elastic/elastic-agent/issues/4580