elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

[Fleet] The agent upgrade download timeout should be configurable. #4580

Open cmacknz opened 1 year ago

cmacknz commented 1 year ago

In https://github.com/elastic/elastic-agent/pull/1666 we increased the default agent upgrade artifact download timeout from 10 minutes to 2 hours. This was done because we were observing upgrade attempts that time out due to poor network conditions.

I recently observed a failed upgrade attempt from a connection with a download speed as low as 33.74kBps. With a 274.9MB download size this would take 274.9MB / 33.74kBps ≈ 138 minutes, or about 2 hours 18 minutes, to complete. At this speed the increased 2 hour limit is still not enough.
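
The estimate above can be reproduced with a few lines of Python. This is only a sketch: it assumes a constant transfer rate, and the helper name `download_minutes` is made up for illustration. Whether a "MB" in the agent log means 1000 kB or 1024 kB is also an assumption, so both conventions are shown.

```python
def download_minutes(size_mb: float, rate_kbps: float, kb_per_mb: int = 1000) -> float:
    """Estimated transfer time in minutes, assuming a constant rate."""
    seconds = size_mb * kb_per_mb / rate_kbps
    return seconds / 60


TIMEOUT_MINUTES = 120  # the 2 hour default mentioned in this issue

for kb_per_mb in (1000, 1024):
    minutes = download_minutes(274.9, 33.74, kb_per_mb)
    over = minutes > TIMEOUT_MINUTES
    print(f"{kb_per_mb} kB per MB: {minutes:.0f} min, exceeds timeout: {over}")
```

Under either convention the result lands between roughly 136 and 139 minutes, so the download cannot finish within the 2 hour limit at that speed.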

{
    "agent": {
      "version": "8.6.1"
    },
    "message": "download progress from https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.7.1-windows-x86_64.zip is 1.012MB/274.9MB (0.37% complete) @ 33.74kBps",
    "@timestamp": "2023-05-09T18:15:10.070Z"
}
{
    "agent": {
      "version": "8.6.1"
    },
    "message": "download from https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.7.1-windows-x86_64.zip failed at 39.19MB/274.9MB (14.26% complete) @ 65.43kBps: net/http: request canceled (Client.Timeout or context cancellation while reading body)",
    "@timestamp": "2023-05-09T17:21:40.219Z"
}

For a standalone agent the upgrade download timeout can be controlled with the agent.download.timeout parameter: https://github.com/elastic/elastic-agent/blob/29bf0ab601895fc498abf8724fc1918783956de5/elastic-agent.reference.yml#L70-L77

agent.download:
  # source of the artifacts, requires elastic like structure and naming of the binaries
  # e.g /windows-x86.zip
  sourceURI: "https://artifacts.elastic.co/downloads/beats/"
  # path to the directory containing downloaded packages
  target_directory: "${path.data}/downloads"
  # timeout for downloading package
  timeout: 120s
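
To pick a value for `timeout` on a slow link, one option is to derive it from the artifact size and the slowest observed rate, plus a safety margin. The sketch below is not part of the agent; `suggested_timeout` and the 1.5x margin are assumptions chosen for illustration, and the result is formatted in seconds to match the `120s` style used in the reference file.

```python
import math


def suggested_timeout(size_mb: float, rate_kbps: float, margin: float = 1.5) -> str:
    """Return a timeout string in seconds, padded by a safety margin.

    Assumes 1 MB = 1000 kB and a roughly constant transfer rate.
    """
    seconds = size_mb * 1000 / rate_kbps * margin
    return f"{math.ceil(seconds)}s"


# Values taken from the failed upgrade described in this issue.
print(suggested_timeout(274.9, 33.74))
```

The printed value could then be placed in the `timeout` field of the `agent.download` section for a standalone agent.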

For Fleet managed agents we should allow configuring the download timeout. It is not possible for us to know what the download timeout for a particular network should be, and we should make sure that users with poor connections that put them slightly over the limit can adjust it easily.

I suggest adding the download timeout under the existing Agent Binary Download Settings configuration:

[Screenshot: Agent Binary Download Settings, 2023-05-10]

elasticmachine commented 1 year ago

Pinging @elastic/fleet (Team:Fleet)

nimarezainia commented 1 year ago

@cmacknz thanks for this. Couple of questions: 1) is there a retry mechanism at all when a download has failed? 2) are the download rates stored in any data stream perhaps in the checkin payload so that UI could show this in the agent details page? I think we just need "last download took: bluh hrs" - "rate of download: xx"

cmacknz commented 1 year ago

> is there a retry mechanism at all when a download has failed?

Yes, this is controlled at the action level. I think there are something like 5 retries by default.

> are the download rates stored in any data stream perhaps in the checkin payload so that UI could show this in the agent details page? I think we just need "last download took: bluh hrs" - "rate of download: xx"

The download rate is only in the logs for now. We could include this in an upgrade specific data stream that has the state of all previous upgrades or something.
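
Until such a data stream exists, the rate can be pulled out of the log lines directly. The following is a rough sketch that parses messages shaped like the two examples earlier in this issue. The regular expression assumes the `MB` and `kBps` units seen there; other units may appear in real logs and would not match. `parse_progress` is a hypothetical helper, not an agent or Fleet API.

```python
import json
import re

# Matches e.g. "1.012MB/274.9MB (0.37% complete) @ 33.74kBps"
PROGRESS = re.compile(
    r"(?P<done>[\d.]+)MB/(?P<total>[\d.]+)MB "
    r"\((?P<pct>[\d.]+)% complete\) @ (?P<rate>[\d.]+)kBps"
)


def parse_progress(log_line: str):
    """Extract download progress numbers from one JSON log line, or None."""
    message = json.loads(log_line).get("message", "")
    match = PROGRESS.search(message)
    if match is None:
        return None
    return {key: float(value) for key, value in match.groupdict().items()}


sample = json.dumps({
    "message": "download progress from https://artifacts.elastic.co/downloads/"
               "beats/elastic-agent/elastic-agent-8.7.1-windows-x86_64.zip "
               "is 1.012MB/274.9MB (0.37% complete) @ 33.74kBps"
})
print(parse_progress(sample))
```

The same pattern also matches the "failed at ... @ ...kBps" message, so the last rate before a timeout can be recovered the same way.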

juliaElastic commented 1 year ago

From 8.9 we can configure this with the agent policy update API: https://github.com/elastic/kibana/issues/158699#issuecomment-1609143328

I think we should write a KB article on using the overrides API with the download timeout setting. See Link
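
As a starting point for such an article, here is a sketch of what the request might look like. It only builds the request and does not send it. The endpoint path, the `overrides` field layout, and the `kbn-xsrf` header are assumptions based on the linked comment and docs, and should be checked against the Fleet API reference for your version. Authentication is omitted, and `build_timeout_override_request` and the example values are hypothetical. Note that a later comment in this thread reports that the setting was not yet applied on the agent side.

```python
import json
import urllib.request


def build_timeout_override_request(kibana_url: str, policy_id: str,
                                   name: str, namespace: str,
                                   timeout: str) -> urllib.request.Request:
    """Build (but do not send) an agent policy update carrying a timeout override."""
    body = {
        "name": name,
        "namespace": namespace,
        # Assumed layout: mirrors agent.download.timeout from the standalone config.
        "overrides": {"agent": {"download": {"timeout": timeout}}},
    }
    return urllib.request.Request(
        url=f"{kibana_url.rstrip('/')}/api/fleet/agent_policies/{policy_id}",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json", "kbn-xsrf": "true"},
    )


request = build_timeout_override_request(
    "https://kibana.example.invalid", "my-policy-id", "My policy", "default", "3h"
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending it would additionally require credentials (for example an API key header) and a call such as `urllib.request.urlopen(request)`.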

jlind23 commented 7 months ago

Closing this as done thanks to https://github.com/elastic/kibana/pull/179795

juliaElastic commented 6 months ago

Reopening this issue because configuring the download timeout on the agent policy doesn't seem to apply on the agent side: https://github.com/elastic/ingest-dev/issues/2471#issuecomment-2046832307

We added this section to the docs: https://www.elastic.co/guide/en/fleet/current/enable-custom-policy-settings.html#configure-agent-download-timeout

We should either support this in agent, or remove from the docs until then.

elasticmachine commented 6 months ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

cmacknz commented 6 months ago

> We should either support this in agent, or remove from the docs until then.

Supporting this in agent is the answer. Users need to be able to increase the download timeout when their network requires it.

elasticmachine commented 6 months ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)