Open cmacknz opened 1 year ago
Pinging @elastic/fleet (Team:Fleet)
@cmacknz thanks for this. Couple of questions: 1) is there a retry mechanism at all when a download has failed? 2) are the download rates stored in any data stream perhaps in the checkin payload so that UI could show this in the agent details page? I think we just need "last download took: bluh hrs" - "rate of download: xx"
is there a retry mechanism at all when a download has failed?
Yes, this is controlled at the action level. I think there are something like 5 retries by default.
are the download rates stored in any data stream perhaps in the checkin payload so that UI could show this in the agent details page? I think we just need "last download took: bluh hrs" - "rate of download: xx"
The download rate is only in the logs for now. We could include this in an upgrade specific data stream that has the state of all previous upgrades or something.
From 8.9 we can configure this with the agent policy update API: https://github.com/elastic/kibana/issues/158699#issuecomment-1609143328
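For illustration, here is a minimal sketch of what such a policy update request might look like. The endpoint path and the `overrides.agent.download.timeout` body shape are assumptions based on the linked comment and the docs section referenced later in this thread, not something verified here. The policy ID, name, and timeout value are placeholders. The snippet only builds the request, it does not send it.

```python
import json


def build_timeout_override(policy_id: str, name: str, namespace: str, timeout: str):
    """Build the path and JSON body for a hypothetical agent policy update.

    Assumption: the Fleet agent policy update API accepts an "overrides"
    object whose contents are merged into the agent configuration.
    """
    path = f"/api/fleet/agent_policies/{policy_id}"
    body = {
        "name": name,
        "namespace": namespace,
        # Assumed override shape, mirroring the agent.download.timeout setting.
        "overrides": {"agent": {"download": {"timeout": timeout}}},
    }
    return path, body


# Placeholder values, not taken from a real deployment.
path, body = build_timeout_override("my-policy-id", "Test policy", "default", "3h")
print("PUT", path)
print(json.dumps(body, indent=2))
```

Note that, per the later comments in this thread, setting this on the policy may not take effect on the agent side until the agent supports it.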
I think we should write a KB article on using the overrides API with the download timeout setting. See Link
Closing this as done thanks to https://github.com/elastic/kibana/pull/179795
Reopening this issue because configuring the download timeout on the agent policy doesn't seem to apply on the agent side: https://github.com/elastic/ingest-dev/issues/2471#issuecomment-2046832307 We added this section to the docs: https://www.elastic.co/guide/en/fleet/current/enable-custom-policy-settings.html#configure-agent-download-timeout We should either support this in agent, or remove from the docs until then.
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
We should either support this in agent, or remove from the docs until then.
Supporting this in agent is the answer. Users need to be able to increase the download timeout if needed.
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
In https://github.com/elastic/elastic-agent/pull/1666 we increased the default agent upgrade artifact download timeout from 10 minutes to 2 hours. This was done because we were observing upgrade attempts that time out due to poor network conditions.
I recently observed a failed upgrade attempt from a connection with a download speed as low as 33.74kBps. With a 274.9MB download size this would take 274.9MB / 33.74kBps ≈ 8148 seconds ≈ 136 minutes, or about 2 hours 16 minutes, to complete. At this speed the increased 2 hour limit is still not enough.
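The estimate can be reproduced with a short calculation. This is a sketch only: the function name and the choice of unit conventions are assumptions made for illustration. With decimal units (1 MB = 1000 kB) the transfer takes about 136 minutes, and with binary units (1 MB = 1024 kB) about 139 minutes. Either way it exceeds the 2 hour timeout.

```python
def download_minutes(size_mb: float, rate_kb_per_s: float, kb_per_mb: int = 1000) -> float:
    """Return the estimated transfer time in minutes for a constant rate."""
    seconds = size_mb * kb_per_mb / rate_kb_per_s
    return seconds / 60


TIMEOUT_MINUTES = 120  # the 2 hour default discussed in this issue

decimal = download_minutes(274.9, 33.74)       # assumes 1 MB = 1000 kB
binary = download_minutes(274.9, 33.74, 1024)  # assumes 1 MB = 1024 kB

for label, minutes in (("decimal", decimal), ("binary", binary)):
    over = minutes > TIMEOUT_MINUTES
    print(f"{label} units: {minutes:.1f} min, exceeds timeout: {over}")
```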
For a standalone agent the upgrade download timeout can be controlled with the agent.download.timeout parameter: https://github.com/elastic/elastic-agent/blob/29bf0ab601895fc498abf8724fc1918783956de5/elastic-agent.reference.yml#L70-L77
For Fleet managed agents we should allow configuring the download timeout to use. It is not possible for us to know what the download timeout for a particular network should be, and we should make sure users with poor connections that are slightly over the limit can adjust this easily.
I suggest adding the download timeout under the existing Agent Binary Download Settings configuration: