elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Fleet] Interrupting Agent Updates #178735

Open Harmlos opened 6 months ago

Harmlos commented 6 months ago

Describe the feature:

Fleet should provide a way to cancel an in-progress update for a single agent or for all agents.

Describe a specific use case for the feature:

Starting updates for many agents at once can consume significant network bandwidth. The agents attempt the update at different intervals, and each one shows an "updating" status until the update is installed.

Being able to cancel updates from the Fleet console would make it easier to manage the network load caused by agents and to cancel the occasional action started by mistake.

elasticmachine commented 6 months ago

Pinging @elastic/fleet (Team:Fleet)

nimarezainia commented 5 months ago

@Harmlos how many agents do you typically upgrade at the same time? You do have the ability to define a window of time for the upgrade to be scheduled in so that the upgrades are spread out in order to alleviate this network saturation event.

Harmlos commented 5 months ago

@nimarezainia The problem is that I don't know which network each computer is on. It could be on the main office network or on a remote branch network, where 10 computers are connected to the office over a very weak link. These are the peculiarities of our network architecture.

Given how the branch normally works, everything seems fine for the users and the company: email works.

Starting an update for even one agent affects that link. Moreover, that single agent cannot complete the download.

After one failed download, the same agent keeps retrying, which fully saturates the already weak link.

There is no way to detect this state automatically; it can only be found by reviewing the logs of the distribution server.

`192.168.67.149 - - [03/Apr/2024:13:53:29 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 209207041 "-" "Beat elastic-agent v8.9.2"

192.168.67.149 - - [03/Apr/2024:13:57:08 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 208322305 "-" "Beat elastic-agent v8.9.2"

192.168.67.149 - - [03/Apr/2024:14:02:08 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 207945473 "-" "Beat elastic-agent v8.9.2"

192.168.67.149 - - [03/Apr/2024:14:06:02 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 211287809 "-" "Beat elastic-agent v8.9.2"

192.168.67.149 - - [03/Apr/2024:14:13:52 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 208027393 "-" "Beat elastic-agent v8.9.2"

192.168.67.149 - - [03/Apr/2024:14:17:08 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 208109313 "-" "Beat elastic-agent v8.9.2"

192.168.67.149 - - [03/Apr/2024:14:20:32 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 210009857 "-" "Beat elastic-agent v8.9.2"

192.168.67.149 - - [03/Apr/2024:14:24:16 +0000] "GET /downloads/beats/elastic-agent/elastic-agent-8.12.2-windows-x86_64.zip HTTP/1.1" 200 209583873 "-" "Beat elastic-agent v8.9.2"`
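The manual log review described above could be partly automated. The following is a minimal sketch, assuming the distribution server writes access logs in the format shown in this comment; the function name `find_repeated_downloads` and the `min_attempts` threshold are illustrative and not part of Fleet or Elastic Agent.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches access-log lines like the ones quoted above (common/combined log format).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"GET (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def find_repeated_downloads(lines, min_attempts=3):
    """Return {(client_ip, path): [timestamps]} for clients that requested
    the same artifact at least `min_attempts` times."""
    attempts = defaultdict(list)
    for line in lines:
        # Tolerate stray whitespace or backticks left over from copy/paste.
        match = LOG_LINE.match(line.strip().strip("`"))
        if not match:
            continue  # skip blank or unrelated lines
        ts = datetime.strptime(match["ts"], "%d/%b/%Y:%H:%M:%S %z")
        attempts[(match["ip"], match["path"])].append(ts)
    return {
        key: sorted(times)
        for key, times in attempts.items()
        if len(times) >= min_attempts
    }


if __name__ == "__main__":
    import sys

    # Usage: python repeated_downloads.py < access.log
    for (ip, path), times in find_repeated_downloads(sys.stdin).items():
        print(f"{ip} requested {path} {len(times)} times "
              f"between {times[0]} and {times[-1]}")
```

A host that appears in this report is a candidate for the retry loop described above. The sketch only counts requests; it does not tell a completed download from an interrupted one.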

nimarezainia commented 5 months ago

@Harmlos We can provide a way to cancel actions (an upgrade, for example) that the agent hasn't acted on yet. This is mainly for cases where the admin realizes that there may be something wrong with the image and wants to stop its spread. However, for your use case I fear that this becomes a case of trial and error. You would be issuing an upgrade to a block of agents, monitoring your networks, and then potentially cancelling, then perhaps repeating the process. That seems disruptive.

Instead, pick a set number of agents and a large enough window, and know that Fleet will take care of upgrading them across that window of time. That is much more deterministic. In addition, you have the ability to set a future time for this to happen (during a maintenance window, perhaps).
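For readers scripting this, a scheduled rollout of this kind might be requested roughly as follows. This is a sketch under assumptions, not a verified client: the endpoint path and the `rollout_duration_seconds` and `start_time` fields reflect the Fleet bulk upgrade API as commonly documented and should be checked against the API reference for your Kibana version. The helper name, host, and API key are placeholders. The function only builds the request and does not send it.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_bulk_upgrade_request(kibana_url, api_key, agent_ids, version,
                               window_seconds, start_time=None):
    """Build (but do not send) a bulk upgrade request spread over a window.

    Assumed, to be verified against your Kibana version: the endpoint path
    and the `rollout_duration_seconds` / `start_time` body fields.
    """
    body = {
        "agents": list(agent_ids),
        "version": version,
        # Upgrades are spread across this many seconds.
        "rollout_duration_seconds": int(window_seconds),
    }
    if start_time is not None:
        if start_time.tzinfo is None:
            raise ValueError("start_time must be timezone-aware")
        # Send the scheduled start as a UTC ISO 8601 timestamp.
        body["start_time"] = start_time.astimezone(timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        )
    return urllib.request.Request(
        url=kibana_url.rstrip("/") + "/api/fleet/agents/bulk_upgrade",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "kbn-xsrf": "true",  # Kibana requires this header on write requests
            "Authorization": f"ApiKey {api_key}",
        },
    )


if __name__ == "__main__":
    request = build_bulk_upgrade_request(
        "https://kibana.example:5601",          # placeholder host
        "REPLACE_WITH_API_KEY",                 # placeholder credential
        ["agent-id-1", "agent-id-2"],           # placeholder agent IDs
        "8.12.2",
        window_seconds=4 * 3600,                # spread over four hours
        start_time=datetime(2024, 4, 6, 2, 0, tzinfo=timezone.utc),
    )
    print(request.full_url)
    print(request.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(request)
```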

Harmlos commented 5 months ago

We are trying various methods to initiate agent updates. One of the options is to use a standalone script to retrieve a list of agents and sequentially start updates for them.

The problem is that if issues are detected on an agent, there is no way to stop the update. The only option is to restart the agent on the host to stop it from being stuck in the update status.

It would be very convenient to have the ability to cancel agent updates with a button or via an API, for example. In this case, it would be possible to describe the update logic in an external script, and only use the API functions to start or cancel updates in case of issues on the host.
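The external-script workflow described here could look roughly like the sketch below. All three callbacks are hypothetical stand-ins: `start_upgrade` and `upgrade_ok` would wrap whatever calls the script already makes, and `cancel_upgrade` stands for the cancel capability this issue requests, which is assumed here rather than known to exist for in-progress upgrades.

```python
from typing import Callable, Iterable, List, Tuple


def sequential_rollout(
    agent_ids: Iterable[str],
    start_upgrade: Callable[[str], None],
    upgrade_ok: Callable[[str], bool],
    cancel_upgrade: Callable[[str], None],  # hypothetical: the requested feature
    max_failures: int = 1,
) -> Tuple[List[str], List[str]]:
    """Upgrade agents one at a time; cancel a failing upgrade and stop the
    rollout once `max_failures` upgrades have been cancelled."""
    upgraded: List[str] = []
    cancelled: List[str] = []
    for agent_id in agent_ids:
        if len(cancelled) >= max_failures:
            break  # too many problems: leave the remaining agents untouched
        start_upgrade(agent_id)
        if upgrade_ok(agent_id):
            upgraded.append(agent_id)
        else:
            # Instead of restarting the agent on the host, ask Fleet to cancel.
            cancel_upgrade(agent_id)
            cancelled.append(agent_id)
    return upgraded, cancelled


if __name__ == "__main__":
    # Dry run with fake callbacks: "branch-pc-2" simulates a stuck download.
    upgraded, cancelled = sequential_rollout(
        ["office-pc-1", "branch-pc-2", "office-pc-3"],
        start_upgrade=lambda agent: print("start", agent),
        upgrade_ok=lambda agent: agent != "branch-pc-2",
        cancel_upgrade=lambda agent: print("cancel", agent),
    )
    print("upgraded:", upgraded, "cancelled:", cancelled)
```

Keeping the start, check, and cancel steps as injected callbacks lets the rollout logic live entirely in the external script, with Fleet only exposing the start and cancel operations.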