Add ability to remotely restart an agent

jlind23 commented 2 years ago

There are some cases where a simple restart of an Agent may resolve common problems. Currently there's no way to do this remotely. In order to allow this action we should offer a new API endpoint that will be shipped under an experimental status for now. This endpoint should one of multiple Agent ID in order to operate a bulk restart if needed.

Depends on

[ ] https://github.com/elastic/fleet-server/issues/2523
[ ] elastic/elastic-agent#3367
[ ] https://github.com/elastic/ingest-dev/issues/1221

This is a two steps issue:

[ ] Allow this for a single Elastic Agent
[ ] Allow this for multiple Elastic Agent

elasticmachine commented 2 years ago

Pinging @elastic/fleet (Team:Fleet)

joshdover commented 1 year ago

Questions:

What will be the agent state after the user starts the restart? When does the state change back to healthy? What if it never successfully restarts?
We need to consider the potential impact on the user's Fleet or Elasticsearch cluster. It's possible that restarting all agents at once leads to a high volume of backlogged data being ingested. If ES performance is degraded, operating Fleet may not be possible.
- Ideally this is something the whole system can handle through back-pressure, but such a test has not been done with Fleet.
- Should we allow or require that users schedule bulk restarts with a maintenance window to avoid this, at least for more than X agents? Or warn them about the potential for high data volumes/instability?

nimarezainia commented 1 year ago

Closing in favour of https://github.com/elastic/ingest-dev/issues/1221

juliaElastic commented 1 year ago

@nimarezainia Is this issue intentionally reopened?

nimarezainia commented 1 year ago

@nimarezainia Is this issue intentionally reopened?

@juliaElastic this is the public issue. I had closed it in favor of the private one to reduce duplicates by mistake. We should close the public issue once the implementation is compete. hope this makes sense. the private issue has the bulk of the prioritization and implementation discussions.

joshdover commented 1 year ago

I think we're still not yet aligned on whether or not we want to support this at all. If we do support it, I think it should be an advanced action not exposed in the UI and we should have telemetry to track usage as ideally this isn't needed often.

ThomSwiss commented 1 year ago

We have currently 1150 agents out in our environment. The most of them send there data to on of two logstashes.

Each time, after a restart of logstash, all agents look to work fine, but some are not able so send data anymore. They are still visible as helthy. In kibana I couldn't find anything bad. But the didn't send data anymore. If I restart the elastic-agent, it works fine. That is the reason, why we I need this feature.

jlind23 commented 1 year ago

@amolnater-qasource As part of the Logstash test cases you run, is this included? If not, worth adding it then.

amolnater-qasource commented 1 year ago

Thank you for the update @jlind23

We have added a testcase where the Logstash is restarted when connected to the elastic-agent under Fleet test suite at link:

Validate on restarting logstash connected to an agent, agent restarts sending data under the Data Streams tab.

Please let us know if we are missing anything here. Thanks!

jlind23 commented 1 year ago

@amolnater-qasource csn you please check this as soon as possible? I want to check if we have a really bad problem here.

amolnater-qasource commented 1 year ago

@jlind23 We have revalidated this scenario on latest 8.10.0 BC2 kibana cloud environment and found this issue not reproducible there.

Observations:

On restarting logstash output, new data is generated for the connected agent after 10-15 seconds as soon as Logstash is up.

For reconfirming we tried several times to reproduce this, however the data resumed for the agent as soon as logstash service gets up.

Few other scenarios tried:

Restarted Elastic-Agent from services and then restarted logstash.
Stopped the agent till it went offline and then getting the host back up. After that restarting the logstash.
Set agent logs to debug level and then tried to restart logstash.

This issue isn't reproducible this way too.

Screen Recording: Before Restart:

https://github.com/elastic/kibana/assets/77374876/eb175e25-576d-40e5-ad1a-599515642a62

After Restart:

https://github.com/elastic/kibana/assets/77374876/7a2a6d92-91a6-4de9-b47b-5acb186d5b34

Build details: VERSION: 8.10.0 BUILD: 66107 BC2 COMMIT: fa3473f42d7c5e7a3c2d66026a153e01002f5d3c

Please let us know if anything else is required from our end.

Thanks!

amitkanfer commented 1 year ago

@ThomSwiss please let us know if we're running the tests in a different way, we're unable to reproduce. If this does reproduce for you, would be great to share your agent diagnostics files and we're happy to investigate further.

nimarezainia commented 1 year ago

@ThomSwiss also what version are you on?

ThomSwiss commented 1 year ago

@amitkanfer, @nimarezainia Thanks for your help!

We use the newest Agent version 8.9.1. We had the same issue also with older releases. We do not all releases, but I am sure that this was a problem with 8.7.x releases as well.

I try to do a query on the reached data to find out which clients don't send data anymore. Than I can run diagnostics on it. I hope to get this answer in the next 1-2 days.

ThomSwiss commented 1 year ago

I did a lot of tests the last 2 days. I can now tell you: Elastic Agent is correctly working after restart logstash. My

Testcase:

With a powershell script, runned many times
- Get all 984 Elastic Agents with status healthy, all Windows
- Count the number of records we received during the last 30 minutes/per Agent on dataview winlogbeat (includes logs-system.application,logs-system.security,logs-windows.powershell,logs-windows.powershell_operational and a view more)
- List all Agents that send less than 30 records during the last 30 minutes
Compare this lists during many runs
Restart the two logstashes that receives Elastic Agent input on port 5044
Result: We received also data after restarting logstash

I am sorry for my wrong post. But I am still unclear, when this happend in the past. I remember at least two occourencies in the last 2 years, where we had to restart all agents to get them back to send correctly data. I guess, sometimes it also helped when we just changed the fleet policy. So for example added or disabled Powershell logs in the windows integration. I have now this script and also my logs. I will check carefully, if this appears again and will come back, if I have details. Also with diagnostics.

Thanks for your work! Elastic is a great product.

jlind23 commented 1 year ago

@pierrehilbert @blakerouse Does Elastic Agent have a restart command than be sent down from fleet? Just like upgrade or any other actions?

pierrehilbert commented 1 year ago

From what I know, we don't have an action handler to restart the Agent. @blakerouse if you can keep me honest here

blakerouse commented 1 year ago

Correct. The Elastic Agent doesn't support that action.

ThomSwiss commented 1 year ago

Today I had a problem with an Elastic Agent Custom log integration: I did an error in the processor field (the Kibana Fleets GUI, didn't show me an error. I had two \ sign in a replace pattern ). I saved it successfully. Later the Agent changed to not healty.

I corrected the error. But the client did not change to healty. The message did not disapper. I waited at least 15 minutes. Then I restarted the elastic agent (had to login to the device). After restart, all was fine. If you are interested, I did a analytics before I restarted. This is a typical use case for a restart.

jlind23 commented 1 year ago

@nimarezainia updated the issue description following the chat we had. cc @kpollich for awareness

allamiro commented 1 year ago

Questions:

What will be the agent state after the user starts the restart? When does the state change back to healthy? What if it never successfully restarts?

We need to consider the potential impact on the user's Fleet or Elasticsearch cluster. It's possible that restarting all agents at once leads to a high volume of backlogged data being ingested. If ES performance is degraded, operating Fleet may not be possible.

Ideally this is something the whole system can handle through back-pressure, but such a test has not been done with Fleet.

Should we allow or require that users schedule bulk restarts with a maintenance window to avoid this, at least for more than X agents? Or warn them about the potential for high data volumes/instability?

This is my suggestion : I believe restricting the initiation of no more than 10 to 20 agents simultaneously could help bypass scheduling during a maintenance window. If there's a need to restart more than 20 agents, the system should prompt the admin to schedule a maintenance window outside of operational hours. When executing bulk restarts, the system shouldn't restart all agents simultaneously; instead, it should process them in batches of 20 to 30 at a time.

nimarezainia commented 1 year ago

This is my suggestion : I believe restricting the initiation of no more than 10 to 20 agents simultaneously could help bypass scheduling during a maintenance window. If there's a need to restart more than 20 agents, the system should prompt the admin to schedule a maintenance window outside of operational hours. When executing bulk restarts, the system shouldn't restart all agents simultaneously; instead, it should process them in batches of 20 to 30 at a time.

thanks for this information. Since we are providing this capability via an API only, wouldn't the logic you describe better be accomodated by the user's code that invokes this API?

zez3 commented 11 months ago

This would also help if the metricbeat or other beasts will contain memory leak bugs in the future.

msecpim commented 1 week ago

We do have a plus 30K agent infrastructure, with agents running also in remote locations. Utilising such an API would be of great advantage. Do you have any update on when that will become available?

ThomSwiss commented 1 week ago

We have plus 12K agents and sometimes have Agents that doesn't do anything. After the last windows patch, we had again to restart some agents, because they didn't run correctly, they just don't send data. After restart, all was fine.

nimarezainia commented 1 week ago

We have plus 12K agents and sometimes have Agents that doesn't do anything. After the last windows patch, we had again to restart some agents, because they didn't run correctly, they just don't send data. After restart, all was fine.

@ThomSwiss this shouldn't be happening and I consider it a bug. Could you open an support case with us if possible to the issue can be diagnosed. Not denying that this feature would be useful, just want to ensure the primary problem is addressed. Would be great to obtain the diagnostics file or any error that you see which agents produce which could give us a clue.

elastic / kibana

Add ability to remotely restart an agent #144585

Depends on