elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Fleet policy deployment on multiple Agents in a consecutive/sequenced one-by-one way #474

Open gemaliano opened 2 years ago

gemaliano commented 2 years ago

Describe the enhancement: Looking for a way to roll out a Fleet policy to multiple underlying Agents in a consecutive/sequenced way.

Describe a specific use case for the enhancement or feature: A Fleet policy configured on multiple Agents needs some changes, but those changes might break the Agents' health status. We would like the rollout to wait until each Agent is healthy again before moving on to the next.
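A minimal sketch of the requested behavior (Python; `apply_policy` and `agent_is_healthy` are hypothetical stand-ins for a Fleet API call and an Agent health probe, not existing APIs):

```python
import time

def rollout_sequenced(agents, policy, apply_policy, agent_is_healthy,
                      timeout=300, poll=5):
    """Apply `policy` to agents one at a time, waiting for each Agent to
    report healthy before moving on. Aborts on the first failure, so the
    remaining Agents keep running the old, known-good policy."""
    for agent in agents:
        apply_policy(agent, policy)  # hypothetical Fleet API call
        deadline = time.time() + timeout
        while not agent_is_healthy(agent):  # hypothetical health probe
            if time.time() > deadline:
                raise RuntimeError(
                    f"{agent} unhealthy after policy change; rollout stopped")
            time.sleep(poll)
    return True
```

The key property is that a failure stops the loop, rather than letting every Agent apply a potentially broken change in parallel.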

ph commented 2 years ago

@jen-huang @nimarezainia I think this should be moved to the kibana repository; this is some kind of canary deployment.

zez3 commented 2 years ago

The following might also be useful to solve first: https://github.com/elastic/elastic-agent/issues/100 https://github.com/elastic/elastic-agent/issues/113 https://github.com/elastic/elastic-agent/issues/120

Let me better describe the ER opened by Gema and our current use case:

3 (or more to come) Agents running on 3 physical hosts (with more-than-generous CPU cores and RAM each), using the same Fleet policy with multiple (20+) inputs (integrations).

If we change that already-applied policy (e.g. switch the output from the default to an output with a dead letter queue), the Agents do not wait for each other to finish the change and return to a healthy state. It might be that all of them transition to an unhealthy state because of a wrong custom input config or something else; we would like to avoid that. There is currently no sequencing or awareness option in Fleet or in the Agent: if a policy change fails on one Agent, the others should keep working (stay alive). The current unsequenced behavior makes sense when you have a huge number of Agents and do not care if some break, but not for this use case, where at least one must stay alive.

In our case, all inputs are defined on a front load balancer, because some network equipment (L3 switches) does not accept other syslog ports, so we need to configure our LB ports (e.g. 514 or 529) to forward to Agents in the back listening on higher ports. We have a basic health check, but ideally we would check the Agent's HTTP status endpoint; that is also not available/configurable in a Fleet policy.

@ruflin showed a very interesting possible feature/enhancement in one of his presentations: HighAvailability_over_multiple_Agents

I would like to have something similar available, but use our own LB to determine whether the Agents and their underlying inputs are alive. Not necessarily a single elected Agent, because we might distribute the load of multiple log/event sources (e.g. more than one firewall, syslog switches) across different Agents. The "move input" capability would indeed be helpful and desired, but again preferably steered by our LB health checks.
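As a rough illustration of the LB-driven check, a probe against an Agent's local monitoring HTTP endpoint might look like this (the port and path below are illustrative assumptions, not confirmed Agent defaults; substitute whatever endpoint your deployment actually exposes):

```python
import urllib.request

def check_agent(host, port=6791, path="/liveness", timeout=3):
    """Return True if the Agent's local monitoring endpoint answers
    with HTTP 200, False on any connection error or timeout.
    Port and path are assumptions for illustration."""
    try:
        url = f"http://{host}:{port}{path}"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False
```

A load balancer (or a small sidecar script feeding it) could call such a probe per backend Agent and pull unhealthy Agents out of rotation.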

nimarezainia commented 2 years ago

@zez3 you have noted some of the enhancements regarding the reporting of an integration's status, which are the first steps toward what's being requested here.

We will be looking at implementing more control over how the policy gets distributed. You can track this public issue: https://github.com/elastic/kibana/issues/108267

Right now, as you noted, the policy, when updated, is rolled out to all the agents assigned that policy. We have to define a way for a tranche of agents to get the update first, then another tranche. Would that be acceptable in this case, or do you think we need to update a tranche and wait for human intervention before carrying on to the next? (In some cases, as a platform, we wouldn't know if an input is operating correctly.)
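A tranche-based rollout with an optional human-confirmation gate could be sketched like this (Python; `update`, `all_healthy`, and `confirm` are hypothetical hooks standing in for Fleet API calls and operator input, not existing APIs):

```python
def tranches(agents, size):
    """Split the agent list into fixed-size tranches, updated in order."""
    return [agents[i:i + size] for i in range(0, len(agents), size)]

def rollout_in_tranches(agents, size, update, all_healthy, confirm=None):
    """Update one tranche, verify its health, then optionally wait for a
    human go-ahead before the next tranche. A failed tranche pauses the
    rollout, leaving later tranches on the old policy."""
    for batch in tranches(agents, size):
        update(batch)  # hypothetical Fleet API call
        if not all_healthy(batch):
            raise RuntimeError(f"tranche {batch} unhealthy; rollout paused")
        if confirm and not confirm(batch):
            return False  # operator declined to continue
    return True
```

Passing `confirm=None` gives the fully automatic variant; passing an operator-prompt callback gives the human-in-the-loop variant asked about above.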

zez3 commented 2 years ago

Right now, as you noted, the policy, when updated, is rolled out to all the agents assigned that policy. We have to define a way for a tranche of agents to get the update first, then another tranche. Would that be acceptable in this case, or do you think we need to update a tranche and wait for human intervention before carrying on to the next? (In some cases, as a platform, we wouldn't know if an input is operating correctly.)

I guess the update attempt (verification step) should perhaps be based on the OS, because the same integration (e.g. System, AV, others?) can be applied to Windows, Linux, and macOS. From my current point of view, the update could be attempted on only a single Agent (or a tranche, if there are 3 different OSes). Then wait/check, with the help of https://github.com/elastic/elastic-agent/issues/100, whether it was successful. If the attempt fails, stop applying the policy to the rest and announce to the user which exact integration failed. The faulty attempts happened to me mostly on custom log integrations, but recently (since 8.2) also on the output with dead letter. So I suppose it is not only the integration inputs that need to be checked for success. Do we also need a check feature for Agent outputs?
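The per-OS canary idea above could be sketched as follows (Python; the agent-to-OS mapping and the `apply_and_verify` hook are assumptions for illustration, the latter standing in for e.g. the status reporting discussed in elastic-agent issue #100):

```python
def canary_by_os(agents, apply_and_verify):
    """Pick one canary Agent per OS, try the policy there first, and
    report per-OS success. `agents` maps agent id -> OS name;
    `apply_and_verify` is a hypothetical hook returning True when the
    policy applied cleanly and the Agent stayed healthy."""
    canaries = {}
    for agent, os_name in agents.items():
        canaries.setdefault(os_name, agent)  # first Agent seen per OS
    ok = {}
    for os_name, agent in canaries.items():
        ok[os_name] = apply_and_verify(agent)
    # Roll out to the remaining Agents only for OSes where ok[os] is True;
    # for the others, stop and report which integration/output failed.
    return ok
```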

Regarding "as a platform" (not my case), perhaps it should fail silently, as it does now... I need more info to comment on this.

Also, the feature discussed here is perhaps not desired by everyone. I think it should/could be activated on demand, for specific policies or globally in Fleet.