elastic / kibana

Make Fleet kibana bulk action execution async #141567

Closed joshdover closed 1 year ago

joshdover commented 2 years ago

Currently, agent actions performed from the Fleet UI go through the Fleet API and are executed synchronously. This worked well for a small number of agents, but it does not scale. In 8.4, actions were optimized by introducing batching on the Fleet API side, so that actions are executed in 10k batches. This unblocked actions for up to 50-60k selections, but it will still hit the network timeout limit of 1 or 2 minutes, depending on configuration.

In order to support larger scales, we make action execution asynchronous, so that the execution is decoupled from the Fleet API call.

Changes:


Previous description:

In https://github.com/elastic/kibana/issues/133388 we updated all of Kibana's logic for creating bulk actions to use a batched approach, which creates a new action document for each batch of 10k agents in the full list of search results. This allows us to handle bulk actions on larger numbers of agents (>10k), but UI responsiveness still degrades as the number of agents a user is taking action on increases.
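For illustration, here is a minimal sketch of that batched approach, assuming the Elasticsearch JS client is used directly and a simplified .fleet-actions document shape (the real logic lives in the Kibana Fleet plugin and its schema differs):

import { randomUUID } from 'node:crypto';
import { Client } from '@elastic/elasticsearch';

const BATCH_SIZE = 10_000;

// Create one .fleet-actions document per batch of 10k agent IDs (sketch only;
// field names are simplified compared to the real Fleet implementation).
async function createBulkActionBatched(es: Client, type: string, agentIds: string[]) {
  const actionIds: string[] = [];
  for (let i = 0; i < agentIds.length; i += BATCH_SIZE) {
    const batch = agentIds.slice(i, i + BATCH_SIZE);
    const actionId = randomUUID();
    await es.index({
      index: '.fleet-actions',
      document: {
        action_id: actionId,
        type, // e.g. 'UPGRADE' or 'UNENROLL'
        agents: batch, // up to 10k agent IDs per document
        '@timestamp': new Date().toISOString(),
      },
    });
    actionIds.push(actionId);
  }
  return actionIds;
}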

Instead, we could move to a model where Kibana creates a single document for the bulk action which includes the query parameters for the matching agents and an Elasticsearch point-in-time finder. Fleet Server could then consume this document and run the query with the PIT finder to identify the same agents that the user selected in the UI for the bulk action.

This would keep the UI snappy at any scale and prevent problems like proxy timeouts from blocking bulk actions on very large numbers (100k+) of agents.
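A rough sketch of what the Kibana side of this model could look like; the agent_query and pit_id fields are hypothetical placeholders, not the actual schema:

import { randomUUID } from 'node:crypto';
import { Client } from '@elastic/elasticsearch';

// Kibana opens a PIT over .fleet-agents and stores a single action document
// containing the user's query instead of an explicit agent list (sketch only).
async function createQueryBasedAction(es: Client, type: string, agentQuery: object) {
  const pit = await es.openPointInTime({ index: '.fleet-agents', keep_alive: '10m' });
  const actionId = randomUUID();
  await es.index({
    index: '.fleet-actions',
    document: {
      action_id: actionId,
      type,
      agents: [], // empty: Fleet Server resolves the query later
      agent_query: agentQuery, // hypothetical field holding the ES query
      pit_id: pit.id, // hypothetical field holding the PIT id
      '@timestamp': new Date().toISOString(),
    },
  });
  return actionId;
}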

Challenges with this approach:

Related: https://github.com/elastic/kibana/pull/138870

joshdover commented 2 years ago

cc @juliaElastic would like to get your input on this since you worked on the recent Kibana changes

juliaElastic commented 2 years ago

@joshdover Here are my thoughts on this:

joshdover commented 2 years ago
  • What is the purpose of storing the PIT finder instead of Fleet Server opening it? One reason I can think of is to avoid including new agents in the action that are included in the filter.

It may not be strictly necessary, but without creating the PIT in Kibana there will be some edge cases where a different number of agents is selected for the action than the user saw in the UI. I'd be OK with doing this all in FS at first and seeing how much that edge case actually confuses users.

  • Fleet Server would need access to indices .fleet-actions, .fleet-agents, .kibana (for hosted policy check), and kibana version (for upgradeable check). Any challenges with this?

The .kibana access would be a problem, but I think we'd be able to remove that dependency once package_policy_id is included in the .fleet-policies index: https://github.com/elastic/security-team/issues/3918. That said there could be a better way to filter those agents out.

  • Would we keep the action execution on Fleet API side for single agent (or up to one page)? If so, one caveat is that the action execution logic would be duplicated on Fleet API and Fleet Server, and has to be kept in sync.

Good question, it's probably simpler to move everything to FS. Couldn't the query that Kibana includes in the action document just filter on a list of IDs?
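For example, an explicit selection could be expressed as a terms filter, so that single-agent and select-all flows share the same query-based path (the field name and IDs below are purely illustrative):

// An explicit per-page selection expressed as an ES query (illustrative only;
// the actual field holding the agent ID in .fleet-agents may differ).
const agentQuery = {
  bool: {
    filter: [{ terms: { agent_id: ['agent-id-1', 'agent-id-2'] } }],
  },
};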

  • Since the batch execution would become async, the UI needs a way to inform users when the action is completed. This could be done by storing the action results in a document, and the UI periodically checking the status (similarly to rolling upgrade).

Yep, agreed.
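A minimal sketch of what that polling could look like from the UI side; the endpoint path and response fields are assumptions, not the actual Fleet API:

// Poll an action's status until every targeted agent has reported a result
// (sketch only; the endpoint and response shape are assumed).
async function waitForActionCompletion(actionId: string, pollMs = 5000): Promise<void> {
  for (;;) {
    const res = await fetch(`/api/fleet/agents/action_status?actionId=${actionId}`);
    const { status, nbAgentsActioned, nbAgentsAck } = await res.json();
    if (status === 'COMPLETE' || nbAgentsAck >= nbAgentsActioned) return;
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
}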

joshdover commented 2 years ago

Discussed this with @juliaElastic today. Julia shared that she saw the current behavior take about 10s per 10k agents, meaning we're likely to start hitting proxy timeouts in Kibana somewhere around 50k-60k agents (60s).

We will need to prioritize this to be able to execute tests against 100k agents.

joshdover commented 2 years ago

I think this issue could potentially be worked on by engineers on Control Plane or Fleet UI. @jen-huang and @pierrehilbert should discuss ownership, depending on team capacity.

ph commented 2 years ago

I think this feature is really similar to our current implementation of batching for upgrade, or any other action that supports an agent list target. The only difference is that instead of receiving a list of IDs, you receive a point-in-time search query that returns the list of agents to target with a specific action.

I think that could be implemented as a really thin layer over our existing action model, so the system would behave like this:

  1. Fleet Server receives an action.
  2. Fleet Server detects a point-in-time query.
  3. Fleet Server executes the point-in-time query.
  4. Fleet Server updates the action with the list of agents. The action becomes a normal action.
  5. Fleet Server dispatches the action.

The drawback is that you could get pretty big documents in memory (36 bytes per UUID * 60,000 = ~2.1 MB), but we could optimize the fetching loop if this becomes a problem. There is already a limit in our system from ES, since it has a 100 MB limit per document.

Fleet Server would receive an initial document like this:

{
  "agents": [],
  "agent_query": "Query to execute on Elasticsearch"
}

After the query is executed, Fleet Server can close the point-in-time query right away, and the dispatch loop would receive this document:

{
  "agents": ["16be82be-c19c-4e44-b497-ae9ac8ccb053", "b35c6429-cf4b-4a13-98e5-331272a54742", ....]
}

Note: I only kept the relevant fields in the documents above.
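To make the flow concrete, here is a rough sketch of that thin layer (written in TypeScript for readability, even though Fleet Server itself is Go; the agent_query and pit_id fields are the hypothetical ones discussed above):

import { Client } from '@elastic/elasticsearch';

// Thin layer: if an action document carries a query instead of an agent list,
// resolve the query into concrete agent IDs and rewrite the document so the
// rest of the dispatch pipeline sees a normal action (sketch only).
async function resolveQueryAction(es: Client, action: any): Promise<void> {
  if (!action.agent_query || action.agents.length > 0) return; // already a normal action

  const agents: string[] = [];
  let searchAfter: any[] | undefined;
  for (;;) {
    const res = await es.search({
      query: action.agent_query,
      pit: { id: action.pit_id, keep_alive: '1m' },
      size: 10_000,
      sort: ['_shard_doc'],
      search_after: searchAfter,
      _source: false, // only the document IDs (agent IDs) are needed
    });
    const hits = res.hits.hits;
    if (hits.length === 0) break;
    agents.push(...hits.map((h) => h._id as string));
    searchAfter = hits[hits.length - 1].sort;
  }

  await es.closePointInTime({ id: action.pit_id }); // close the PIT right away
  await es.update({
    index: '.fleet-actions',
    id: action.action_id,
    doc: { agents, agent_query: null },
  });
}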

There are a few things we would need to consider now: we would probably need to guard what can be queried by Fleet Server. Also, I think we would need to expand our bulk logic to more than just the upgrade action.

How do we handle multiple Fleet Servers?

I think it's something we need to solve and verify outside of this work, because upgrade would work the same way.

@michel-laterman What do you think about this?

michel-laterman commented 2 years ago

I don't think there are any issues with that plan @ph.

The drawback is that you could get pretty big documents in memory (36 bytes per UUID * 60,000 = ~2.1 MB), but we could optimize the fetching loop if this becomes a problem. There is already a limit in our system from ES, since it has a 100 MB limit per document.

Are these UUIDs the agent IDs?

juliaElastic commented 2 years ago

Yes, those are agent ids.

I think it would be best to break up the execution into multiple batches, so we don't have a limit on how many agents can be actioned. We did it similarly in the Fleet API, so we could create multiple action documents, each with up to 10k agents.

Fleet Server dispatches the action.

@ph What do you mean by dispatching the action? Does it mean writing the changes to .fleet-actions index or something else?

How do we handle multiple Fleet Servers?

Is concurrency the main issue here? Could we store state on the action document with the PIT, to keep track of whether a Fleet Server has started executing the PIT query? This would prevent multiple Fleet Servers from picking up the same action. However, this is a general problem to solve for the existing upgrade action as well, as @ph mentioned.
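One way to implement such a claim is optimistic concurrency control on the action document, for example (a sketch; the claimed_by field is an assumption):

import { Client } from '@elastic/elasticsearch';

// Try to "claim" a query-based action before resolving it, using if_seq_no /
// if_primary_term so that only one Fleet Server wins (sketch, fields assumed).
async function tryClaimAction(es: Client, actionId: string, serverId: string): Promise<boolean> {
  const current = await es.get({ index: '.fleet-actions', id: actionId });
  if ((current._source as any)?.claimed_by) return false; // already claimed by another server
  try {
    await es.update({
      index: '.fleet-actions',
      id: actionId,
      if_seq_no: current._seq_no!,
      if_primary_term: current._primary_term!,
      doc: { claimed_by: serverId },
    });
    return true;
  } catch (err: any) {
    if (err?.meta?.statusCode === 409) return false; // lost the race to another server
    throw err;
  }
}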

What happens if a Fleet Server instance starts processing a bulk action that has a PIT finder that has already been closed?

juliaElastic commented 2 years ago

Linking test results with 15k agents for reference: https://github.com/elastic/kibana/pull/134565#issuecomment-1164046122

aleksmaus commented 2 years ago

Hi! I'm trying to understand the problem here and am most likely missing some history and/or context. So the problem is that selecting many agent IDs in Kibana is currently slow, and we are moving this logic to Fleet Server, correct? Assuming the query for the agent IDs returns only source.agent_id. Fleet Server has had some pretty strict requirements on memory consumption as well; we (mostly Sean) had to tune it in the past to be able to run in very small containers. So selecting a large amount of data into memory might not work for Fleet Server either.

Update: After some additional consideration, just thinking out loud, it looks like no matter what, we need to "resolve" the agent_ids based on some wider general criteria, like a query, by policy_id, by all agents, or by all agents except these few (example query shapes are sketched after the list below). The resolution can happen in one of:

  1. Kibana
  2. Fleet Server
  3. Elasticsearch

If Elasticsearch itself is not an option, then Kibana is probably the next best place.
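For illustration, each of those criteria can be expressed as an ordinary Elasticsearch query against .fleet-agents, whichever component ends up running it (field names here are assumptions):

// Illustrative query shapes for the "wider criteria" above (field names assumed).
const byQuery = { query_string: { query: 'policy_id:my-policy AND status:online' } };
const byPolicy = { term: { policy_id: 'my-policy-id' } };
const allAgents = { match_all: {} };
const allExceptAFew = {
  bool: { must_not: [{ terms: { agent_id: ['agent-id-1', 'agent-id-2'] } }] },
};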

Update 2: Another possible consideration about doing the agent ID "resolution" inside Fleet Server is that it would require some coordination with the other Fleet Servers, so that only one performs the resolution.

juliaElastic commented 2 years ago

@aleksmaus We are discussing the approach in the RFC; keeping the agent ID resolution logic in Kibana and moving it out of the API handler is an option. Do we have these Fleet Server memory constraints documented somewhere?

aleksmaus commented 2 years ago

@aleksmaus We are discussing the approach in the RFC; keeping the agent ID resolution logic in Kibana and moving it out of the API handler is an option. Do we have these Fleet Server memory constraints documented somewhere?

@scunningham had to optimize and add some configuration knobs for that back in the day. As far as I remember, he had some measurements and Fleet Server configuration recommendations depending on the number of agents the Fleet Server is going to serve. Sean, do you have these numbers anywhere?

juliaElastic commented 2 years ago

Do you mean this recommendation? I was aware of this. What we calculated is a few MB of agent IDs in memory per action, which didn't seem like a big impact on components of this size.

jen-huang commented 1 year ago

@juliaElastic Shall we close this as https://github.com/elastic/kibana/pull/138870 is merged?

juliaElastic commented 1 year ago

@jen-huang there is still an improvement I wanted to make, and I'm also planning to add more tests.

Also, I was thinking about the use cases of action validation errors, e.g. the agent is already assigned to the new policy, a hosted agent can't be unenrolled, or a host might not be upgradeable. When taking a bulk action, the agents failing validation do not get an action result, and the action stays "in progress" forever in Agent activity. To improve this, I think it would be great to either:

  1. Save an action result with an error message for those agents that failed validation, so the action would show up as (partially) failed in Agent activity.
  2. Filter out agents failing validation and do not include them in the total count actioned; this might leave users wondering why fewer agents were actioned than they clicked on.

WDYT @joshdover @kpollich ?

joshdover commented 1 year ago
  1. Save an action result with an error message for those agents that failed validation, so the action would show up as (partially) failed in Agent activity.

+1 on this. We should have a paper trail of what's happening in the system to be able to show this to the user and for our own debugging purposes.
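A sketch of what option 1 could look like, assuming a per-agent result document with an error field (the actual .fleet-actions-results mapping may differ):

import { Client } from '@elastic/elasticsearch';

// Option 1 sketch: write a per-agent result carrying the validation error so
// Agent activity can show the action as (partially) failed (names assumed).
async function writeValidationFailureResult(
  es: Client,
  actionId: string,
  agentId: string,
  reason: string
) {
  await es.index({
    index: '.fleet-actions-results',
    document: {
      action_id: actionId,
      agent_id: agentId,
      error: reason, // e.g. 'hosted agent cannot be unenrolled'
      '@timestamp': new Date().toISOString(),
    },
  });
}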

jlind23 commented 1 year ago

Closing as @juliaElastic fixed both issues.