[Fleet] Optimize /agents/agent_status API

joshdover commented 2 years ago

We currently run a separate query against the agent index for each agent status to produce the counts reported by the /agents/agent_status API. When this index is under heavy ingest load, these queries can become expensive and slow: https://github.com/elastic/kibana/blob/bef0d6a8d37b633d22fe1896005f47cef547f3b0/x-pack/plugins/fleet/server/services/agents/status.ts#L78-L104

We should explore optimizations to this API, such as:

- First trying any low-hanging fruit, such as:
  - Removing the `sort` property when `perPage` is 0 (we don't need the results). ES may already optimize this, though.
  - Removing the `track_total_hits` and `rest_total_hits_as_int` fields, since we're already paging through all of the results.
- Doing a single query for agents and then doing the bucketing client side in Kibana.
- Using a runtime field (or potentially an indexed field) to calculate the status, then running a single terms aggregation on that field rather than a separate query for each status.
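A minimal TypeScript sketch of the last two options, under loud assumptions: the agent document is reduced to hypothetical fields (`active`, `last_checkin`, `unenrollment_started_at`), the 5-minute offline threshold is made up, and the status rules are simplified. Fleet's real logic in status.ts is more involved. `countByStatus` buckets agents fetched by one query; `buildStatusAggRequest` builds a search body that computes the same status in a `runtime_mappings` keyword field and counts it with a single terms aggregation.

```typescript
// Hypothetical, simplified agent document; real Fleet agent docs differ.
interface AgentDoc {
  active: boolean;
  last_checkin?: string; // ISO-8601 timestamp
  unenrollment_started_at?: string;
}

type AgentStatus = 'online' | 'offline' | 'unenrolling' | 'inactive';

// Assumed threshold for illustration only, not Fleet's actual value.
const OFFLINE_AFTER_MS = 5 * 60 * 1000;

// Simplified status rules (assumption), evaluated in Kibana.
function deriveStatus(agent: AgentDoc, nowMs: number): AgentStatus {
  if (!agent.active) return 'inactive';
  if (agent.unenrollment_started_at) return 'unenrolling';
  if (!agent.last_checkin) return 'offline';
  const lastCheckinMs = Date.parse(agent.last_checkin);
  return nowMs - lastCheckinMs > OFFLINE_AFTER_MS ? 'offline' : 'online';
}

// Option: one query for all agents, then bucket client side.
function countByStatus(
  agents: ReadonlyArray<AgentDoc>,
  nowMs: number
): Record<AgentStatus, number> {
  const counts: Record<AgentStatus, number> = {
    online: 0,
    offline: 0,
    unenrolling: 0,
    inactive: 0,
  };
  for (const agent of agents) {
    counts[deriveStatus(agent, nowMs)] += 1;
  }
  return counts;
}

// Option: runtime field + a single terms aggregation, same assumed rules.
// "now" is passed as a script param so every document is evaluated
// against the same reference time.
function buildStatusAggRequest(nowMs: number) {
  return {
    size: 0, // only the aggregation buckets are needed, not the hits
    runtime_mappings: {
      status: {
        type: 'keyword',
        script: {
          lang: 'painless',
          source: `
            if (doc['active'].size() == 0 || !doc['active'].value) {
              emit('inactive');
            } else if (doc['unenrollment_started_at'].size() > 0) {
              emit('unenrolling');
            } else if (doc['last_checkin'].size() == 0) {
              emit('offline');
            } else {
              long last = doc['last_checkin'].value.toInstant().toEpochMilli();
              emit(params.now - last > params.offlineAfterMs ? 'offline' : 'online');
            }`,
          params: { now: nowMs, offlineAfterMs: OFFLINE_AFTER_MS },
        },
      },
    },
    aggs: {
      status: { terms: { field: 'status', size: 10 } },
    },
  };
}
```

Keeping identical rules in both variants makes their counts directly comparable during benchmarking. The client-side variant still transfers every agent document to Kibana; the runtime-field variant returns only a few buckets but evaluates the script on every matching document at query time.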

We should benchmark these optimizations to confirm they improve performance, especially when many agents are running (10k+) and a bulk action (such as a policy change) is in progress at the same time.
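A minimal sketch of a timing harness for such a benchmark, assuming the caller supplies `run`, an async function that performs one status request (the `benchmark` and `percentile` helpers are hypothetical, not existing Fleet or Kibana code). It does a few warm-up calls, times each run with `Date.now()` (millisecond resolution, adequate for HTTP round trips), and reports nearest-rank percentiles.

```typescript
interface BenchmarkResult {
  runs: number;
  minMs: number;
  medianMs: number;
  p95Ms: number;
  maxMs: number;
}

// Nearest-rank percentile on an ascending-sorted, non-empty array.
function percentile(sorted: readonly number[], p: number): number {
  const rank = Math.ceil((p / 100) * sorted.length);
  const clamped = Math.min(Math.max(rank, 1), sorted.length);
  return sorted[clamped - 1];
}

// Calls `run` sequentially: `warmup` untimed calls, then `runs` timed calls.
async function benchmark(
  run: () => Promise<unknown>,
  runs: number,
  warmup = 2
): Promise<BenchmarkResult> {
  if (runs < 1) throw new Error('runs must be >= 1');
  for (let i = 0; i < warmup; i++) {
    await run();
  }
  const timings: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = Date.now();
    await run();
    timings.push(Date.now() - start);
  }
  timings.sort((a, b) => a - b);
  return {
    runs,
    minMs: timings[0],
    medianMs: percentile(timings, 50),
    p95Ms: percentile(timings, 95),
    maxMs: timings[timings.length - 1],
  };
}
```

To compare the current per-status queries against the alternatives, the same harness could be run once per variant, both on an idle cluster and while a bulk action is updating the agent index; tail latency (p95) is likely the more telling number under ingest load.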

elasticmachine commented 2 years ago

Pinging @elastic/fleet (Team:Fleet)

joshdover commented 2 years ago

We need to evaluate the impact here before scheduling any work.

elastic / kibana

[Fleet] Optimize /agents/agent_status API #136308