elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Fleet] Fix agent memory query #174458

Open jillguyonnet opened 8 months ago

jillguyonnet commented 8 months ago

Context

Following the investigation carried out for https://github.com/elastic/sdh-beats/issues/4209, the agent memory reported in Fleet's agent table and agent details page appears to be roughly 3-4 times lower than its actual value. One point of comparison is the memory reported by running systemctl status elastic-agent.

The first round of analysis (see details below) suggests that the current query used to calculate the total memory for the agent incorrectly aggregates separate Beat instances together.

Furthermore, the agent memory displayed in the [Elastic Agent] Agent metrics dashboard appears to be similarly underreported (this is the original issue raised in https://github.com/elastic/sdh-beats/issues/4209). Since the query should be very similar, it should be fixed as well.

It is likely that the agent CPU value, which is calculated from the same query, should also be corrected. Note that this metric has also been reported to show unrealistic values (https://github.com/elastic/sdh-beats/issues/3834), and there is an ongoing effort to document how it works (https://github.com/elastic/elastic-agent/pull/4005). It would make sense to do the same for agent memory (either as part of this issue or in a follow-up documentation issue).

Details

Steps to reproduce

1. Run an Elastic stack with a Fleet server and enroll an agent (the easiest route might be Multipass):
   ```
   multipass launch --name agent1 --disk 10G
   multipass shell agent1
   # enroll the agent with the commands listed in Kibana (replace x86_64 with arm64 if needed)
   ```
2. Once the agent is started, measure its memory with systemctl (from the Multipass shell):
   ```
   systemctl status elastic-agent
   ```
3. Compare this value with the one reported in Fleet's agent table and on the agent's details page: the systemctl value should be between 3 and 4 times higher.
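
For a closer look at the raw data behind Fleet's number, the per-process memory documents for the enrolled agent can also be inspected directly in Kibana's Console. This is a minimal sketch using the same fields and indices as the query discussed below; the agent id placeholder and the 5-minute window are illustrative:

```
GET metrics-elastic_agent.*/_search
{
  "size": 20,
  "_source": [
    "@timestamp",
    "elastic_agent.process",
    "component.id",
    "system.process.memory.size"
  ],
  "query": {
    "bool": {
      "must": [
        { "term": { "elastic_agent.id": "<agent id>" } },
        { "term": { "data_stream.dataset": "elastic_agent.elastic_agent" } },
        { "range": { "@timestamp": { "gte": "now-5m" } } }
      ]
    }
  },
  "sort": [{ "@timestamp": "desc" }]
}
```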

Analysis

The issue seems to arise from the query used to calculate the agent's memory and CPU. This query computes, for each agent, two values called memory_size_byte_avg and cpu_avg.

In plain words, this query aggregates over the processes of the Elastic Agent (elastic-agent, filebeat and metricbeat), takes the average of system.process.memory.size for each process, and then sums these averages together.

The problem is that elastic_agent.process is not unique per Beat instance. For example, with the setup described in the steps above, running sudo elastic-agent status --output=full shows that the system integration and agent monitoring run 3 metricbeat instances (system/metrics-default, http/metrics-monitoring, beat/metrics-monitoring) and 2 filebeat instances (filestream-monitoring, log-default):

Output of `elastic-agent status --output=full`:

```yaml
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: 8d0b2d8a-b3b2-4fa1-8ca5-db5179bd856c
   │  ├─ version: 8.11.3
   │  └─ commit: f4f6fbb3e6c81f37cec57a3c244f009b14abd74f
   ├─ beat/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '1739'
   │  ├─ beat/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ beat/metrics-monitoring-metrics-monitoring-beats
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ filestream-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '1731'
   │  ├─ filestream-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ filestream-monitoring-filestream-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ http/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '1744'
   │  ├─ http/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ http/metrics-monitoring-metrics-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ log-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '1719'
   │  ├─ log-default
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ log-default-logfile-system-b2274470-459c-4c26-ade3-7ddce7f1c614
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   └─ system/metrics-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '1724'
      ├─ system/metrics-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      └─ system/metrics-default-system/metrics-system-b2274470-459c-4c26-ade3-7ddce7f1c614
         ├─ status: (HEALTHY) Healthy
         └─ type: INPUT
```

Because these instances share the same elastic_agent.process value, the avg aggregation effectively reports the memory of a single instance of each Beat rather than the sum across all of its instances, which is consistent with the 3-4x undercount. See also this comment for added context and details.

It is possible (and helpful) to play with the query in the Console in order to tweak the aggregation. Here is a simplified version (memory only):

Agent memory query:

```
GET metrics-elastic_agent.*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "term": { "_tier": "data_hot" } },
        { "range": { "@timestamp": { "gte": "now-5m" } } },
        { "term": { "elastic_agent.id": "" } },
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "should": [
                    { "term": { "data_stream.dataset": "elastic_agent.elastic_agent" } }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "agents": {
      "terms": { "field": "elastic_agent.id" },
      "aggs": {
        "sum_memory_size": {
          "sum_bucket": { "buckets_path": "processes>avg_memory_size" }
        },
        "processes": {
          "terms": { "field": "elastic_agent.process" },
          "aggs": {
            "avg_memory_size": {
              "avg": { "field": "system.process.memory.size" }
            }
          }
        }
      }
    }
  }
}
```

Acceptance criteria

### Tasks
- [ ] Fix the query builder for agent memory
- [ ] Assess the impact of the fix on the agent CPU value
- [ ] Fix the query used by the `[Elastic Agent] Agent metrics` dashboard for agent memory and, if relevant, for agent CPU
- [ ] Follow up on documentation of agent CPU (https://github.com/elastic/elastic-agent/pull/4005) + file a similar PR or issue for agent memory (consider whether these are discoverable enough for the Fleet team)

elasticmachine commented 8 months ago

Pinging @elastic/fleet (Team:Fleet)

ycombinator commented 8 months ago

As mentioned in https://github.com/elastic/elastic-agent/pull/4005#discussion_r1447980001, the processes aggregation should use the field component.id instead of elastic_agent.process.
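
Applied to the simplified Console query above, that suggestion amounts to switching the terms aggregation to component.id, roughly like this (a sketch only; the components bucket name is illustrative and the query filters are trimmed for brevity):

```
GET metrics-elastic_agent.*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "term": { "elastic_agent.id": "<agent id>" } },
        { "range": { "@timestamp": { "gte": "now-5m" } } }
      ]
    }
  },
  "aggs": {
    "agents": {
      "terms": { "field": "elastic_agent.id" },
      "aggs": {
        "sum_memory_size": {
          "sum_bucket": { "buckets_path": "components>avg_memory_size" }
        },
        "components": {
          "terms": { "field": "component.id" },
          "aggs": {
            "avg_memory_size": {
              "avg": { "field": "system.process.memory.size" }
            }
          }
        }
      }
    }
  }
}
```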

jillguyonnet commented 8 months ago

> As mentioned in elastic/elastic-agent#4005 (comment), the processes aggregation should use the field component.id instead of elastic_agent.process.

👍 FYI I reported a quick comparison of the terms aggregation of component.id vs. elastic_agent.process in https://github.com/elastic/sdh-beats/issues/4209#issuecomment-1880727961, which had the same output in this case.
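
For reference, a comparison of that kind can be sketched by running both terms aggregations in a single request (bucket names are illustrative, the agent id is a placeholder):

```
GET metrics-elastic_agent.*/_search
{
  "size": 0,
  "query": { "term": { "elastic_agent.id": "<agent id>" } },
  "aggs": {
    "by_component_id": { "terms": { "field": "component.id", "size": 20 } },
    "by_process": { "terms": { "field": "elastic_agent.process", "size": 20 } }
  }
}
```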

cmacknz commented 8 months ago

I see the expected 5 component ids: log-default, system/metrics-default, filestream-monitoring, beat/metrics-monitoring, http/metrics-monitoring.

However, I can only see 3 component ids when querying for the agent memory, and aggregating over these yields the same results as aggregating over processes.

We are missing the monitoring components. http/metrics-monitoring would have to report the metrics collected from itself, since it does the reporting for the others. It is possible that we are not collecting stats for filestream-monitoring and beat/metrics-monitoring, which would be incorrect because they aren't free from a resource usage perspective. I will see if I can confirm that the agent is omitting these.
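
One way to check this from the Kibana side (a sketch under the same assumptions as the queries above) is to restrict the search to documents that actually carry a memory value and list the component ids that remain:

```
GET metrics-elastic_agent.*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "term": { "elastic_agent.id": "<agent id>" } },
        { "exists": { "field": "system.process.memory.size" } }
      ]
    }
  },
  "aggs": {
    "components_reporting_memory": {
      "terms": { "field": "component.id", "size": 20 }
    }
  }
}
```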

jen-huang commented 8 months ago

@cmacknz Were you able to confirm that these are omitted?

cmacknz commented 8 months ago

Yes, this needs a fix on the agent side.