jillguyonnet opened this issue 8 months ago
Pinging @elastic/fleet (Team:Fleet)
As mentioned in https://github.com/elastic/elastic-agent/pull/4005#discussion_r1447980001, the processes aggregation should use the field `component.id` instead of `elastic_agent.process`.
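For illustration, here is a minimal sketch (assuming the `metrics-elastic_agent.*` index pattern used by the query further down) that runs a `terms` aggregation on both fields so their bucket counts can be compared in the Console:

```
GET metrics-elastic_agent.*/_search
{
  "size": 0,
  "aggs": {
    // proposed: one bucket per component
    "by_component": {
      "terms": { "field": "component.id" }
    },
    // current: one bucket per process name
    "by_process": {
      "terms": { "field": "elastic_agent.process" }
    }
  }
}
```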
👍 FYI I reported a quick comparison of the `terms` aggregation of `component.id` vs. `elastic_agent.process` in https://github.com/elastic/sdh-beats/issues/4209#issuecomment-1880727961, which had the same output in this case.
I see the expected 5 component ids: `log-default`, `system/metrics-default`, `filestream-monitoring`, `beat/metrics-monitoring`, `http/metrics-monitoring`.
However, I can only see 3 component ids when querying for the agent memory, and aggregating over these yields the same results as aggregating over processes.
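A hedged sketch of that check, assuming the agent memory documents are the ones carrying `system.process.memory.size` (the field the Fleet query averages):

```
GET metrics-elastic_agent.*/_search
{
  "size": 0,
  "query": {
    // only documents that report process memory
    "exists": { "field": "system.process.memory.size" }
  },
  "aggs": {
    "by_component": {
      "terms": { "field": "component.id" }
    }
  }
}
```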
We are missing the monitoring components. `http/metrics-monitoring` would have to be reporting the metrics collected from itself, since it is doing the reporting for the others. It is possible we are not collecting stats for `filestream-monitoring` and `beat/metrics-monitoring`, which would be incorrect because they aren't free from a resource usage perspective. I will see if I can confirm that the agent is omitting these.
@cmacknz Were you able to confirm that these are omitted?
Yes, this needs a fix on the agent side.
Context
Following the investigation carried out for https://github.com/elastic/sdh-beats/issues/4209, the agent memory reported in Fleet's agent table and agent details appears to be roughly 3-4 times lower than its actual value. One comparison point is the memory reported by running `systemctl status elastic-agent`.

The first round of analysis (see details below) suggests that the current query used to calculate the total memory for the agent incorrectly aggregates separate Beat instances together.
Furthermore, the agent memory displayed in the `[Elastic Agent] Agent metrics` dashboard appears to be similarly undervalued (which is the original issue raised by https://github.com/elastic/sdh-beats/issues/4209). Since the query should be very similar, this should be fixed as well.

It is likely that the agent CPU, which is calculated from the same query, should also be corrected. Note that this metric has also been reported to have unrealistic values (https://github.com/elastic/sdh-beats/issues/3834) and there is an ongoing effort to document how it works (https://github.com/elastic/elastic-agent/pull/4005). It would make sense to do the same for agent memory (either as part of this issue or a followup documentation issue).
Details
Steps to reproduce

Compare the agent memory reported in Fleet with the memory reported by `systemctl` (from Multipass shell):

Analysis
The issue seems to arise from the query used to calculate the agent's memory and CPU. This query computes, for each agent, two values called `memory_size_byte_avg` and `cpu_avg`.

In plain words, this query aggregates over the processes of the Elastic Agent (elastic-agent, filebeat and metricbeat), takes the average of `system.process.memory.size` for each process, and then sums these averages together.
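A simplified sketch of that shape (not the exact Fleet query; the structure follows the description above). Because the `terms` buckets are keyed by process name, all instances of a given Beat fall into a single bucket and are averaged together rather than summed, which is consistent with the memory being underreported by roughly the number of instances per Beat:

```
GET metrics-elastic_agent.*/_search
{
  "size": 0,
  "aggs": {
    // one bucket per process name: elastic-agent, filebeat, metricbeat
    "processes": {
      "terms": { "field": "elastic_agent.process" },
      "aggs": {
        "memory_size_byte_avg": {
          "avg": { "field": "system.process.memory.size" }
        }
      }
    },
    // sums the per-bucket averages
    "total_memory": {
      "sum_bucket": { "buckets_path": "processes>memory_size_byte_avg" }
    }
  }
}
```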
The problem is that `elastic_agent.process` is not unique per Beat. For example, with a setup as described in the steps above, running `sudo elastic-agent status --output=full` shows that the `system` integration and monitoring run 3 metricbeat instances (`system/metrics-default`, `http/metrics-monitoring`, `beat/metrics-monitoring`) and 2 filebeat instances (`filestream-monitoring`, `log-default`):

Output of `elastic-agent status --output=full`
```yaml
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: 8d0b2d8a-b3b2-4fa1-8ca5-db5179bd856c
   │  ├─ version: 8.11.3
   │  └─ commit: f4f6fbb3e6c81f37cec57a3c244f009b14abd74f
   ├─ beat/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '1739'
   │  ├─ beat/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ beat/metrics-monitoring-metrics-monitoring-beats
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ filestream-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '1731'
   │  ├─ filestream-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ filestream-monitoring-filestream-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ http/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '1744'
   │  ├─ http/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ http/metrics-monitoring-metrics-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ log-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '1719'
   │  ├─ log-default
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ log-default-logfile-system-b2274470-459c-4c26-ade3-7ddce7f1c614
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   └─ system/metrics-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '1724'
      ├─ system/metrics-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      └─ system/metrics-default-system/metrics-system-b2274470-459c-4c26-ade3-7ddce7f1c614
         ├─ status: (HEALTHY) Healthy
         └─ type: INPUT
```

See also this comment for added context and details.
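For comparison, a sketch of the same aggregation keyed on `component.id`, which should yield one bucket per component (five in the setup above, once the agent-side omission discussed earlier is fixed), so that the three metricbeat instances are summed rather than collapsed into a single average:

```
GET metrics-elastic_agent.*/_search
{
  "size": 0,
  "aggs": {
    // one bucket per component id, e.g. system/metrics-default
    "components": {
      "terms": { "field": "component.id" },
      "aggs": {
        "memory_size_byte_avg": {
          "avg": { "field": "system.process.memory.size" }
        }
      }
    },
    // sums the per-component averages
    "total_memory": {
      "sum_bucket": { "buckets_path": "components>memory_size_byte_avg" }
    }
  }
}
```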
It is possible (and helpful) to play with the query in the Console in order to tweak the aggregation. Here is a simplified version (memory only):
Agent memory query
```
GET metrics-elastic_agent.*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "term": { "_tier": "data_hot" } },
        { "range": { "@timestamp": { "gte": "now-5m" } } },
        { "term": { "elastic_agent.id": "
```

Acceptance criteria