elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Stop collecting the beat state metricset as part of agent monitoring #4153

Closed cmacknz closed 5 months ago

cmacknz commented 8 months ago

Our agent monitoring implementation currently uses the beat Metricbeat module to monitor Beat subprocesses. We collect both the stats and state metricsets.

https://github.com/elastic/elastic-agent/blob/b39b9af521fcbf1fcae6bab14762b0a120febdb7/internal/pkg/agent/application/monitoring/v1_monitor.go#L617-L625

It seems to me that nothing actually uses the data from the state metricset. We don't map the fields in the Elastic Agent integration. I believe we can remove this metricset and stop pointlessly storing this data for every Beat process we start.

We currently store both the state and stats metricsets in the same data stream, and as such include the metricset name as a TSDB dimension, which could probably be removed after this change.

https://github.com/elastic/integrations/blob/a2c55c4cbf752e0490f9fe2d3e68698517c7b74d/packages/elastic_agent/data_stream/elastic_agent_metrics/fields/ecs.yml#L21-L23

- name: metricset.name
  type: keyword
  dimension: true

Acceptance Criteria:

elasticmachine commented 8 months ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

nimarezainia commented 5 months ago

@pchila thanks for your diligence on this issue. Would it be possible to have a benchmark of the savings we could expect from this change?

cc: @pierrehilbert

ycombinator commented 5 months ago

Reopening this issue as the second part of the acceptance criteria isn't actually done yet AFAICT:

> The data storage savings after removing this metricset are calculated and included in the release notes

Also related to @nimarezainia's question in the previous comment.

pchila commented 5 months ago

> @pchila thanks for your diligence on this issue. Would it be possible to have a benchmark of the savings we could expect from this change?

@cmacknz did a quick check on the data savings here on the PR https://github.com/elastic/elastic-agent/pull/4579#issuecomment-2060208711

I will re-run 2 versions of agent (with and without the change) and check the index size and document count
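A comparison like this needs per-run totals for document count and index size. The following is a minimal sketch, not the script actually used for this issue: it assumes the input rows come from Elasticsearch's `_cat/indices` API called with `format=json&bytes=b&h=index,docs.count,store.size`, where numeric values arrive as strings. The function name `summarize_indices` and the sample rows are hypothetical.

```python
from typing import Iterable, Mapping


def summarize_indices(rows: Iterable[Mapping[str, str]]) -> dict:
    """Sum document counts and store sizes over _cat/indices rows.

    Assumes each row has "docs.count" and "store.size" as numeric strings
    (store.size in bytes, which requires bytes=b on the request).
    """
    docs = 0
    size_bytes = 0
    for row in rows:
        docs += int(row["docs.count"])
        size_bytes += int(row["store.size"])
    return {"docs": docs, "size_bytes": size_bytes}


# Hypothetical rows for illustration only, not measurements from this issue.
sample = [
    {"index": "example-a", "docs.count": "120", "store.size": "4096"},
    {"index": "example-b", "docs.count": "80", "store.size": "2048"},
]
print(summarize_indices(sample))
```

Running this once per agent build, against the same index pattern and time window, gives the two totals to compare.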

ycombinator commented 5 months ago

> @cmacknz did a quick check on the data savings here on the PR #4579 (comment)
>
> I will re-run 2 versions of agent (with and without the change) and check the index size and document count

Thanks. Could you make a small PR to update https://github.com/elastic/elastic-agent/blob/fd7984b1d70dc968ba67fb8f4221905e508d6a06/changelog/fragments/1713257367-Remove-beat-state-metricset-from-elastic-agent-monitoring.yaml#L19 with these savings numbers?

pchila commented 5 months ago

@nimarezainia @ycombinator Re-measured index size difference between commit 1e88a9448f93499fea0e59672de9d6c80edc53c4 (commit just before the change) and commit 0d31445bfd5bdb108a5abf0b1cec4fe9fd3c3a1b (merge commit of the related PR) for a 10 min period after startup.

In both cases I used a policy that included the System Integration and agent logs and metrics collection.

Here are the sizes of the reindexed documents: [screenshot not preserved]

Document count for metrics-elastic_agent.filebeat-* and metrics-elastic_agent.metricbeat-disksize.baseline is down by 50% (as expected, since half of the metricsets were removed), with an on-disk size reduction of ~13% for both indices.

I am gonna put up a small PR with the changelog patching and link it to this issue
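For anyone repeating the comparison, the percentages above follow from a simple relative-difference calculation. This is a small sketch with a hypothetical helper name (`percent_reduction`) and made-up input values chosen only to reproduce the 50% and ~13% figures; the absolute numbers from the original screenshots are not available here.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return how much smaller `after` is than `before`, as a percentage."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Hypothetical values, not the measured ones from this issue.
doc_saving = percent_reduction(before=200, after=100)    # document count
size_saving = percent_reduction(before=1000, after=870)  # bytes on disk
print(f"docs: {doc_saving:.0f}%  size: {size_saving:.0f}%")
```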

cmacknz commented 5 months ago

In that same PR, can you add something under the doc directory describing how to reproduce these test results?

pchila commented 5 months ago

@cmacknz I used a script that is part of PR #4633 for extracting and reindexing logs and metrics, but it's not merged yet.

cmacknz commented 5 months ago

Sure, doesn't matter when or how it gets documented then, as long as we have a way to remember what we did if we want to re-evaluate this again later.

strawgate commented 5 months ago

Isn't the number of metrics produced dependent on the number of components running under agent? I.e. something like x documents per Beat per interval? So the % savings depends on the number of deployed integrations/managed Beats?

cmacknz commented 5 months ago

That is correct, yes: more complex configurations will see greater savings. I assume @pchila likely tested this with the default system integration installed; I will comment on the changelog entry.
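The scaling can be illustrated with a rough model. This sketch assumes one monitoring document per metricset per Beat per collection period, which is what the question above suggests but which this thread does not confirm; the function name `monitoring_docs`, the 60-second period, and the Beat counts are all hypothetical.

```python
def monitoring_docs(num_beats: int, metricsets: int, period_s: int, window_s: int) -> int:
    """Estimate Beat monitoring documents written during a time window.

    Assumption: each Beat emits one document per metricset per period.
    """
    collections = window_s // period_s
    return num_beats * metricsets * collections


WINDOW_S = 10 * 60  # 10 minute window, matching the test described above
PERIOD_S = 60       # hypothetical collection period

for beats in (2, 6):
    before = monitoring_docs(beats, metricsets=2, period_s=PERIOD_S, window_s=WINDOW_S)
    after = monitoring_docs(beats, metricsets=1, period_s=PERIOD_S, window_s=WINDOW_S)
    print(f"{beats} Beats: {before} -> {after} docs ({before - after} saved)")
```

Under this model the Beat monitoring document count always halves, while the absolute number of documents saved grows linearly with the number of Beats.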

pchila commented 5 months ago

@strawgate @cmacknz I edited my comment to clarify which policy I used for the test. This is why I expressed the savings as a percentage: the absolute numbers will scale with the number of impacted indices.

elasticmachine commented 5 months ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)