elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

[Elastic Agent] Sending monitoring metrics to logs datastream #26758

Closed mostlyjason closed 3 years ago

mostlyjason commented 3 years ago

It looks like Elastic Agent is sending its own monitoring metrics to the logs datastream. This is not useful as a log message. It should be sending this information to the metrics datastream instead.

Example event:

{
  "_index": ".ds-logs-elastic_agent.metricbeat-default-2021.07.07-000196",
  "_type": "_doc",
  "_id": "DWZtgXoBysd6RRjof1WI",
  "_version": 1,
  "_score": null,
  "fields": {
    "elastic_agent.version": [
      "7.13.2"
    ],
    "monitoring.metrics.beat.cpu.system.ticks": [
      2191340
    ],
    "monitoring.metrics.metricbeat.system.socket.events": [
      8
    ],
    "monitoring.metrics.metricbeat.system.cpu.events": [
      3
    ],
    "host.hostname": [
      "unifi"
    ],
    "host.mac": [
    ],
    "monitoring.metrics.metricbeat.system.network_summary.events": [
      3
    ],
    "monitoring.metrics.libbeat.output.write.bytes": [
      382478
    ],
    "monitoring.metrics.metricbeat.system.network_summary.success": [
      3
    ],
    "host.os.version": [
      "18.04.5 LTS (Bionic Beaver)"
    ],
    "monitoring.metrics.metricbeat.system.service.events": [
      162
    ],
    "agent.name": [
      "unifi"
    ],
    "monitoring.metrics.beat.info.uptime.ms": [
      162121222
    ],
    "monitoring.metrics.metricbeat.system.service.success": [
      162
    ],
    "monitoring.metrics.beat.memstats.memory_alloc": [
      19120104
    ],
    "host.os.type": [
      "linux"
    ],
    "monitoring.metrics.metricbeat.system.entropy.success": [
      3
    ],
    "monitoring.metrics.metricbeat.system.raid.failures": [
      3
    ],
    "monitoring.metrics.metricbeat.system.uptime.success": [
      3
    ],
    "input.type": [
      "log"
    ],
    "monitoring.metrics.libbeat.pipeline.clients": [
      17
    ],
    "agent.hostname": [
      "unifi"
    ],
    "monitoring.metrics.libbeat.pipeline.events.total": [
      248
    ],
    "monitoring.metrics.libbeat.pipeline.events.active": [
      0
    ],
    "host.architecture": [
      "x86_64"
    ],
    "monitoring.metrics.metricbeat.system.socket.success": [
      8
    ],
    "agent.id": [
      "0d376a06-dccc-4fb9-94ea-d8c5ecea380a"
    ],
    "host.containerized": [
      false
    ],
    "monitoring.metrics.system.load.norm.15": [
      0.11
    ],
    "monitoring.metrics.beat.cpu.total.value": [
      4682010
    ],
    "monitoring.metrics.metricbeat.system.load.events": [
      3
    ],
    "log.logger": [
      "monitoring"
    ],
    "monitoring.metrics.libbeat.pipeline.queue.acked": [
      248
    ],
    "host.ip": [
    ],
    "agent.type": [
      "filebeat"
    ],
    "monitoring.metrics.metricbeat.system.diskio.events": [
      21
    ],
    "monitoring.metrics.beat.handles.open": [
      19
    ],
    "monitoring.metrics.metricbeat.system.diskio.success": [
      21
    ],
    "monitoring.metrics.beat.cpu.total.ticks": [
      4682010
    ],
    "elastic_agent.snapshot": [
      false
    ],
    "host.id": [
      "a4719423bdb94e1e80df8ea652c5dd59"
    ],
    "monitoring.metrics.libbeat.output.events.active": [
      0
    ],
    "monitoring.metrics.system.load.5": [
      0.12
    ],
    "monitoring.metrics.beat.memstats.memory_total": [
      613619360744
    ],
    "elastic_agent.id": [
      "3cace087-149e-4507-b0ee-5e6a3afdf14a"
    ],
    "monitoring.metrics.metricbeat.system.process.success": [
      18
    ],
    "host.os.codename": [
      "bionic"
    ],
    "monitoring.metrics.system.load.1": [
      0.27
    ],
    "monitoring.metrics.beat.memstats.rss": [
      39579648
    ],
    "monitoring.metrics.metricbeat.system.load.success": [
      3
    ],
    "log.origin.file.name": [
      "log/log.go"
    ],
    "@timestamp": [
      "2021-07-07T14:44:33.126Z"
    ],
    "host.os.platform": [
      "ubuntu"
    ],
    "log.file.path": [
      "/opt/Elastic/Agent/data/elastic-agent-686ba4/logs/default/metricbeat-json.log"
    ],
    "data_stream.dataset": [
      "elastic_agent.metricbeat"
    ],
    "agent.ephemeral_id": [
      "8361cc5c-d090-45a6-a564-3ccc134c261b"
    ],
    "monitoring.metrics.metricbeat.system.memory.success": [
      3
    ],
    "monitoring.metrics.metricbeat.system.uptime.events": [
      3
    ],
    "monitoring.metrics.metricbeat.system.network.events": [
      12
    ],
    "monitoring.metrics.metricbeat.system.cpu.success": [
      3
    ],
    "monitoring.metrics.libbeat.pipeline.events.published": [
      248
    ],
    "monitoring.metrics.beat.memstats.gc_next": [
      21159136
    ],
    "monitoring.metrics.metricbeat.system.process_summary.success": [
      3
    ],
    "host.os.name": [
      "Ubuntu"
    ],
    "log.level": [
      "info"
    ],
    "monitoring.metrics.beat.runtime.goroutines": [
      112
    ],
    "host.name": [
      "unifi"
    ],
    "monitoring.metrics.beat.cpu.system.time.ms": [
      376
    ],
    "monitoring.metrics.metricbeat.system.socket_summary.success": [
      3
    ],
    "monitoring.metrics.metricbeat.system.network.success": [
      12
    ],
    "log.offset": [
      220314
    ],
    "data_stream.type": [
      "logs"
    ],
    "monitoring.metrics.libbeat.output.events.total": [
      248
    ],
    "monitoring.metrics.beat.handles.limit.soft": [
      1024
    ],
    "ecs.version": [
      "1.8.0"
    ],
    "agent.version": [
      "7.13.2"
    ],
    "monitoring.metrics.libbeat.output.events.batches": [
      9
    ],
    "host.os.family": [
      "debian"
    ],
    "monitoring.metrics.metricbeat.system.raid.events": [
      3
    ],
    "monitoring.metrics.metricbeat.system.entropy.events": [
      3
    ],
    "monitoring.metrics.beat.cpu.user.time.ms": [
      490
    ],
    "monitoring.metrics.beat.cgroup.memory.mem.usage.bytes": [
      81920
    ],
    "monitoring.metrics.beat.cgroup.cpuacct.total.ns": [
      1159162390
    ],
    "monitoring.metrics.system.load.norm.1": [
      0.27
    ],
    "monitoring.metrics.system.load.norm.5": [
      0.12
    ],
    "monitoring.metrics.beat.cpu.user.ticks": [
      2490670
    ],
    "monitoring.ecs.version": [
      "1.6.0"
    ],
    "monitoring.metrics.metricbeat.system.process_summary.events": [
      3
    ],
    "monitoring.metrics.metricbeat.system.memory.events": [
      3
    ],
    "monitoring.metrics.libbeat.output.read.bytes": [
      62838
    ],
    "monitoring.metrics.system.load.15": [
      0.11
    ],
    "monitoring.metrics.beat.cpu.total.time.ms": [
      866
    ],
    "monitoring.metrics.beat.info.ephemeral_id": [
      "c8914934-42d8-4d7a-9ae5-bda5db39d2e0"
    ],
    "monitoring.metrics.libbeat.config.module.running": [
      17
    ],
    "host.os.kernel": [
      "4.15.0-147-generic"
    ],
    "monitoring.metrics.libbeat.output.events.acked": [
      248
    ],
    "monitoring.metrics.metricbeat.system.process.events": [
      18
    ],
    "log.origin.file.line": [
      144
    ],
    "monitoring.metrics.metricbeat.system.socket_summary.events": [
      3
    ],
    "data_stream.namespace": [
      "default"
    ],
    "message": [
      "Non-zero metrics in the last 30s"
    ],
    "monitoring.metrics.beat.handles.limit.hard": [
      4096
    ],
    "event.dataset": [
      "elastic_agent.metricbeat"
    ]
  },
  "sort": [
    1625669073126
  ]
}


elasticmachine commented 3 years ago

Pinging @elastic/agent (Team:Agent)

ruflin commented 3 years ago

Beats logs its metrics every 30s. This was built before there was even a metrics endpoint in Beats. Elastic Agent behaves as expected here, as it just "tails" the log file. Instead, we should likely disable the logging of these metrics, which I think can be done through a config option.

I'm removing the bug label as I don't consider this a bug.
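For reference, the periodic metrics logging ruflin refers to is controlled by the Beat's logging.metrics settings. A minimal sketch of turning it off in a standalone Beat configuration (illustrative only; check the Beats logging docs for the exact options in your version):

logging.metrics:
  enabled: false   # suppress the periodic "Non-zero metrics in the last 30s" log entries
  # period: 30s    # reporting interval when enabled; 30s is the default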

michel-laterman commented 3 years ago

We'll add a config option to disable logging metrics. However, the default behaviour will not change, as we may need access to these metrics (through the log files) to help us debug issues.

michel-laterman commented 3 years ago

@ruflin, my PR adds a setting that can be used in stand-alone mode to stop the Beats from emitting metrics. If we want this enabled in Fleet mode, the setting should be passed as part of the policy and set through Kibana. I'm not sure which project (Fleet/Kibana) is responsible for doing that, or where to file the issue. If we want a short-term workaround for Fleet mode, we can have the agent pick it up as part of fleet.yml; however, this would still require a user to edit the file manually.

ruflin commented 3 years ago

I don't think we should build any short-term hacks around this, and I'm also not sure about the urgency. @jen-huang As soon as there is a config option, we could likely use it as the default in the policy?

jen-huang commented 3 years ago

Each agent policy needs to explicitly declare which agent monitoring options should be enabled; by default this is stored as ["logs", "metrics"] on the policy. We can certainly add this new monitoring option and enable it by default, but it will only kick in for new agent policies.

We could add a migration in 7.15 to add it to existing policies, though. But I think we will want some conditional logic there; I suppose log_metrics should be enabled when the policy also has metrics monitoring enabled?
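For illustration, the monitoring options jen-huang mentions map to something like the following in a standalone agent policy (a sketch only; the exact shape of the Fleet-generated policy may differ):

agent.monitoring:
  enabled: true
  logs: true      # collect and ship agent/Beats logs
  metrics: true   # collect and ship agent/Beats metrics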

michel-laterman commented 3 years ago

They are decoupled; it's possible to enable log_metrics and not metrics. However, we have an agent.logging.metrics.enabled setting that we may reuse and pass to the Beats instead of adding a new one.

The default settings will not change from the current behaviour (metrics appear in logs).

michel-laterman commented 3 years ago

We are reusing agent.logging.metrics.enabled instead of introducing another setting. The default (true) has not changed. If it's set to false, then the elastic-agent and all Beats running under it will not have the metrics entries appear in the logs.
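As a rough usage sketch, the resulting setting in a standalone elastic-agent.yml would look like this (nesting and placement shown for illustration only; the setting name comes from the comments above):

agent:
  logging:
    metrics:
      enabled: false   # default is true; false stops the agent and its Beats from writing metrics entries to the logs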