elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

[Elastic Agent] Beats don't report unhealthy state if cannot connect to output #39801

Open · cmacknz opened this issue 7 months ago

cmacknz commented 7 months ago

Originally reported by @juliaElastic:

Discovered during development of remote ES output: https://github.com/elastic/fleet-server/pull/3051#issuecomment-1820608162

I noticed while testing that when the remote output is not accessible, the Agent does not go into an unhealthy state. The connection errors are logged, but the Agent reports a Healthy state on all units.

According to @AndersonQ this is a known issue:

thanks, I had a look at it and talked to the team, and there seems to be a known issue with the Beats not properly reporting or updating their status when an output unit is failing :/

This is how I tested:

Here are the agent diagnostics from my local environment: elastic-agent-diagnostics-2023-11-22T10-03-07Z-00.zip

Error log:

2023-11-21 11:07:00 {"log.level":"error","@timestamp":"2023-11-21T10:07:00.367Z","message":"Error dialing dial tcp 192.168.64.1:9202: connect: connection refused","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"http/metrics-monitoring","type":"http/metrics"},"log":{"source":"http/metrics-monitoring"},"ecs.version":"1.6.0","address":"192.168.64.1:9202","log.logger":"esclientleg","log.origin":{"file.line":38,"file.name":"transport/logging.go"},"service.name":"metricbeat","network":"tcp","ecs.version":"1.6.0"}

Agent component health

 "components": [
    {
      "id": "system/metrics-default",
      "type": "system/metrics",
      "status": "HEALTHY",
      "message": "Healthy: communicating with pid '119'",
      "units": [
        {
          "id": "system/metrics-default",
          "type": "output",
          "status": "HEALTHY",
          "message": "Healthy"
        },
        {
          "id": "system/metrics-default-system/metrics-system-689768c6-bfc8-4026-b6cb-da91e1b587c9",
          "type": "input",
          "status": "HEALTHY",
          "message": "Healthy"
        }
      ]
    },
    {
      "id": "log-default",
      "type": "log",
      "status": "HEALTHY",
      "message": "Healthy: communicating with pid '121'",
      "units": [
        {
          "id": "log-default",
          "type": "output",
          "status": "HEALTHY",
          "message": "Healthy"
        },
        {
          "id": "log-default-logfile-system-689768c6-bfc8-4026-b6cb-da91e1b587c9",
          "type": "input",
          "status": "HEALTHY",
          "message": "Healthy"
        }
      ]
    },
    {
      "id": "beat/metrics-monitoring",
      "type": "beat/metrics",
      "status": "HEALTHY",
      "message": "Healthy: communicating with pid '183'",
      "units": [
        {
          "id": "beat/metrics-monitoring",
          "type": "output",
          "status": "HEALTHY",
          "message": "Healthy"
        },
        {
          "id": "beat/metrics-monitoring-metrics-monitoring-beats",
          "type": "input",
          "status": "HEALTHY",
          "message": "Healthy"
        }
      ]
    },
    {
      "id": "http/metrics-monitoring",
      "type": "http/metrics",
      "status": "HEALTHY",
      "message": "Healthy: communicating with pid '184'",
      "units": [
        {
          "id": "http/metrics-monitoring-metrics-monitoring-agent",
          "type": "input",
          "status": "HEALTHY",
          "message": "Healthy"
        },
        {
          "id": "http/metrics-monitoring",
          "type": "output",
          "status": "HEALTHY",
          "message": "Healthy"
        }
      ]
    },
    {
      "id": "filestream-monitoring",
      "type": "filestream",
      "status": "HEALTHY",
      "message": "Healthy: communicating with pid '185'",
      "units": [
        {
          "id": "filestream-monitoring-filestream-monitoring-agent",
          "type": "input",
          "status": "HEALTHY",
          "message": "Healthy"
        },
        {
          "id": "filestream-monitoring",
          "type": "output",
          "status": "HEALTHY",
          "message": "Healthy"
        }
      ]
    }
  ],
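
For illustration, here is a minimal, self-contained Go sketch of the missing piece: translating the output's connection error into the reported state of the output unit, instead of only logging it. The `UnitReporter` interface and the state names below are hypothetical stand-ins, not the actual Beats or elastic-agent-client API.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// UnitState mirrors the coarse health states shown in the diagnostics above.
// These names are illustrative, not the real control-protocol constants.
type UnitState string

const (
	StateHealthy  UnitState = "HEALTHY"
	StateDegraded UnitState = "DEGRADED"
	StateFailed   UnitState = "FAILED"
)

// UnitReporter is a hypothetical hook for pushing a unit's state back to the
// Elastic Agent instead of only logging the error.
type UnitReporter interface {
	UpdateState(state UnitState, message string)
}

// reportOutputError maps a publish/connect error onto the output unit's state.
// Today errors like "dial tcp ...: connection refused" are only logged; a call
// path along these lines is what this issue asks for.
func reportOutputError(r UnitReporter, err error) {
	if err == nil {
		r.UpdateState(StateHealthy, "Healthy")
		return
	}
	var opErr *net.OpError
	if errors.As(err, &opErr) {
		// Connection-level failure: the output is unreachable.
		r.UpdateState(StateDegraded, fmt.Sprintf("cannot connect to output: %v", err))
		return
	}
	r.UpdateState(StateFailed, fmt.Sprintf("output error: %v", err))
}

// logReporter just prints the state change; a real implementation would send
// it to the Agent over the control protocol.
type logReporter struct{}

func (logReporter) UpdateState(s UnitState, msg string) {
	fmt.Printf("unit state=%s message=%q\n", s, msg)
}

func main() {
	// Simulate the "connection refused" error from the log above.
	conn, err := net.DialTimeout("tcp", "192.168.64.1:9202", time.Second)
	if conn != nil {
		conn.Close()
	}
	reportOutputError(logReporter{}, err)
}
```
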
elasticmachine commented 7 months ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

cmacknz commented 7 months ago

We should generalize this to each output, not just Elasticsearch. That likely requires three separate implementations.

We should also likely debounce this implementation. We don't want agents appearing unhealthy because they couldn't connect to Elasticsearch for 100 ms if the problem fixes itself.
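
To make the debounce idea concrete, here is a small Go sketch of one way it could work; all names are illustrative, not an existing Beats API. The reporter only flips the unit to unhealthy once the output has been failing continuously for longer than a grace period, and flips back to healthy on the first successful publish attempt.

```go
package main

import (
	"fmt"
	"time"
)

// debouncedHealth holds back "unhealthy" reports until the output has been
// failing continuously for longer than gracePeriod, so a transient blip
// (e.g. a 100 ms connection hiccup) never reaches Fleet.
type debouncedHealth struct {
	gracePeriod  time.Duration
	failingSince time.Time // zero value means "not currently failing"
	report       func(healthy bool, msg string)
}

// observe is called after every publish attempt with its error (nil on success).
func (d *debouncedHealth) observe(now time.Time, err error) {
	if err == nil {
		// Recover immediately on the first success.
		d.failingSince = time.Time{}
		d.report(true, "Healthy")
		return
	}
	if d.failingSince.IsZero() {
		d.failingSince = now // start of the current failure streak
	}
	if now.Sub(d.failingSince) >= d.gracePeriod {
		d.report(false, fmt.Sprintf("output failing since %s: %v",
			d.failingSince.Format(time.RFC3339), err))
	}
	// Otherwise stay quiet: the failure is still inside the grace period.
}

func main() {
	h := &debouncedHealth{
		gracePeriod: 30 * time.Second,
		report: func(healthy bool, msg string) {
			fmt.Printf("healthy=%v message=%q\n", healthy, msg)
		},
	}
	start := time.Now()
	h.observe(start, fmt.Errorf("connection refused"))                      // suppressed: inside grace period
	h.observe(start.Add(45*time.Second), fmt.Errorf("connection refused"))  // reported unhealthy
	h.observe(start.Add(60*time.Second), nil)                               // reported healthy again
}
```

A debouncer like this could wrap any of the output types (Elasticsearch, Logstash, Kafka), which would keep the per-output work limited to detecting the failure itself.
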

cmacknz commented 6 months ago

There are output errors we can detect today that I don't think are shown obviously in the Fleet UI: https://github.com/elastic/elastic-agent/issues/3959#issuecomment-1874146331

nimarezainia commented 6 months ago

There are output errors we can detect today that I don't think are shown obviously in the Fleet UI: #3959 (comment)

@cmacknz those are configuration related, are they not? They shouldn't be an issue for Fleet-managed agents (which the display would refer to). But I agree that if we are able to detect other errors we should certainly display them on the agent details page.

Should this all be included as part of https://github.com/elastic/ingest-dev/issues/1594 ?

cmacknz commented 6 months ago

@cmacknz those are configuration related, are they not? They shouldn't be an issue for Fleet-managed agents (which the display would refer to). But I agree that if we are able to detect other errors we should certainly display them on the agent details page.

Should this all be included as part of https://github.com/elastic/ingest-dev/issues/1594 ?

Agree that the Fleet-managed configuration would help avoid this error, but if it did happen there is nowhere in the Fleet UI to display the error. It looks like https://github.com/elastic/ingest-dev/issues/1594#issuecomment-1761795157 does cover this.

elasticmachine commented 1 month ago

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)