elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Fleet] Implement per-integration health reporting for output #159300

Open jlind23 opened 1 year ago

jlind23 commented 1 year ago

Describe the feature: In the issue "Agent in unhealthy state status should provide more information on what components are failing or affected such as names", health reporting was implemented for every input but not for their outputs.

The purpose of this issue is to define how the output status should be displayed. As of today, each input can have an output attached to it.

image

Initial UI design

Figma link | Prototype

With the shipper project ongoing, will this change?

@nimarezainia @cmacknz happy to hear your thoughts on this.

elasticmachine commented 1 year ago

Pinging @elastic/fleet (Team:Fleet)

cmacknz commented 1 year ago

The health reporting data already reports output state independently from inputs. Each input is always paired with an output, and today that output is an independent instance of the output configured in the Fleet UI. For example, each Beat started by the agent has its own connection to Elasticsearch. Here's an example of what is reported today for an agent running a system/metrics input (note that I have manually replaced some numeric states with their names).

{
    "state": 2,
    "message": "Running",
    "components": [
        {
            "id": "system/metrics-default",
            "name": "system/metrics",
            "state": "HEALTHY",
            "message": "Healthy: communicating with pid '34165'",
            "units": [
                {
                    "unit_id": "system/metrics-default-system/metrics-system-4f510cb9-2f4e-4b81-8a19-9969abe1c924",
                    "unit_type": "INPUT",
                    "state": "HEALTHY",
                    "message": "Healthy"
                },
                {
                    "unit_id": "system/metrics-default",
                    "unit_type": "OUTPUT",
                    "state": "HEALTHY",
                    "message": "Healthy"
                }
            ]
        }
    ]
}
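
As a minimal sketch of how a consumer of this payload could separate output health from input health, the Python below walks `components[].units[]` and collects the units whose `unit_type` is `OUTPUT`. The field names come from the example above; the helper name `output_unit_states` and the trimmed sample payload are illustrative assumptions, not part of Fleet or Elastic Agent.

```python
import json

# Trimmed copy of the payload above (illustrative; component message omitted).
SAMPLE = json.loads("""
{
    "state": 2,
    "message": "Running",
    "components": [
        {
            "id": "system/metrics-default",
            "name": "system/metrics",
            "state": "HEALTHY",
            "units": [
                {"unit_id": "system/metrics-default-system/metrics-system-4f510cb9-2f4e-4b81-8a19-9969abe1c924",
                 "unit_type": "INPUT", "state": "HEALTHY", "message": "Healthy"},
                {"unit_id": "system/metrics-default",
                 "unit_type": "OUTPUT", "state": "HEALTHY", "message": "Healthy"}
            ]
        }
    ]
}
""")


def output_unit_states(status):
    """Return {component id: state} for every OUTPUT unit (hypothetical helper)."""
    states = {}
    for component in status.get("components", []):
        for unit in component.get("units", []):
            if unit.get("unit_type") == "OUTPUT":
                states[component["id"]] = unit.get("state")
    return states


print(output_unit_states(SAMPLE))
```

Keying the result by component id reflects that each component carries its own output instance, so a single output configured in Fleet can map to several entries here.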

When we introduce the shipper, the biggest change will be an intermediate shipper output for inputs that connect to it, with the shipper's output being the "true" output configured in the UI. Here's what this looks like in diagram form:

image

When we configure an agent to use the shipper, the set of component health information reported to Fleet will include a new shipper component representing the shipper process. The shipper itself will have a list of input units representing each input that should connect to it, and an output which is the actual output, such as Elasticsearch. Each input still has an output, but it is now an internal output to the shipper. The state here still matters as it can fail, but it no longer represents what it did before.

{
    "state": 2,
    "message": "Running",
    "components": [
        {
            "id": "system/metrics-default",
            "name": "system/metrics",
            "state": "HEALTHY",
            "message": "Healthy: communicating with pid '78205'",
            "units": [
                {
                    "unit_id": "system/metrics-default-system/metrics-system-4f510cb9-2f4e-4b81-8a19-9969abe1c924",
                    "unit_type": "INPUT",
                    "state": "HEALTHY",
                    "message": "Healthy"
                },
                {
                    "unit_id": "system/metrics-default",
                    "unit_type": "OUTPUT",
                    "state": "HEALTHY",
                    "message": "Healthy"
                }
            ]
        },
        {
            "id": "shipper-default",
            "name": "shipper",
            "state": "HEALTHY",
            "message": "Failed: pid '78348' exited with code '1'",
            "units": [
                {
                    "unit_id": "system/metrics-default",
                    "unit_type": "INPUT",
                    "state": "HEALTHY",
                    "message": "Healthy"
                },
                {
                    "unit_id": "log-default",
                    "unit_type": "INPUT",
                    "state": "HEALTHY",
                    "message": "Healthy"
                },
                {
                    "unit_id": "shipper-default",
                    "unit_type": "OUTPUT",
                    "state": "HEALTHY",
                    "message": "Healthy"
                }
            ]
        }
    ]
}

The key things to take away from this are:

  1. Without the shipper, each running input has a completely independent output state we need to show. There is a one-to-many mapping between the output as configured in Fleet and the output statuses reported by the agent.
  2. With the shipper, each input still has an independent output but now it only represents that input's internal connection to the shipper. There is now an additional shipper component which is responsible for communicating with the external output, like Elasticsearch.
  3. One thing I didn't cover in the examples above is that using the shipper is optional. While we transition to the shipper-based architecture, the two situations above will co-exist, with some inputs using their own Elasticsearch output and some using the shipper. We should clearly indicate in the UI which inputs are connected to the shipper so that this is obvious to the user.
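
To make the third point concrete, here is a rough Python sketch of how a UI could work out which components send through the shipper. It assumes, based only on the example payload above, that the shipper component is the one named "shipper" and that its INPUT units carry the ids of the components connected to it. The helper name `split_by_shipper` and the "filestream-default" component are made up for illustration.

```python
def split_by_shipper(status):
    """Return (via_shipper, direct) lists of component ids (hypothetical helper)."""
    components = status.get("components", [])
    # Assumption: the shipper component is the one named "shipper", and its
    # INPUT unit ids match the ids of the components that connect to it.
    shipper_inputs = {
        unit["unit_id"]
        for component in components
        if component.get("name") == "shipper"
        for unit in component.get("units", [])
        if unit.get("unit_type") == "INPUT"
    }
    via_shipper, direct = [], []
    for component in components:
        if component.get("name") == "shipper":
            continue  # the shipper itself is neither category
        target = via_shipper if component["id"] in shipper_inputs else direct
        target.append(component["id"])
    return via_shipper, direct


# Reduced payload: ids and unit types only, mirroring the example above,
# plus a made-up "filestream-default" component that bypasses the shipper.
status = {
    "components": [
        {"id": "system/metrics-default", "name": "system/metrics", "units": []},
        {"id": "filestream-default", "name": "filestream", "units": []},
        {"id": "shipper-default", "name": "shipper", "units": [
            {"unit_id": "system/metrics-default", "unit_type": "INPUT"},
            {"unit_id": "log-default", "unit_type": "INPUT"},
            {"unit_id": "shipper-default", "unit_type": "OUTPUT"},
        ]},
    ]
}

print(split_by_shipper(status))
```

For components in the first list, the OUTPUT unit only describes the internal connection to the shipper, so the health of the external output would have to be read from the shipper component instead.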

kpollich commented 10 months ago

I'm trying to explore the agent status data in .fleet_agents related to outputs, and I don't really know how to get an example of an unhealthy output. I tried setting up an ES output at a URL that doesn't resolve, expecting the agent to eventually report that this output was unhealthy. However, my agent is at 18 reconnect attempts (with, I assume, some level of backoff between retries) without a change in status.

cmacknz commented 10 months ago

> I tried setting up an ES output at a URL that doesn't resolve, expecting agent to eventually report this output was unhealthy. However, my agent is at 18 reconnect attempts (with I assume some level of backoff between retries) without a change in status.

The only condition we have hooked up to the output state today is if the output cannot be created at all, which I think would only happen if the configuration were invalid (possibly invalid YAML syntax would do this). https://github.com/elastic/beats/pull/36183

We don't have any health reporting set up for runtime failures of the output, like inability to communicate with ES, wrong credentials, TLS errors, etc.

nimarezainia commented 10 months ago

> > I tried setting up an ES output at a URL that doesn't resolve, expecting agent to eventually report this output was unhealthy. However, my agent is at 18 reconnect attempts (with I assume some level of backoff between retries) without a change in status.
>
> The only condition we have hooked up to the output state today is if the output cannot be created at all, which I think would only happen if the configuration were invalid (possibly invalid YAML syntax would do this). elastic/beats#36183
>
> We don't have any health reporting set up for runtime failures of the output, like inability to communicate with ES, wrong credentials, TLS errors, etc.

Just to clarify: if the output URL is no longer reachable, we have no way of reporting that today, correct?

cmacknz commented 10 months ago

Yes, for Beats it is unimplemented. We had an implementation for the v2 Elasticsearch output we were working on but it was never ported over to the existing Beats.

kpollich commented 10 months ago

> The only condition we have hooked up to the output state today is if the output cannot be created at all, which I think would only happen if the configuration were invalid (possibly invalid YAML syntax would do this). https://github.com/elastic/beats/pull/36183
>
> We don't have any health reporting set up for runtime failures of the output, like inability to communicate with ES, wrong credentials, TLS errors, etc.

With this in mind, I don't see a reason to prioritize this UI work as we won't be reporting enough output statuses to drive a meaningful output health UI.