Describe the enhancement

We currently rely on custom tooling to collect several metrics that give us visibility into the stability of Filebeat, as I mentioned in #33206.

We would like to see these metrics integrated natively. This would greatly simplify our workflow and standardize data collection for Filebeat instances on both bare metal and Kubernetes pods.

The proposed enhancement consists of three features that improve visibility into the state of Filebeat. The main goal is to be able to tell whether Filebeat is working as expected.

Describe a specific use case for the enhancement or feature:

In this section I will describe each metric and how we would like it integrated. The end goal is to feed these new metrics into our alerting systems so that we can react quickly to any bad state.

New Feature: Heartbeat

First of all, we currently have a cron job that writes a message to a log file every x minutes. Filebeat tails this log file and ships the events to our infrastructure. This gives us a good overview of the log collection status by ensuring that logs flow continuously. However, it currently requires external components.

We would love to see this handled directly by Filebeat, for instance enabled through the configuration.
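For illustration, the external heartbeat described above boils down to something like the following minimal sketch. The function name `write_heartbeat`, the log path, and the line format are hypothetical and chosen only for this example; they are not part of Filebeat.

```python
import time
from pathlib import Path


def write_heartbeat(log_path, now=None):
    """Append one timestamped heartbeat line to log_path and return that line.

    The "<epoch seconds> filebeat-heartbeat" format is an assumption made for
    this sketch; any line that Filebeat can tail would work the same way.
    """
    ts = int(time.time() if now is None else now)
    line = f"{ts} filebeat-heartbeat"
    path = Path(log_path)
    # Create the parent directory on first use so a cron job can run unattended.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return line
```

A cron entry would call `write_heartbeat("/var/log/heartbeat/heartbeat.log")` (hypothetical path) every x minutes, and an alert would fire when no heartbeat event reaches the infrastructure within the expected window.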

New Metric: Last Registry Update Time

Following an incident with a stalled Filebeat that was still attempting to send data, a non-updated registry seems to be a good indicator of a bad state that should be investigated ASAP.

We are currently retrieving the last update time through the command stat -c %Z /var/lib/filebeat/registry/filebeat/log.json, exported once again by custom tools.
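As a rough sketch of the check our tooling performs with that timestamp, the following reads the same value as `stat -c %Z` (the status change time, `st_ctime` on Linux) and compares it to a threshold. The function names and the threshold are hypothetical, used only for illustration.

```python
import os
import time


def registry_age_seconds(path, now=None):
    """Seconds elapsed since the file's last status change.

    On Linux, st_ctime is the value printed by `stat -c %Z`.
    """
    changed = os.stat(path).st_ctime
    current = time.time() if now is None else now
    # Clamp at zero in case of small clock differences.
    return max(0.0, current - changed)


def is_registry_stale(path, max_age_seconds, now=None):
    """True when the registry has not changed within max_age_seconds."""
    return registry_age_seconds(path, now) > max_age_seconds
```

For example, `is_registry_stale("/var/lib/filebeat/registry/filebeat/log.json", 600)` would flag a registry untouched for more than ten minutes; the right threshold depends on how often the monitored logs are expected to change.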

Once again, having this data available directly in Filebeat would be great. For instance, integrated in the /stats results, it could look like the following:

{
  "beat": {
    "info": {
      "ephemeral_id": "62e0e489-14c5-4cbd-a87a-f2ebf4643a7a",
      "name": "filebeat",
      "uptime": {
        "ms": 205465136
      },
      "version": "8.3.3",
      "registry_update": {
        "timestamp": 1664896065
      }
    }
  }
}

New Metric: Kafka Connectivity Status

In the same vein, we currently monitor the connectivity state by parsing the output of filebeat -e -c /etc/filebeat/filebeat.yml test output to ensure that all Kafka brokers can be contacted.

It would help tremendously either to have this kind of recurring check built into Filebeat, or simply to have Filebeat keep track of the number of brokers in each state, independently of the configuration.
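To illustrate the kind of per-state count we have in mind, here is a minimal sketch that tallies brokers by reachability. Note that it swaps in a plain TCP connect instead of parsing the `test output` text, so it only shows that a port accepts connections, not that the Kafka protocol handshake succeeds. The function name `broker_states` and the broker list are hypothetical.

```python
import socket


def broker_states(brokers, timeout=2.0):
    """Count brokers by TCP reachability.

    brokers is an iterable of (host, port) pairs. A successful TCP connect
    counts as "connected"; any socket error or timeout counts as "failed".
    """
    counts = {"connected": 0, "failed": 0}
    for host, port in brokers:
        try:
            # The connection is closed immediately; only reachability matters.
            with socket.create_connection((host, port), timeout=timeout):
                counts["connected"] += 1
        except OSError:
            counts["failed"] += 1
    return counts
```

Calling `broker_states([("kafka-1.example", 9092), ("kafka-2.example", 9092)])` (hypothetical hosts) returns a dictionary with the number of brokers in each state, which is the shape of data we would like Filebeat to expose itself.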

As before, integrated in the /stats results, this could look like the following:

{
  "libbeat": {
    "output": {
      "events": {},
      "read": {},
      "type": "kafka",
      "write": {},
      "brokers": {
        "pending": 1,
        "failed": 0,
        "connected": 2
      }
    }
  }
}

Let me know if you need more details.

Best regards, Antoine.

botelastic[bot] commented 2 years ago

This issue doesn't have a Team:<team> label.

botelastic[bot] commented 1 year ago

Hi! We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:. Thank you for your contribution!

aveuiller commented 1 year ago

👍

botelastic[bot] commented 1 month ago