jsoriano opened this issue 3 years ago (Open)
The user-facing part of this (i.e. the Kibana interface) could also benefit from a bit more clarification. For instance, in the image below, what should I be worried about? Do I have dropped events that got lost forever without being logged? How does "Output Errors" relate to "Fail Rates"?
After reading the discussion and the docs I still don't know the exact meaning of those metrics in Stack Monitoring. I think we all agree that one of the most important questions a user has is: "Do I lose data or not?" I still can't answer this question. For example: I'm using Packetbeat to gather DNS traffic, and from time to time the "Failed in Pipeline" counter jumps from 0 to 1000, and as stated above I'm asking myself: do I lose those DNS queries or not?
Also, I don't really get the difference between the two graphs "Fail Rates" and "Output Errors". To me it sounds like they both show the same thing, but they probably don't?
I would also like to know the meaning of these values, especially beat.stats.libbeat.output.events.dropped: if that is larger than 0, am I losing data?
Inside the libbeat code, the output Stats struct holds these counters:
type Stats struct {
    //
    // Output event stats
    //
    batches    *monitoring.Uint // total number of batches processed by output
    events     *monitoring.Uint // total number of events processed by output
    acked      *monitoring.Uint // total number of events ACKed by output
    failed     *monitoring.Uint // total number of events failed in output
    active     *monitoring.Uint // events sent and waiting for ACK/fail from output
    duplicates *monitoring.Uint // total number of events reported by the output as duplicates (ID already indexed)
    dropped    *monitoring.Uint // total number of invalid events dropped by the output
    tooMany    *monitoring.Uint // total number of "too many requests" replies from output

    //
    // Output network connection stats
    //
    writeBytes  *monitoring.Uint // total amount of bytes written by output
    writeErrors *monitoring.Uint // total number of errors on write
    readBytes   *monitoring.Uint // total amount of bytes read
    readErrors  *monitoring.Uint // total number of errors while waiting for response on output
}
When you query Elasticsearch for the libbeat stats (see below), Output Errors is derived as the delta between readErrors + writeErrors at the earliest timestamp and readErrors + writeErrors at the latest timestamp. Going by the code comments, then, Output Errors counts failed network writes and reads on the output connection rather than individual events.
The example below uses apm-server as the beat type, but you can replace it to suit your needs.
GET _search
{
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              { "term": { "data_stream.dataset": "beats.stats" } },
              { "term": { "metricset.name": "stats" } },
              { "term": { "type": "beats_stats" } }
            ]
          }
        },
        { "term": { "cluster_uuid": "CLUSTER_UUID" } },
        {
          "range": {
            "beats_stats.timestamp": {
              "format": "epoch_millis",
              "gte": 1665053615330,
              "lte": 1665054515330
            }
          }
        },
        {
          "bool": {
            "must": { "term": { "beats_stats.beat.type": "apm-server" } }
          }
        }
      ]
    }
  },
  "collapse": {
    "field": "beats_stats.metrics.beat.info.ephemeral_id",
    "inner_hits": {
      "name": "earliest",
      "size": 1,
      "sort": [
        { "beats_stats.timestamp": { "order": "asc", "unmapped_type": "long" } },
        { "@timestamp": { "order": "asc", "unmapped_type": "long" } }
      ]
    }
  },
  "sort": [
    { "beats_stats.beat.uuid": { "order": "asc", "unmapped_type": "long" } },
    { "timestamp": { "order": "desc", "unmapped_type": "long" } }
  ]
}
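To make the delta concrete: readErrors and writeErrors are cumulative totals since the Beat started, so the value shown for a time range is simply the latest sum minus the earliest sum. As a hedged sketch, the same numbers can also be pulled per interval with aggregations; the field paths under beats_stats.metrics.libbeat.output.* and the one-minute interval are assumptions here, and counter resets on restarts (new ephemeral_id) are ignored:
GET _search
{
  "size": 0,
  "query": {
    "term": { "beats_stats.beat.type": "apm-server" }
  },
  "aggs": {
    "over_time": {
      "date_histogram": {
        "field": "beats_stats.timestamp",
        "fixed_interval": "1m"
      },
      "aggs": {
        // assumed field paths mirroring the Beat's libbeat.output.write.errors / read.errors counters
        "write_errors": { "max": { "field": "beats_stats.metrics.libbeat.output.write.errors" } },
        "read_errors": { "max": { "field": "beats_stats.metrics.libbeat.output.read.errors" } },
        // per-interval increase of each cumulative counter
        "write_errors_delta": { "derivative": { "buckets_path": "write_errors" } },
        "read_errors_delta": { "derivative": { "buckets_path": "read_errors" } }
      }
    }
  }
}
(The // comments are Kibana Console style; strip them if you send the request with curl.)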
@6fears7 So, do we have event loss or not? I don't have any other errors, like 429.
I've added beat.stats.libbeat.output.events.dropped to my Agent dashboard and I get a daily alarm telling me, e.g.:
monitor filebeat drops - * is in a state of ALERT
Reason: beat.stats.libbeat.output.events.dropped is 3,313,220.83422 in the last 1 day for all hosts. Alert when > 0.
Those 3 million events per day are quite a lot from my customer's point of view.
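Worth noting here, as a hedged aside: libbeat's output.events.dropped is a cumulative total since the Beat started, so a rule that alerts on the raw value "in the last 1 day" is not the same as the number of events dropped that day. A sketch of how to get the per-day increase instead, assuming the documents live in metrics-* and that host.name identifies a single Beat instance (both assumptions; counter resets on restarts are ignored):
GET metrics-*/_search
{
  "size": 0,
  "query": {
    "exists": { "field": "beat.stats.libbeat.output.events.dropped" }
  },
  "aggs": {
    "per_instance": {
      // host.name is an assumed grouping field; use whatever identifies one Beat in your data
      "terms": { "field": "host.name" },
      "aggs": {
        "per_day": {
          "date_histogram": { "field": "@timestamp", "calendar_interval": "1d" },
          "aggs": {
            "dropped_total": { "max": { "field": "beat.stats.libbeat.output.events.dropped" } },
            "dropped_per_day": { "derivative": { "buckets_path": "dropped_total" } }
          }
        }
      }
    }
  }
}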
@6fears7 Well, that still doesn't answer whether I have lost events or not. An "error" doesn't necessarily mean that I have lost events; I don't know whether, after those errors occur, some mechanism retries those faulty events or not. Telling users just "Error" without clearly stating the consequences doesn't help. A graph that says "Events lost" would be clear and helpful, in my opinion.
In the Elasticsearch output's client.go, we start with this struct:
type bulkResultStats struct {
    acked        int // number of events ACKed by Elasticsearch
    duplicates   int // number of events failed with `create` due to ID already being indexed
    fails        int // number of failed events (can be retried)
    nonIndexable int // number of failed events (not indexable)
    tooMany      int // number of events receiving HTTP 429 Too Many Requests
}
Later, we see what constitutes a drop:
failed := len(failedEvents)
span.Context.SetLabel("events_failed", failed)
if st := client.observer; st != nil {
    dropped := stats.nonIndexable
    duplicates := stats.duplicates
    acked := len(data) - failed - dropped - duplicates

    st.Acked(acked)
    st.Failed(failed)
    st.Dropped(dropped)
    st.Duplicate(duplicates)
    st.ErrTooMany(stats.tooMany)
}
So dropped events would be those of the nonIndexable type.
In order to determine what a "nonIndexable" type is, the code iterates through the Bulk results:
...
if status < 500 {
    if status == http.StatusTooManyRequests {
        stats.tooMany++
    } else {
        // hard failure, apply policy action
        result, _ := data[i].Content.Meta.HasKey(dead_letter_marker_field)
        if result {
            stats.nonIndexable++
            client.log.Errorf("Can't deliver to dead letter index event %#v (status=%v): %s", data[i], status, msg)
            // poison pill - this will clog the pipeline if the underlying failure is non transient.
        } else if client.NonIndexableAction == dead_letter_index {
            client.log.Warnf("Cannot index event %#v (status=%v): %s, trying dead letter index", data[i], status, msg)
            if data[i].Content.Meta == nil {
                data[i].Content.Meta = common.MapStr{
                    dead_letter_marker_field: true,
                }
            } else {
                data[i].Content.Meta.Put(dead_letter_marker_field, true)
            }
            data[i].Content.Fields = common.MapStr{
                "message":       data[i].Content.Fields.String(),
                "error.type":    status,
                "error.message": string(msg),
            }
        } else { // drop
            stats.nonIndexable++
            client.log.Warnf("Cannot index event %#v (status=%v): %s, dropping event!", data[i], status, msg)
            continue
        }
So an event that cannot be indexed is either retried against the dead letter index (it gets tagged with dead_letter_marker_field when dead_letter_index is configured as the non-indexable action) or otherwise counted as nonIndexable and dropped from the pipeline.
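To put rough numbers on that accounting (a hypothetical bulk response, assuming dead-letter delivery is not configured): if a batch of 100 events comes back with 2 events rejected with HTTP 429, 3 rejected as duplicates and 5 rejected with mapping errors, then tooMany = 2, duplicates = 3, dropped = 5 (the nonIndexable ones), failed = 2, and acked = 100 - 2 - 5 - 3 = 90. Only the 5 dropped events are actually gone; the 2 failed (429) events stay in failedEvents and get retried, and the 3 duplicates were already indexed, so neither of those means data loss.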
To answer your question, I do believe those events are worth looking into. My first guess would be to check how the events are being parsed by the pipeline and whether some grok or mapping work is needed.
Thank you for clarifying, @6fears7.
I use official pipelines from Fleet integrations but also custom ones. I also use dissect and the rest of the Filebeat processors (a DNS processor is still missing among the Elasticsearch ingest processors). @sachin-frayne Can you help now to further identify those events and/or check the pipelines, as suggested above?
My issue is with:
"beat.stats.libbeat.output.events.dropped": [ 23671 ],
and not with:
"beat.stats.libbeat.pipeline.events.dropped": [ 0 ],
@6fears7 Does this mean that my events are not dropped in the pipelines?
So, I get a lot of such messages:
In the above case they come from an official Fortinet integration, but I also see others from my custom parsing:
"message": "Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Date(2022, time.October, 9, 7, 17, 38, 79000000, time.UTC), Meta:{\"input_id\":\"udp-udp-e455bb9c-05f0-4d42-bd13-60593992bb55\",\"raw_index\":\"logs-udp.generic-ece_wlc_logs\",\"stream_id\":\"udp-udp.generic-e455bb9c-05f0-4d42-bd13-60593992bb55\",\"truncated\":false}, Fields:{\"agent\":{\"ephemeral_id\":\"09bd12a2-0e46-4f6a-9c34-cf67b4a8ae42\",\"id\":\"4ba5a4aa-848d-4435-a13e-ace9584cddaa\",\"name\":\"myhost.my.dom\",\"type\":\"filebeat\",\"version\":\"8.4.2\"},\"client\":{\"mac\":\"22:7c:71:90:17:a5\"},\"data_stream\":{\"dataset\":\"udp.generic\",\"namespace\":\"ece_wlc_logs\",\"type\":\"logs\"},\"ecs\":{\"version\":\"8.0.0\"},\"elastic_agent\":{\"id\":\"4ba5a4aa-848d-4435-a13e-ace9584cddaa\",\"snapshot\":false,\"version\":\"8.4.2\"},\"event\":{\"action\":\"Client Authenticated\",\"dataset\":\"udp.generic\",\"provider\":\"APF-3-AUTHENTICATION_TRAP\",\"timezone\":\"CEST\"},\"input\":{\"type\":\"udp\"},\"log\":{\"logger\":\"haSSOServiceTask5\",\"source\":{\"address\":\"1.1.1.1:32837\"},\"syslog\":{\"hostname\":\"cisco-wlc-mgmt\"}},\"message\":\"\\u003c139\\u003ecisco-wlc-mgmt: *haSSOServiceTask5: Oct 09 09:17:38.079: %APF-3-AUTHENTICATION_TRAP: [PS]apf_80211.c:19558 Client Authenticated: MACAddress:22:7c:71:90:17:a5 Base Radio MAC:00:f2:8b:4d:bf:00 Slot:1 User Name:unknown Ip Address:unknown SSID:public-unibe\",\"oldwlc\":{\"orig\":{}},\"orig\":{\"timestamp\":\"Oct 09 09:17:38.079\"},\"source\":{\"ip\":\"unknown\",\"mac\":\"22:7c:71:90:17:a5\"},\"tags\":[\"wlc\",\"forwarded\",\"_dns_reverse_lookup_failed\"],\"user\":{\"name\":\"unknown\"},\"wlc\":{\"baseradiomac\":\"00:f2:8b:4d:bf:00\",\"ssid\":\"public-unibe\"}}, Private:interface {}(nil), TimeSeries:false}, Flags:0x1, Cache:publisher.EventCache{m:mapstr.M(nil)}} (status=400): {\"type\":\"mapper_parsing_exception\",\"reason\":\"failed to parse field [source.ip] of type [ip] in document with id 'YWObu4MBF-Kn07JyVca4'. Preview of field's value: 'unknown'\",\"caused_by\":{\"type\":\"illegal_argument_exception\",\"reason\":\"'unknown' is not an IP string literal.\"}}, dropping event!",
I think it is important, whatever the explanation given here, to update the Kibana UI to explain these metrics in layman's terms and to give visual feedback when data loss is occurring, together with links to docs on how to deal with it.
+1. We are using Datadog to collect these metrics:
libbeat.output.events.dropped
libbeat.pipeline.events.dropped
libbeat.output.events.failed
libbeat.pipeline.events.failed
So it would be nice to have an explanation about their meaning in the documentation.
There are several metrics reporting output errors from the beat module; clarify their meaning in the docs, focusing on how worrisome these errors are for users. For example, it is not clear whether a non-zero beat.stats.libbeat.output.write.errors implies some data loss, though it probably doesn't if beat.stats.libbeat.output.events.dropped is zero.