jsoriano opened this issue 3 years ago (Open)
The user-facing part of this (i.e. the Kibana interface) could also benefit from a bit more clarification. For instance, in the image below, what should I be worried about? Do I have dropped events that got lost forever without being logged? How does "Output Errors" relate to "Fail Rates"?
After reading the discussion and the docs I still don't know the exact meaning of those metrics in Stack Monitoring. I think we all agree that one of the most important questions a user has is: "Do I lose data or not?" I still can't answer this question. For example: I'm using Packetbeat to gather DNS traffic, and from time to time the "Failed in Pipeline" counter jumps from 0 to 1000, and as stated above I'm asking myself: do I lose those DNS queries or not?
Also, I don't really get the difference between the two graphs "Fail Rates" and "Output Errors". To me it sounds like they both show the same thing, but they probably don't?
I would also like to know the meaning of these values, especially beat.stats.libbeat.output.events.dropped: if that is larger than 0, am I losing data?
Inside the libbeat code, the output Stats struct holds these counters:
type Stats struct {
    //
    // Output event stats
    //
    batches    *monitoring.Uint // total number of batches processed by output
    events     *monitoring.Uint // total number of events processed by output
    acked      *monitoring.Uint // total number of events ACKed by output
    failed     *monitoring.Uint // total number of events failed in output
    active     *monitoring.Uint // events sent and waiting for ACK/fail from output
    duplicates *monitoring.Uint // total number of events reported by the output as duplicates (ID already indexed)
    dropped    *monitoring.Uint // total number of invalid events dropped by the output
    tooMany    *monitoring.Uint // total number of "too many requests" replies from output

    //
    // Output network connection stats
    //
    writeBytes  *monitoring.Uint // total amount of bytes written by output
    writeErrors *monitoring.Uint // total number of errors on write
    readBytes   *monitoring.Uint // total amount of bytes read
    readErrors  *monitoring.Uint // total number of errors while waiting for response on output
}
When you query Elasticsearch for the libbeat stats (see below), Output Errors is derived as the delta between readErrors + writeErrors at the earliest timestamp and readErrors + writeErrors at the latest timestamp. Going by the code comments, then, Output Errors counts failed network writes and reads on the output connection rather than individual events.
The example below uses apm-server as the beat type, but you can replace it to suit your needs.
GET _search
{
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              { "term": { "data_stream.dataset": "beats.stats" } },
              { "term": { "metricset.name": "stats" } },
              { "term": { "type": "beats_stats" } }
            ]
          }
        },
        { "term": { "cluster_uuid": "CLUSTER_UUID" } },
        {
          "range": {
            "beats_stats.timestamp": {
              "format": "epoch_millis",
              "gte": 1665053615330,
              "lte": 1665054515330
            }
          }
        },
        {
          "bool": {
            "must": { "term": { "beats_stats.beat.type": "apm-server" } }
          }
        }
      ]
    }
  },
  "collapse": {
    "field": "beats_stats.metrics.beat.info.ephemeral_id",
    "inner_hits": {
      "name": "earliest",
      "size": 1,
      "sort": [
        { "beats_stats.timestamp": { "order": "asc", "unmapped_type": "long" } },
        { "@timestamp": { "order": "asc", "unmapped_type": "long" } }
      ]
    }
  },
  "sort": [
    { "beats_stats.beat.uuid": { "order": "asc", "unmapped_type": "long" } },
    { "timestamp": { "order": "desc", "unmapped_type": "long" } }
  ]
}
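To make the delta concrete: readErrors and writeErrors are cumulative totals since the Beat started, so the value shown for a time range is simply the latest sum minus the earliest sum. As a hedged sketch, the same numbers can also be pulled per interval with aggregations; the field paths under beats_stats.metrics.libbeat.output.* and the one-minute interval are assumptions here, and counter resets on restarts (new ephemeral_id) are ignored:
GET _search
{
  "size": 0,
  "query": {
    "term": { "beats_stats.beat.type": "apm-server" }
  },
  "aggs": {
    "over_time": {
      "date_histogram": {
        "field": "beats_stats.timestamp",
        "fixed_interval": "1m"
      },
      "aggs": {
        // assumed field paths mirroring the Beat's libbeat.output.write.errors / read.errors counters
        "write_errors": { "max": { "field": "beats_stats.metrics.libbeat.output.write.errors" } },
        "read_errors": { "max": { "field": "beats_stats.metrics.libbeat.output.read.errors" } },
        // per-interval increase of each cumulative counter
        "write_errors_delta": { "derivative": { "buckets_path": "write_errors" } },
        "read_errors_delta": { "derivative": { "buckets_path": "read_errors" } }
      }
    }
  }
}
(The // comments are Kibana Console style; strip them if you send the request with curl.)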
@6fears7 So, do we have event loss or not? I don't have any other errors, like 429.
I've added beat.stats.libbeat.output.events.dropped to my Agent dashboard and I get a daily alarm telling me, e.g.:
monitor filebeat drops - * is in a state of ALERT
Reason: beat.stats.libbeat.output.events.dropped is 3,313,220.83422 in the last 1 day for all hosts. Alert when > 0.
Those 3 million events per day are quite a lot from my customer's point of view.
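Worth noting here, as a hedged aside: libbeat's output.events.dropped is a cumulative total since the Beat started, so a rule that alerts on the raw value "in the last 1 day" is not the same as the number of events dropped that day. A sketch of how to get the per-day increase instead, assuming the documents live in metrics-* and that host.name identifies a single Beat instance (both assumptions; counter resets on restarts are ignored):
GET metrics-*/_search
{
  "size": 0,
  "query": {
    "exists": { "field": "beat.stats.libbeat.output.events.dropped" }
  },
  "aggs": {
    "per_instance": {
      // host.name is an assumed grouping field; use whatever identifies one Beat in your data
      "terms": { "field": "host.name" },
      "aggs": {
        "per_day": {
          "date_histogram": { "field": "@timestamp", "calendar_interval": "1d" },
          "aggs": {
            "dropped_total": { "max": { "field": "beat.stats.libbeat.output.events.dropped" } },
            "dropped_per_day": { "derivative": { "buckets_path": "dropped_total" } }
          }
        }
      }
    }
  }
}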
@6fears7 Well, that still doesn't answer whether I have lost events or not. An "error" doesn't necessarily mean that I have lost events; I don't know whether, after those errors occur, some mechanism retries those faulty events or not. Telling users just "Error" without clearly stating the consequences doesn't help. A graph that says "Events lost" would be clear and helpful, in my opinion.
In the Elasticsearch output's client.go, we start with this struct:
type bulkResultStats struct {
    acked        int // number of events ACKed by Elasticsearch
    duplicates   int // number of events failed with `create` due to ID already being indexed
    fails        int // number of failed events (can be retried)
    nonIndexable int // number of failed events (not indexable)
    tooMany      int // number of events receiving HTTP 429 Too Many Requests
}
Later, we see what constitutes a drop:
failed := len(failedEvents)
span.Context.SetLabel("events_failed", failed)
if st := client.observer; st != nil {
    dropped := stats.nonIndexable
    duplicates := stats.duplicates
    acked := len(data) - failed - dropped - duplicates

    st.Acked(acked)
    st.Failed(failed)
    st.Dropped(dropped)
    st.Duplicate(duplicates)
    st.ErrTooMany(stats.tooMany)
}
So dropped events would be those of the nonIndexable type.
In order to determine what a "nonIndexable" type is, the code iterates through the Bulk results:
...
if status < 500 {
    if status == http.StatusTooManyRequests {
        stats.tooMany++
    } else {
        // hard failure, apply policy action
        result, _ := data[i].Content.Meta.HasKey(dead_letter_marker_field)
        if result {
            stats.nonIndexable++
            client.log.Errorf("Can't deliver to dead letter index event %#v (status=%v): %s", data[i], status, msg)
            // poison pill - this will clog the pipeline if the underlying failure is non transient.
        } else if client.NonIndexableAction == dead_letter_index {
            client.log.Warnf("Cannot index event %#v (status=%v): %s, trying dead letter index", data[i], status, msg)
            if data[i].Content.Meta == nil {
                data[i].Content.Meta = common.MapStr{
                    dead_letter_marker_field: true,
                }
            } else {
                data[i].Content.Meta.Put(dead_letter_marker_field, true)
            }
            data[i].Content.Fields = common.MapStr{
                "message":       data[i].Content.Fields.String(),
                "error.type":    status,
                "error.message": string(msg),
            }
        } else { // drop
            stats.nonIndexable++
            client.log.Warnf("Cannot index event %#v (status=%v): %s, dropping event!", data[i], status, msg)
            continue
        }
So an event that cannot be indexed is either retried against the dead letter index (it gets tagged with dead_letter_marker_field when dead_letter_index is configured as the non-indexable action) or otherwise counted as nonIndexable and dropped from the pipeline.
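To put rough numbers on that accounting (a hypothetical bulk response, assuming dead-letter delivery is not configured): if a batch of 100 events comes back with 2 events rejected with HTTP 429, 3 rejected as duplicates and 5 rejected with mapping errors, then tooMany = 2, duplicates = 3, dropped = 5 (the nonIndexable ones), failed = 2, and acked = 100 - 2 - 5 - 3 = 90. Only the 5 dropped events are actually gone; the 2 failed (429) events stay in failedEvents and get retried, and the 3 duplicates were already indexed, so neither of those means data loss.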
To answer your question, I do believe those events are worth looking into. My first guess would be to check how the events are being parsed by the pipeline and whether some grok or mapping work is needed.
Thank you for clarifying, @6fears7.
I use official pipelines from Fleet integrations but also custom ones. I also use dissect and the rest of the Filebeat processors (a DNS processor is still missing among the Elasticsearch ingest processors). @sachin-frayne Can you help now to further identify those events and/or check the pipelines, as suggested above?
My issue is with:
"beat.stats.libbeat.output.events.dropped": [ 23671 ],
and not with:
"beat.stats.libbeat.pipeline.events.dropped": [ 0 ],
@6fears7 Does this mean that my events are not dropped in the pipelines?
So, I get a lot of such messages:
In the above case they come from an official Fortinet integration, but I also see others from my custom parsing:
"message": "Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Date(2022, time.October, 9, 7, 17, 38, 79000000, time.UTC), Meta:{\"input_id\":\"udp-udp-e455bb9c-05f0-4d42-bd13-60593992bb55\",\"raw_index\":\"logs-udp.generic-ece_wlc_logs\",\"stream_id\":\"udp-udp.generic-e455bb9c-05f0-4d42-bd13-60593992bb55\",\"truncated\":false}, Fields:{\"agent\":{\"ephemeral_id\":\"09bd12a2-0e46-4f6a-9c34-cf67b4a8ae42\",\"id\":\"4ba5a4aa-848d-4435-a13e-ace9584cddaa\",\"name\":\"myhost.my.dom\",\"type\":\"filebeat\",\"version\":\"8.4.2\"},\"client\":{\"mac\":\"22:7c:71:90:17:a5\"},\"data_stream\":{\"dataset\":\"udp.generic\",\"namespace\":\"ece_wlc_logs\",\"type\":\"logs\"},\"ecs\":{\"version\":\"8.0.0\"},\"elastic_agent\":{\"id\":\"4ba5a4aa-848d-4435-a13e-ace9584cddaa\",\"snapshot\":false,\"version\":\"8.4.2\"},\"event\":{\"action\":\"Client Authenticated\",\"dataset\":\"udp.generic\",\"provider\":\"APF-3-AUTHENTICATION_TRAP\",\"timezone\":\"CEST\"},\"input\":{\"type\":\"udp\"},\"log\":{\"logger\":\"haSSOServiceTask5\",\"source\":{\"address\":\"1.1.1.1:32837\"},\"syslog\":{\"hostname\":\"cisco-wlc-mgmt\"}},\"message\":\"\\u003c139\\u003ecisco-wlc-mgmt: *haSSOServiceTask5: Oct 09 09:17:38.079: %APF-3-AUTHENTICATION_TRAP: [PS]apf_80211.c:19558 Client Authenticated: MACAddress:22:7c:71:90:17:a5 Base Radio MAC:00:f2:8b:4d:bf:00 Slot:1 User Name:unknown Ip Address:unknown SSID:public-unibe\",\"oldwlc\":{\"orig\":{}},\"orig\":{\"timestamp\":\"Oct 09 09:17:38.079\"},\"source\":{\"ip\":\"unknown\",\"mac\":\"22:7c:71:90:17:a5\"},\"tags\":[\"wlc\",\"forwarded\",\"_dns_reverse_lookup_failed\"],\"user\":{\"name\":\"unknown\"},\"wlc\":{\"baseradiomac\":\"00:f2:8b:4d:bf:00\",\"ssid\":\"public-unibe\"}}, Private:interface {}(nil), TimeSeries:false}, Flags:0x1, Cache:publisher.EventCache{m:mapstr.M(nil)}} (status=400): {\"type\":\"mapper_parsing_exception\",\"reason\":\"failed to parse field [source.ip] of type [ip] in document with id 'YWObu4MBF-Kn07JyVca4'. Preview of field's value: 'unknown'\",\"caused_by\":{\"type\":\"illegal_argument_exception\",\"reason\":\"'unknown' is not an IP string literal.\"}}, dropping event!",
I think it is important, whatever the explanation given here, to update the Kibana UI to explain these metrics in layman's terms and to give visual feedback when data loss is occurring, together with links to docs on how to deal with it.
+1. We are using Datadog to collect these metrics:
libbeat.output.events.dropped
libbeat.pipeline.events.dropped
libbeat.output.events.failed
libbeat.pipeline.events.failed
So it would be nice to have an explanation about their meaning in the documentation.
There are several metrics reporting output errors from the beat module; clarify their meaning in the docs, focusing on how worrisome these errors are for users. For example, it is not clear whether a non-zero beat.stats.libbeat.output.write.errors implies some data loss, though it probably doesn't if beat.stats.libbeat.output.events.dropped is zero.