elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

Dataloss on daemonset upgrade #35145

Open michalpristas opened 1 year ago

michalpristas commented 1 year ago

Issue

Tested scenario:

script used

#!/bin/sh 
i=0

until [ $i -gt 30000 ]
do
  printf "i: %s\n" "$i"
  i=$((i+1))
  sleep 2
done

also reproducible on 8.6.2->8.7.0

Definition of done

elasticmachine commented 1 year ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

cmacknz commented 1 year ago

Confirmed to be reproducible upgrading from 8.6.2 to 8.7.0

https://kubernetes.io/docs/tasks/manage-daemon/update-daemon-set/

With RollingUpdate update strategy, after you update a DaemonSet template, old DaemonSet pods will be killed, and new DaemonSet pods will be created automatically, in a controlled fashion. At most one pod of the DaemonSet will be running on each node during the whole update process.

The “at most one pod of the DaemonSet will be running” wording implies there must be a period where no Filebeat pod is running on the node, creating a window where logs could be missed. That said, if the logs were persisted, Filebeat should pick them up again once the new pod starts.

During the pod replacement there must be a window where events can be dropped.

pierrehilbert commented 1 year ago

Is the solution to enforce some min requirement for .spec.minReadySeconds and .spec.updateStrategy.rollingUpdate.maxSurge?
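For reference, those knobs live under the DaemonSet spec. A sketch of what such a minimum could look like (the values here are illustrative, not a recommendation from this thread; note that `maxSurge` for DaemonSets requires `maxUnavailable: 0`):

```yaml
apiVersion: apps/v1
kind: DaemonSet
spec:
  minReadySeconds: 30        # new pod must stay Ready this long before the old one counts as replaced
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1            # start the new pod before killing the old one
      maxUnavailable: 0      # required when maxSurge > 0
```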

michalpristas commented 1 year ago

verified this issue is NOT kubernetes specific. it happens on restarts as well (not just upgrades).

when filebeat running on bare metal is restarted i see 2-3 events dropped (generated by the script in the description)

pierrehilbert commented 1 year ago

@belimawr / @rdner could it be that these are events we harvested but had not yet sent to the output, and that were therefore lost when restarting Filebeat?

belimawr commented 1 year ago

The offset of the files on the registry is only updated when the output acknowledges the events, hence they should already be safely stored on ES. log input and filestream input handle that differently, but they should achieve the same final result.
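That ack-driven offset update can be sketched as follows. This is a simplified model of the at-least-once pattern, not Beats' actual registry code; all names here are illustrative:

```go
package main

import "fmt"

// registry models the on-disk state: file path -> last acknowledged offset.
type registry map[string]int64

// event carries the offset the reader reached after emitting this line.
type event struct {
	path   string
	offset int64
}

// ack is called only once the output has confirmed the event was indexed;
// only then does the persisted offset move forward.
func (r registry) ack(e event) {
	if e.offset > r[e.path] {
		r[e.path] = e.offset
	}
}

func main() {
	r := registry{}
	// Two events are read, but only the first is acknowledged before a restart.
	read := []event{{"/var/log/a.log", 100}, {"/var/log/a.log", 180}}
	r.ack(read[0])
	// After restart the reader resumes from the registry offset, so the
	// unacked second event is re-read rather than lost (at-least-once).
	fmt.Println(r["/var/log/a.log"]) // 100
}
```

Under this model a restart can duplicate events, but never drop acknowledged ones, which is why data loss on restart would be surprising.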

On a restart there should be no data loss. That is indeed pretty weird. @michalpristas could you share the filebeat.yml you used to reproduce it with a restart? I'm curious about both:

michalpristas commented 1 year ago

today i moved a bit forward. i did some changes, and what looks like data loss is probably the processor not being active: when i filter documents based on container id or container name i see gaps, but when i don't filter at all and go through all of the documents i can find them, just without metadata attached

gapped document ``` { "_index": ".ds-filebeat-8.9.0-2023.05.31-000001", "_id": "fFwccYgBkmCYM84U6IjN", "_version": 1, "_score": 0, "_source": { "@timestamp": "2023-05-31T09:21:36.956Z", "log": { "file": { "path": "/var/log/containers/sh_default_sh-c22f9c96f1f712ad6ba10dbff9090601ef820470b8a167f1e767580101b0b713.log" }, "offset": 5161 }, "stream": "stdout", "input": { "type": "container" }, "ecs": { "version": "8.0.0" }, "host": { "mac": [ "02-42-D0-4F-EC-6D", "02-50-00-00-00-01", "12-84-0A-3A-C6-FC", "26-76-B8-65-DE-5B", "32-59-E4-32-D2-07", "7A-78-F3-67-EF-9F", "86-0C-AF-E4-B8-38", "EE-38-A8-AA-C7-B4", "F6-01-7D-69-72-A7" ], "name": "docker-desktop", "hostname": "docker-desktop", "architecture": "x86_64", "os": { "codename": "focal", "type": "linux", "platform": "ubuntu", "version": "20.04.6 LTS (Focal Fossa)", "family": "debian", "name": "Ubuntu", "kernel": "5.15.49-linuxkit" }, "containerized": false, "ip": [ "192.168.65.9", "fe80::50:ff:fe00:1", "192.168.65.4", "fe80::f401:7dff:fe69:72a7", "172.17.0.1", "fe80::42:d0ff:fe4f:ec6d", "10.1.0.1", "fe80::3059:e4ff:fe32:d207", "fe80::1084:aff:fe3a:c6fc", "fe80::7878:f3ff:fe67:ef9f", "fe80::ec38:a8ff:feaa:c7b4", "fe80::840c:afff:fee4:b838", "fe80::2476:b8ff:fe65:de5b" ] }, "agent": { "version": "8.9.0", "ephemeral_id": "dbd4202d-afcd-4db7-b7d9-1a8781e8bde7", "id": "f81c4dc3-3780-4df4-9e39-9ef757433dda", "name": "docker-desktop", "type": "filebeat" }, "message": "i: 35" }, "fields": { "host.os.name.text": [ "Ubuntu" ], "host.hostname": [ "docker-desktop" ], "host.mac": [ "02-42-D0-4F-EC-6D", "02-50-00-00-00-01", "12-84-0A-3A-C6-FC", "26-76-B8-65-DE-5B", "32-59-E4-32-D2-07", "7A-78-F3-67-EF-9F", "86-0C-AF-E4-B8-38", "EE-38-A8-AA-C7-B4", "F6-01-7D-69-72-A7" ], "host.ip": [ "192.168.65.9", "fe80::50:ff:fe00:1", "192.168.65.4", "fe80::f401:7dff:fe69:72a7", "172.17.0.1", "fe80::42:d0ff:fe4f:ec6d", "10.1.0.1", "fe80::3059:e4ff:fe32:d207", "fe80::1084:aff:fe3a:c6fc", "fe80::7878:f3ff:fe67:ef9f", "fe80::ec38:a8ff:feaa:c7b4", 
"fe80::840c:afff:fee4:b838", "fe80::2476:b8ff:fe65:de5b" ], "agent.type": [ "filebeat" ], "host.os.version": [ "20.04.6 LTS (Focal Fossa)" ], "stream": [ "stdout" ], "host.os.kernel": [ "5.15.49-linuxkit" ], "host.os.name": [ "Ubuntu" ], "agent.name": [ "docker-desktop" ], "host.name": [ "docker-desktop" ], "host.os.type": [ "linux" ], "host.os.codename": [ "focal" ], "input.type": [ "container" ], "log.offset": [ 5161 ], "message": [ "i: 35" ], "agent.hostname": [ "docker-desktop" ], "host.architecture": [ "x86_64" ], "@timestamp": [ "2023-05-31T09:21:36.956Z" ], "agent.id": [ "f81c4dc3-3780-4df4-9e39-9ef757433dda" ], "host.os.platform": [ "ubuntu" ], "ecs.version": [ "8.0.0" ], "host.containerized": [ false ], "log.file.path": [ "/var/log/containers/sh_default_sh-c22f9c96f1f712ad6ba10dbff9090601ef820470b8a167f1e767580101b0b713.log" ], "agent.ephemeral_id": [ "dbd4202d-afcd-4db7-b7d9-1a8781e8bde7" ], "agent.version": [ "8.9.0" ], "host.os.family": [ "debian" ] } } ```
Document after processors are loaded { ``` "_index": ".ds-filebeat-8.9.0-2023.05.31-000001", "_id": "0lwccYgBkmCYM84U6YjN", "_version": 1, "_score": 0, "_source": { "@timestamp": "2023-05-31T09:21:38.957Z", "stream": "stdout", "input": { "type": "container" }, "orchestrator": { "cluster": { "url": "vm.docker.internal:6443", "name": "kubernetes" } }, "log": { "offset": 5239, "file": { "path": "/var/log/containers/sh_default_sh-c22f9c96f1f712ad6ba10dbff9090601ef820470b8a167f1e767580101b0b713.log" } }, "message": "i: 36", "container": { "id": "c22f9c96f1f712ad6ba10dbff9090601ef820470b8a167f1e767580101b0b713", "runtime": "docker", "image": { "name": "busybox:latest" } }, "kubernetes": { "node": { "name": "docker-desktop", "uid": "c3d8930b-9ee2-437b-a972-3dbddd8aff88", "labels": { "node-role_kubernetes_io/control-plane": "", "node_kubernetes_io/exclude-from-external-load-balancers": "", "beta_kubernetes_io/arch": "amd64", "beta_kubernetes_io/os": "linux", "kubernetes_io/arch": "amd64", "kubernetes_io/hostname": "docker-desktop", "kubernetes_io/os": "linux" }, "hostname": "docker-desktop" }, "pod": { "name": "sh", "uid": "bd5f5f29-5bd3-404f-a978-b60675231d19", "ip": "10.1.0.28" }, "namespace": "default", "namespace_uid": "5398002e-4e40-480f-a1fe-6fc5e4225ba6", "namespace_labels": { "kubernetes_io/metadata_name": "default" }, "labels": { "run": "sh" }, "container": { "name": "sh" } }, "ecs": { "version": "8.0.0" }, "host": { "ip": [... ], "name": "docker-desktop", "mac": [... 
], "hostname": "docker-desktop", "architecture": "x86_64", "os": { "codename": "focal", "type": "linux", "platform": "ubuntu", "version": "20.04.6 LTS (Focal Fossa)", "family": "debian", "name": "Ubuntu", "kernel": "5.15.49-linuxkit" }, "containerized": false }, "agent": { "type": "filebeat", "version": "8.9.0", "ephemeral_id": "dbd4202d-afcd-4db7-b7d9-1a8781e8bde7", "id": "f81c4dc3-3780-4df4-9e39-9ef757433dda", "name": "docker-desktop" } }, "fields": { "orchestrator.cluster.name": [ "kubernetes" ], "kubernetes.node.uid": [ "c3d8930b-9ee2-437b-a972-3dbddd8aff88" ], "kubernetes.namespace_uid": [ "5398002e-4e40-480f-a1fe-6fc5e4225ba6" ], "host.os.name.text": [ "Ubuntu" ], "kubernetes.labels.run": [ "sh" ], "host.hostname": [ "docker-desktop" ], "host.mac": [... ], "kubernetes.node.labels.kubernetes_io/os": [ "linux" ], "container.id": [ "c22f9c96f1f712ad6ba10dbff9090601ef820470b8a167f1e767580101b0b713" ], "container.image.name": [ "busybox:latest" ], "host.os.version": [ "20.04.6 LTS (Focal Fossa)" ], "kubernetes.node.labels.beta_kubernetes_io/os": [ "linux" ], "kubernetes.namespace": [ "default" ], "host.os.name": [ "Ubuntu" ], "agent.name": [ "docker-desktop" ], "host.name": [ "docker-desktop" ], "host.os.type": [ "linux" ], "input.type": [ "container" ], "log.offset": [ 5239 ], "agent.hostname": [ "docker-desktop" ], "host.architecture": [ "x86_64" ], "container.runtime": [ "docker" ], "agent.id": [ "f81c4dc3-3780-4df4-9e39-9ef757433dda" ], "ecs.version": [ "8.0.0" ], "host.containerized": [ false ], "kubernetes.node.labels.node-role_kubernetes_io/control-plane": [ "" ], "agent.version": [ "8.9.0" ], "host.os.family": [ "debian" ], "kubernetes.node.name": [ "docker-desktop" ], "kubernetes.node.hostname": [ "docker-desktop" ], "kubernetes.pod.uid": [ "bd5f5f29-5bd3-404f-a978-b60675231d19" ], "host.ip": [... 
], "agent.type": [ "filebeat" ], "orchestrator.cluster.url": [ "vm.docker.internal:6443" ], "stream": [ "stdout" ], "host.os.kernel": [ "5.15.49-linuxkit" ], "kubernetes.pod.name": [ "sh" ], "kubernetes.pod.ip": [ "10.1.0.28" ], "kubernetes.container.name": [ "sh" ], "host.os.codename": [ "focal" ], "kubernetes.namespace_labels.kubernetes_io/metadata_name": [ "default" ], "message": [ "i: 36" ], "kubernetes.node.labels.kubernetes_io/hostname": [ "docker-desktop" ], "kubernetes.node.labels.beta_kubernetes_io/arch": [ "amd64" ], "@timestamp": [ "2023-05-31T09:21:38.957Z" ], "host.os.platform": [ "ubuntu" ], "log.file.path": [ "/var/log/containers/sh_default_sh-c22f9c96f1f712ad6ba10dbff9090601ef820470b8a167f1e767580101b0b713.log" ], "agent.ephemeral_id": [ "dbd4202d-afcd-4db7-b7d9-1a8781e8bde7" ], "kubernetes.node.labels.kubernetes_io/arch": [ "amd64" ], "kubernetes.node.labels.node_kubernetes_io/exclude-from-external-load-balancers": [ "" ] } } ```

looking into more details, the offset is properly persisted

michalpristas commented 1 year ago

attaching logs. what i don't understand is why i: 34 and i: 35 (which form the gap) appear after the first filebeat is down and before inputs are started in the new instance. logs attached: Untitled discover search.csv

rdner commented 1 year ago

@pierrehilbert based on what @michalpristas described in https://github.com/elastic/beats/issues/35145#issuecomment-1569844220 it might be the processor dropping events due to some error (perhaps it's not ready to accept events due to initialisation in progress).

I believe if a processor returns an error we drop the event. In order to verify this we need to add more logging to the metadata processor and run through the reproduction steps.

michalpristas commented 1 year ago

if the processor were dropping events they should not be in ES, right? the logs i attached are an extract from an ES index

cmacknz commented 1 year ago

container.id is set by the add_docker_metadata processor.

It does have a path where it fails to detect as running on Docker which you should be able to see in Debug logs: https://github.com/elastic/beats/blob/dd1ea21dcd259d8669ca095f6ea852ddf23a134c/libbeat/processors/add_docker_metadata/add_docker_metadata.go#L89

The actual Run function for the processor is also fairly complex with multiple paths that can end up not setting container.id but not much logging about what is happening: https://github.com/elastic/beats/blob/dd1ea21dcd259d8669ca095f6ea852ddf23a134c/libbeat/processors/add_docker_metadata/add_docker_metadata.go#L136

The Kubernetes metadata also seems to be affected, which is the add_kubernetes_metadata processor. This processor is quite complex, see https://github.com/elastic/beats/blob/dd1ea21dcd259d8669ca095f6ea852ddf23a134c/libbeat/processors/add_kubernetes_metadata/kubernetes.go#L142

The add_kubernetes_metadata processor adds several resource watches that might not be getting torn down properly on restart.

michalpristas commented 1 year ago

update

it seems the data loss outside of kubernetes that i thought was data loss was not a problem: the events during filebeat shutdown just came in the opposite order for some reason, but they are there

the data loss in the kubernetes use case is also not data loss. what's happening is that events are being collected, processed and pushed, but the add_kubernetes_metadata processor is not yet ready to enrich them because the information about containers is not there yet

i see a lot of events from these two failure code branches:

index := k.matchers.MetadataIndex(event.Fields)
if index == "" {
    k.log.Error("No container match string, not adding kubernetes data")
    return event, nil
}
metadata := k.cache.get(index)
if metadata == nil {
    k.log.Errorf("Index key %s did not match any of the cached resources", index)
    return event, nil
}

these errors stop after some time.
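The important detail in these branches is that the processor returns `(event, nil)`, so the event continues down the pipeline and is indexed without Kubernetes metadata rather than being dropped. A minimal model of that behaviour (illustrative only, not the real processor code):

```go
package main

import "fmt"

type event map[string]any

// enrich mimics the add_kubernetes_metadata failure branches: on a cache
// miss it logs and returns the event unchanged with a nil error, so the
// pipeline keeps going and the event is indexed without metadata.
func enrich(cache map[string]event, e event) event {
	id, _ := e["container.id"].(string)
	meta, ok := cache[id]
	if !ok {
		fmt.Println("Index key did not match any of the cached resources")
		return e // passed through unenriched, not dropped
	}
	for k, v := range meta {
		e[k] = v
	}
	return e
}

func main() {
	cache := map[string]event{} // watcher cache not warmed up yet
	e := enrich(cache, event{"container.id": "c22f9c", "message": "i: 35"})
	_, hasMeta := e["kubernetes.pod.name"]
	fmt.Println(hasMeta)      // false: indexed, but without kubernetes.* fields
	fmt.Println(e["message"]) // i: 35
}
```

This matches the observation above: the documents exist in ES, they just lack the `kubernetes.*` and `container.*` fields that the filter queries depend on.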

considering these 3 documents

Pre restart ``` { "_id": "_JkRdogB4JDF0-Lqf_wC", "_index": ".ds-filebeat-8.6.2-2023.06.01-000001", "_score": 0, "_source": { "@timestamp": "2023-06-01T08:27:16.977Z", "kubernetes": { "namespace_labels": { "kubernetes_io/metadata_name": "default" }, "labels": { "run": "sh" }, "container": { "name": "sh" }, "node": { "name": "docker-desktop", "uid": "36d1929a-41b5-44c5-aba8-f8d876bfb55c", "labels": { "kubernetes_io/hostname": "docker-desktop", "kubernetes_io/os": "linux", "node-role_kubernetes_io/control-plane": "", "node_kubernetes_io/exclude-from-external-load-balancers": "", "beta_kubernetes_io/arch": "arm64", "beta_kubernetes_io/os": "linux", "kubernetes_io/arch": "arm64" }, "hostname": "docker-desktop" }, "pod": { "name": "sh", "uid": "ce870776-7a49-4625-b1d2-7db7ee5f5561", "ip": "10.1.4.0" }, "namespace": "default", "namespace_uid": "94aed3b2-321e-49b9-a99b-fb112351870e" }, "orchestrator": { "cluster": { "name": "kubernetes", "url": "kubernetes.docker.internal:6443" } }, "ecs": { "version": "8.0.0" }, "host": { "name": "docker-desktop", "architecture": "aarch64", "os": { "kernel": "5.15.49-linuxkit", "codename": "focal", "type": "linux", "platform": "ubuntu", "version": "20.04.5 LTS (Focal Fossa)", "family": "debian", "name": "Ubuntu" }, "containerized": false, "ip": [ "..." ], "mac": [ "..." 
], "hostname": "docker-desktop" }, "agent": { "id": "8d4dc8b4-9d51-47c3-a990-778eaa319196", "name": "docker-desktop", "type": "filebeat", "version": "8.6.2", "ephemeral_id": "88b3b041-b7fb-475d-827f-c001dc944c6b" }, "log": { "offset": 14444, "file": { "path": "/var/log/containers/sh_default_sh-a36726e24b6524250876ca758c16e7aa081bea0da847f2af6a2cf69af919166b.log" } }, "message": "i: 71", "container": { "image": { "name": "busybox:latest" }, "id": "a36726e24b6524250876ca758c16e7aa081bea0da847f2af6a2cf69af919166b", "runtime": "docker" }, "stream": "stdout", "input": { "type": "container" } }, "_version": 1, "fields": { "kubernetes.node.uid": [ "36d1929a-41b5-44c5-aba8-f8d876bfb55c" ], "orchestrator.cluster.name": [ "kubernetes" ], "kubernetes.namespace_uid": [ "94aed3b2-321e-49b9-a99b-fb112351870e" ], "host.os.name.text": [ "Ubuntu" ], "host.hostname": [ "docker-desktop" ], "kubernetes.labels.run": [ "sh" ], "kubernetes.node.labels.kubernetes_io/os": [ "linux" ], "host.mac": [ "..." ], "container.id": [ "a36726e24b6524250876ca758c16e7aa081bea0da847f2af6a2cf69af919166b" ], "container.image.name": [ "busybox:latest" ], "host.os.version": [ "20.04.5 LTS (Focal Fossa)" ], "kubernetes.namespace": [ "default" ], "kubernetes.node.labels.beta_kubernetes_io/os": [ "linux" ], "host.os.name": [ "Ubuntu" ], "agent.name": [ "docker-desktop" ], "host.name": [ "docker-desktop" ], "host.os.type": [ "linux" ], "input.type": [ "container" ], "log.offset": [ 14444 ], "agent.hostname": [ "docker-desktop" ], "host.architecture": [ "aarch64" ], "container.runtime": [ "docker" ], "agent.id": [ "8d4dc8b4-9d51-47c3-a990-778eaa319196" ], "ecs.version": [ "8.0.0" ], "host.containerized": [ false ], "kubernetes.node.labels.node-role_kubernetes_io/control-plane": [ "" ], "agent.version": [ "8.6.2" ], "host.os.family": [ "debian" ], "kubernetes.node.name": [ "docker-desktop" ], "kubernetes.node.hostname": [ "docker-desktop" ], "kubernetes.pod.uid": [ "ce870776-7a49-4625-b1d2-7db7ee5f5561" ], 
"host.ip": [ "..." ], "agent.type": [ "filebeat" ], "orchestrator.cluster.url": [ "kubernetes.docker.internal:6443" ], "stream": [ "stdout" ], "host.os.kernel": [ "5.15.49-linuxkit" ], "kubernetes.pod.name": [ "sh" ], "kubernetes.pod.ip": [ "10.1.4.0" ], "kubernetes.container.name": [ "sh" ], "host.os.codename": [ "focal" ], "kubernetes.namespace_labels.kubernetes_io/metadata_name": [ "default" ], "message": [ "i: 71" ], "kubernetes.node.labels.kubernetes_io/hostname": [ "docker-desktop" ], "kubernetes.node.labels.beta_kubernetes_io/arch": [ "arm64" ], "@timestamp": [ "2023-06-01T08:27:16.977Z" ], "host.os.platform": [ "ubuntu" ], "log.file.path": [ "/var/log/containers/sh_default_sh-a36726e24b6524250876ca758c16e7aa081bea0da847f2af6a2cf69af919166b.log" ], "agent.ephemeral_id": [ "88b3b041-b7fb-475d-827f-c001dc944c6b" ], "kubernetes.node.labels.kubernetes_io/arch": [ "arm64" ], "kubernetes.node.labels.node_kubernetes_io/exclude-from-external-load-balancers": [ "" ] } } ```
During restart ``` { "_id": "1-0RdogBu5qA32-zly1B", "_index": ".ds-filebeat-8.6.2-2023.06.01-000001", "_score": 0, "_source": { "@timestamp": "2023-06-01T08:27:18.980Z", "message": "i: 72", "input": { "type": "container" }, "ecs": { "version": "8.0.0" }, "host": { "name": "docker-desktop", "hostname": "docker-desktop", "architecture": "aarch64", "os": { "kernel": "5.15.49-linuxkit", "codename": "focal", "type": "linux", "platform": "ubuntu", "version": "20.04.5 LTS (Focal Fossa)", "family": "debian", "name": "Ubuntu" }, "containerized": false, "ip": [ "..." ], "mac": [ "..." ] }, "agent": { "type": "filebeat", "version": "8.6.2", "ephemeral_id": "8ce6c900-20f3-4100-8d8d-5add650f7a29", "id": "8d4dc8b4-9d51-47c3-a990-778eaa319196", "name": "docker-desktop" }, "log": { "file": { "path": "/var/log/containers/sh_default_sh-a36726e24b6524250876ca758c16e7aa081bea0da847f2af6a2cf69af919166b.log" }, "offset": 14522 }, "stream": "stdout" }, "_version": 1, "fields": { "host.os.name.text": [ "Ubuntu" ], "host.hostname": [ "docker-desktop" ], "host.mac": [ "..." ], "host.ip": [ "..." 
], "agent.type": [ "filebeat" ], "host.os.version": [ "20.04.5 LTS (Focal Fossa)" ], "stream": [ "stdout" ], "host.os.kernel": [ "5.15.49-linuxkit" ], "host.os.name": [ "Ubuntu" ], "agent.name": [ "docker-desktop" ], "host.name": [ "docker-desktop" ], "host.os.type": [ "linux" ], "host.os.codename": [ "focal" ], "input.type": [ "container" ], "log.offset": [ 14522 ], "agent.hostname": [ "docker-desktop" ], "message": [ "i: 72" ], "host.architecture": [ "aarch64" ], "@timestamp": [ "2023-06-01T08:27:18.980Z" ], "agent.id": [ "8d4dc8b4-9d51-47c3-a990-778eaa319196" ], "host.os.platform": [ "ubuntu" ], "ecs.version": [ "8.0.0" ], "host.containerized": [ false ], "log.file.path": [ "/var/log/containers/sh_default_sh-a36726e24b6524250876ca758c16e7aa081bea0da847f2af6a2cf69af919166b.log" ], "agent.ephemeral_id": [ "8ce6c900-20f3-4100-8d8d-5add650f7a29" ], "agent.version": [ "8.6.2" ], "host.os.family": [ "debian" ] } } ```
after restart ``` { "_id": "7-0RdogBu5qA32-zmC3L", "_index": ".ds-filebeat-8.6.2-2023.06.01-000001", "_score": 0, "_source": { "@timestamp": "2023-06-01T08:27:22.986Z", "kubernetes": { "labels": { "run": "sh" }, "container": { "name": "sh" }, "node": { "name": "docker-desktop", "uid": "36d1929a-41b5-44c5-aba8-f8d876bfb55c", "labels": { "node-role_kubernetes_io/control-plane": "", "node_kubernetes_io/exclude-from-external-load-balancers": "", "beta_kubernetes_io/arch": "arm64", "beta_kubernetes_io/os": "linux", "kubernetes_io/arch": "arm64", "kubernetes_io/hostname": "docker-desktop", "kubernetes_io/os": "linux" }, "hostname": "docker-desktop" }, "pod": { "uid": "ce870776-7a49-4625-b1d2-7db7ee5f5561", "ip": "10.1.4.0", "name": "sh" }, "namespace": "default", "namespace_uid": "94aed3b2-321e-49b9-a99b-fb112351870e", "namespace_labels": { "kubernetes_io/metadata_name": "default" } }, "log": { "offset": 14678, "file": { "path": "/var/log/containers/sh_default_sh-a36726e24b6524250876ca758c16e7aa081bea0da847f2af6a2cf69af919166b.log" } }, "stream": "stdout", "input": { "type": "container" }, "container": { "image": { "name": "busybox:latest" }, "id": "a36726e24b6524250876ca758c16e7aa081bea0da847f2af6a2cf69af919166b", "runtime": "docker" }, "orchestrator": { "cluster": { "url": "kubernetes.docker.internal:6443", "name": "kubernetes" } }, "agent": { "version": "8.6.2", "ephemeral_id": "8ce6c900-20f3-4100-8d8d-5add650f7a29", "id": "8d4dc8b4-9d51-47c3-a990-778eaa319196", "name": "docker-desktop", "type": "filebeat" }, "ecs": { "version": "8.0.0" }, "host": { "hostname": "docker-desktop", "architecture": "aarch64", "os": { "name": "Ubuntu", "kernel": "5.15.49-linuxkit", "codename": "focal", "type": "linux", "platform": "ubuntu", "version": "20.04.5 LTS (Focal Fossa)", "family": "debian" }, "containerized": false, "name": "docker-desktop", "ip": [ "..." ], "mac": [ "..." 
] }, "message": "i: 74" }, "_version": 1, "fields": { "orchestrator.cluster.name": [ "kubernetes" ], "kubernetes.node.uid": [ "36d1929a-41b5-44c5-aba8-f8d876bfb55c" ], "kubernetes.namespace_uid": [ "94aed3b2-321e-49b9-a99b-fb112351870e" ], "host.os.name.text": [ "Ubuntu" ], "kubernetes.labels.run": [ "sh" ], "host.hostname": [ "docker-desktop" ], "kubernetes.node.labels.kubernetes_io/os": [ "linux" ], "host.mac": [ "..." ], "container.id": [ "a36726e24b6524250876ca758c16e7aa081bea0da847f2af6a2cf69af919166b" ], "container.image.name": [ "busybox:latest" ], "host.os.version": [ "20.04.5 LTS (Focal Fossa)" ], "kubernetes.namespace": [ "default" ], "kubernetes.node.labels.beta_kubernetes_io/os": [ "linux" ], "host.os.name": [ "Ubuntu" ], "agent.name": [ "docker-desktop" ], "host.name": [ "docker-desktop" ], "host.os.type": [ "linux" ], "input.type": [ "container" ], "log.offset": [ 14678 ], "agent.hostname": [ "docker-desktop" ], "host.architecture": [ "aarch64" ], "container.runtime": [ "docker" ], "agent.id": [ "8d4dc8b4-9d51-47c3-a990-778eaa319196" ], "ecs.version": [ "8.0.0" ], "host.containerized": [ false ], "kubernetes.node.labels.node-role_kubernetes_io/control-plane": [ "" ], "agent.version": [ "8.6.2" ], "host.os.family": [ "debian" ], "kubernetes.node.name": [ "docker-desktop" ], "kubernetes.node.hostname": [ "docker-desktop" ], "kubernetes.pod.uid": [ "ce870776-7a49-4625-b1d2-7db7ee5f5561" ], "host.ip": [ "..." 
], "orchestrator.cluster.url": [ "kubernetes.docker.internal:6443" ], "agent.type": [ "filebeat" ], "stream": [ "stdout" ], "host.os.kernel": [ "5.15.49-linuxkit" ], "kubernetes.pod.name": [ "sh" ], "kubernetes.pod.ip": [ "10.1.4.0" ], "kubernetes.container.name": [ "sh" ], "host.os.codename": [ "focal" ], "kubernetes.namespace_labels.kubernetes_io/metadata_name": [ "default" ], "message": [ "i: 74" ], "kubernetes.node.labels.kubernetes_io/hostname": [ "docker-desktop" ], "kubernetes.node.labels.beta_kubernetes_io/arch": [ "arm64" ], "@timestamp": [ "2023-06-01T08:27:22.986Z" ], "host.os.platform": [ "ubuntu" ], "log.file.path": [ "/var/log/containers/sh_default_sh-a36726e24b6524250876ca758c16e7aa081bea0da847f2af6a2cf69af919166b.log" ], "agent.ephemeral_id": [ "8ce6c900-20f3-4100-8d8d-5add650f7a29" ], "kubernetes.node.labels.kubernetes_io/arch": [ "arm64" ], "kubernetes.node.labels.node_kubernetes_io/exclude-from-external-load-balancers": [ "" ] } } ```

looking at the ephemeral id there's no doubt that the 'gapped' document was processed by the new filebeat: the first document has "ephemeral_id": "88b3b041-b7fb-475d-827f-c001dc944c6b", the other two have "ephemeral_id": "8ce6c900-20f3-4100-8d8d-5add650f7a29"

as this is not a data loss issue i'd decrease the priority on this one, and as it's add_kubernetes_metadata related i'd like to pass it on to the processor owners. my way of fixing this would be to block init until k8s is available and then force a pod information reload in some way. not sure if and how this solution would backfire.

jlind23 commented 1 year ago

Thanks @michalpristas for the update. @bturquet could you please take this over as it is related to the add_kubernetes_metadata processor?

gizas commented 1 year ago

Thank you for the details here. I am gonna try to reproduce it. One question for the scenario @michalpristas , in order to understand the flow:

You create a new log file while the add_kubernetes_metadata processor is not yet initialised. During this time, because the metadata enrichment is failing, we also fail to enrich the events and we drop them. Is that right? And by dropping, I mean that we don't retry to read them again. Do I understand correctly?

michalpristas commented 1 year ago

my flow is:

if you use a filter on image.name or something similar to filter out busybox entries, there will be a gap of 1-3 events. if you remove the filter and watch all events during the time of the restart, the events ARE present. they are not dropped, they are just lacking kubernetes metadata: they were processed, we just hit the conditions mentioned in my previous comment inside the add_kubernetes_metadata processor
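One way to surface those unenriched documents directly is to query for container-input events that lack the Kubernetes fields. A hedged sketch of such an Elasticsearch query body (field names taken from the documents above):

```json
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "input.type": "container" } }
      ],
      "must_not": [
        { "exists": { "field": "kubernetes.container.name" } }
      ]
    }
  }
}
```

During a restart this should match the 'gapped' documents, while a filter on `container.image.name` or `kubernetes.container.name` would miss them entirely.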