michalpristas opened this issue 1 year ago (status: Open)
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
Confirmed to be reproducible upgrading from 8.6.2 to 8.7.0
https://kubernetes.io/docs/tasks/manage-daemon/update-daemon-set/
With RollingUpdate update strategy, after you update a DaemonSet template, old DaemonSet pods will be killed, and new DaemonSet pods will be created automatically, in a controlled fashion. At most one pod of the DaemonSet will be running on each node during the whole update process.
The “at most one pod of the DaemonSet will be running” implies there must be a period where no Filebeat pod is running on the node, creating a window where logs could be missed. That said, if the logs were persisted, Filebeat should pick them up once the new pod starts.
During the pod replacement there must be a window where events can be dropped.
Is the solution to enforce some minimum requirement for `.spec.minReadySeconds` and `.spec.updateStrategy.rollingUpdate.maxSurge`?
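For illustration, a minimal sketch of what that could look like in the Filebeat DaemonSet spec. The field names come from the Kubernetes docs linked above; the concrete values are illustrative assumptions, not tested recommendations, and DaemonSet `maxSurge` requires a reasonably recent Kubernetes version:

```yaml
# Relevant excerpt of a DaemonSet spec; values are illustrative only.
spec:
  minReadySeconds: 10              # new pod must be Ready for 10s before the rollout continues
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                  # start the replacement pod before terminating the old one
      maxUnavailable: 0            # keep maxUnavailable at 0 so a node is never left without a pod
```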
Verified this issue is NOT Kubernetes specific; it happens on restarts as well (not just upgrades).
When Filebeat running on bare metal is restarted, I see 2-3 events dropped (generated by the script in the description).
@belimawr / @rdner could it be events we harvested but had not yet sent to the output, and that are therefore lost when we restart Filebeat?
The offset of the files on the registry is only updated when the output acknowledges the events, hence they should already be safely stored on ES. The `log` input and the `filestream` input handle that differently, but they should achieve the same final result.
On a restart there should be no data loss. That is indeed pretty weird. @michalpristas could you share the `filebeat.yml` you used to reproduce it with a restart? I'm curious about both:
Today I moved a bit forward. I did some changes, and what seems like data loss is probably the processor not being active: when I filter documents based on container ID or container name I see gaps, but when I don't filter at all and go through all of the documents, I can find them, just without metadata attached.
Looking into more detail, the offset is properly persisted.
Attaching logs.
What I don't understand is why `i: 34` and `i: 35` (which form the gap) appear after the first Filebeat is down and before the inputs are started in the new instance.
Logs attached: Untitled discover search.csv
@pierrehilbert based on what @michalpristas described in https://github.com/elastic/beats/issues/35145#issuecomment-1569844220 it might be the processor dropping events due to some error (perhaps it's not ready to accept events because initialisation is still in progress).
I believe if a processor returns an error we drop the event. In order to verify this we need to add more logging to the metadata processor and run through the reproduction steps.
If the processor is dropping the event, it should not be in ES, right? The logs I attached are an extract from an ES index.
`container.id` is set by the `add_docker_metadata` processor.
It does have a path where it fails to detect that it is running on Docker, which you should be able to see in Debug logs: https://github.com/elastic/beats/blob/dd1ea21dcd259d8669ca095f6ea852ddf23a134c/libbeat/processors/add_docker_metadata/add_docker_metadata.go#L89
The actual `Run` function for the processor is also fairly complex, with multiple paths that can end up not setting `container.id` but not much logging about what is happening: https://github.com/elastic/beats/blob/dd1ea21dcd259d8669ca095f6ea852ddf23a134c/libbeat/processors/add_docker_metadata/add_docker_metadata.go#L136
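As a side note, to make those code paths visible while reproducing, debug logging can be enabled in `filebeat.yml`. A sketch; the selector names are assumptions, and `logging.selectors` can simply be omitted to get all debug output:

```yaml
# Sketch: enable debug logging while reproducing the issue.
logging.level: debug
# Selector names are assumptions; drop this line to log everything at debug level.
logging.selectors: ["add_docker_metadata", "kubernetes"]
```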
The Kubernetes metadata also seems to be affected, which comes from the `add_kubernetes_metadata` processor. This processor is quite complex, see https://github.com/elastic/beats/blob/dd1ea21dcd259d8669ca095f6ea852ddf23a134c/libbeat/processors/add_kubernetes_metadata/kubernetes.go#L142
The `add_kubernetes_metadata` processor adds several resource watches that might not be getting torn down properly on restart.
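For context, in the reference `filebeat-kubernetes.yaml` manifest the processor is attached to the container input roughly like this (reproduced from memory, so treat it as an approximation). The `logs_path` matcher extracts the container ID from the log file path and uses it as the lookup key into the processor's cache of watched pods; if the watcher has not populated the cache yet, the event passes through without metadata, which matches the failure branches quoted further below:

```yaml
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log
    processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: "/var/log/containers/"
```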
update
It seems the data loss I thought I was seeing outside of Kubernetes was not a problem; the events emitted during Filebeat shutdown just came in the opposite order for some reason, but they are there.
The data loss in the Kubernetes use case is also not data loss. What's happening is that events are being collected, processed and pushed, but the `add_kubernetes_metadata` processor is not ready to enrich them because the information about the containers is not there yet.
I see a lot of events from these two failure code branches:
```go
index := k.matchers.MetadataIndex(event.Fields)
if index == "" {
    k.log.Error("No container match string, not adding kubernetes data")
    return event, nil
}

metadata := k.cache.get(index)
if metadata == nil {
    k.log.Error("Index key %s did not match any of the cached resources", index)
    return event, nil
}
```
These errors stop after some time.
Considering these 3 documents, and looking at the ephemeral ID, there's no doubt that a 'gapped' document was processed by the new Filebeat:
- for the first document: `"ephemeral_id": "88b3b041-b7fb-475d-827f-c001dc944c6b"`
- for the other two: `"ephemeral_id": "8ce6c900-20f3-4100-8d8d-5add650f7a29"`
As this is not a data loss issue I'd decrease the priority on this one, and as it's `add_kubernetes_metadata` related I'd like to pass it on to the processor owners. My way of fixing this would be to block init until k8s is available and then force a pod information reload in some way; not sure if and how this solution would backfire.
Thanks @michalpristas for the update.
@bturquet could you please take this over, as it is related to the `add_kubernetes_metadata` processor?
Thank you for the details here. I am going to try to reproduce it. One question about the scenario, @michalpristas, in order to understand the flow:
You create a new log file while the `add_kubernetes_metadata` processor is not yet initialised, and during this time, because the metadata enrichment is failing, we also fail to enrich the file and we drop it. Is that right? And by dropping, I mean that we don't retry to read it again. Do I understand correctly?
My flow is:
1. `curl -L -O https://raw.githubusercontent.com/elastic/beats/8.7/deploy/kubernetes/filebeat-kubernetes.yaml`
2. `kubectl run -it --image=busybox -- sh`, create the script file mentioned in the description and keep it running (a hypothetical version of such a generator is sketched after this list).
3. If you use a filter for `image.name` or something similar to filter out the busybox entries, there will be a gap of 1-3 events.
4. If you remove the filter and watch all events during the time of the restarts, the events ARE present. They are not dropped; they are just lacking the Kubernetes metadata. So they were processed, we just hit the condition mentioned in my previous comment inside the `add_kubernetes_metadata` processor.
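For step 2, the exact script is only in the issue description; a hypothetical stand-in that would produce the `i: N` counter lines seen above could be an equivalent declarative pod instead of the interactive `kubectl run` (pod name, image and interval are assumptions):

```yaml
# Hypothetical log generator approximating the script from the issue description.
apiVersion: v1
kind: Pod
metadata:
  name: log-generator
spec:
  containers:
    - name: counter
      image: busybox
      command: ["sh", "-c"]
      args:
        - |
          i=0
          while true; do
            echo "i: $i"
            i=$((i + 1))
            sleep 1
          done
```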
Issue
Tested scenario:
1. `curl -L -O https://raw.githubusercontent.com/elastic/beats/8.7/deploy/kubernetes/filebeat-kubernetes.yaml`
2. `kubectl create -f filebeat-kubernetes.yaml`
3. `kubectl apply -f filebeat-kubernetes.yaml`
script used
Also reproducible on 8.6.2 -> 8.7.0.
Definition of done