elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats
Other
94 stars 4.92k forks source link

Make Autodiscover more selective when watching K8s events #34717

Closed gbanasiak closed 9 months ago

gbanasiak commented 1 year ago

Setup

Observed in Filebeat 8.5.3 running in Azure Kubernetes v1.24.6 with the following autodiscover configuration:

``` filebeat: autodiscover: providers: - templates: - condition: or: - equals: kubernetes: container: name: elasticsearch - equals: kubernetes: container: name: kibana config: - paths: - /var/log/pods/${data.kubernetes.namespace}_${data.kubernetes.pod.name}_${data.kubernetes.pod.uid}/${data.kubernetes.container.name}/*.log processors: - decode_json_fields: add_error_key: true fields: - message max_depth: 1 overwrite_keys: true process_array: false target: "" type: container type: kubernetes ```

Context

While addressing https://github.com/elastic/beats/issues/23139 Node and Namespace watchers were added in Autodicover to make Node/Namespace metadata changes instantly available to the related Pods. Currently, when a Node modification is received via a watcher, all Pods located on this Node are retrieved from a watcher store (src) and stop and start Pod events are emitted (src). A single Pod can emit multiple start and stop events (src). This happens regardless of the nature of the change of the Node object.

In the environment specified in Setup section, it was observed that Node objects can be updated as frequently as every 10 seconds. Filebeat logs do not allow to determine what exactly was received via watch APIs, but this can be verified as follows:

kubectl proxy &
stdbuf -oL curl -s '127.0.0.1:8001/api/v1/namespaces?allowWatchBookmarks=true&watch=true&pretty=false' > namespaces.log &
stdbuf -oL curl -s '127.0.0.1:8001/api/v1/nodes?allowWatchBookmarks=true&watch=true&pretty=false' > nodes.log &
stdbuf -oL curl -s '127.0.0.1:8001/api/v1/pods?allowWatchBookmarks=true&watch=true&pretty=false' > pods.log &

Node objects were modified in .status.conditions[] array which is related to Node monitoring process. The only field modified was lastHeartbeatTime. Here's an example of message received via watch API:

``` { "type": "MODIFIED", "object": { "kind": "Node", "apiVersion": "v1", "metadata": { "name": "", ... }, "spec": { ... }, "status": { "capacity": { ... }, "allocatable": { ... }, "conditions": [ { "type": "FilesystemCorruptionProblem", "status": "False", "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE "lastTransitionTime": "2023-02-09T07:41:54Z", "reason": "FilesystemIsOK", "message": "Filesystem is healthy" }, { "type": "ContainerRuntimeProblem", "status": "False", "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE "lastTransitionTime": "2023-02-10T20:37:25Z", "reason": "ContainerRuntimeIsUp", "message": "container runtime service is up" }, { "type": "KubeletProblem", "status": "False", "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE "lastTransitionTime": "2023-02-10T20:37:25Z", "reason": "KubeletIsUp", "message": "kubelet service is up" }, { "type": "FrequentDockerRestart", "status": "False", "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE "lastTransitionTime": "2023-02-09T07:41:54Z", "reason": "NoFrequentDockerRestart", "message": "docker is functioning properly" }, { "type": "FrequentUnregisterNetDevice", "status": "False", "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE "lastTransitionTime": "2023-02-09T07:41:54Z", "reason": "NoFrequentUnregisterNetDevice", "message": "node is functioning properly" }, { "type": "VMEventScheduled", "status": "False", "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE "lastTransitionTime": "2023-02-10T20:37:45Z", "reason": "NoVMEventScheduled", "message": "VM has no scheduled event" }, { "type": "ReadonlyFilesystem", "status": "False", "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE "lastTransitionTime": "2023-02-09T07:41:54Z", "reason": "FilesystemIsNotReadOnly", "message": "Filesystem is not read-only" }, { "type": "FrequentContainerdRestart", "status": "False", "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE "lastTransitionTime": "2023-02-09T07:41:54Z", "reason": "NoFrequentContainerdRestart", "message": "containerd is functioning properly" }, { "type": "KernelDeadlock", "status": "False", "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE "lastTransitionTime": "2023-02-09T07:41:54Z", "reason": "KernelHasNoDeadlock", "message": "kernel has no deadlock" }, { "type": "FrequentKubeletRestart", "status": "False", "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE "lastTransitionTime": "2023-02-09T07:41:54Z", "reason": "NoFrequentKubeletRestart", "message": "kubelet is functioning properly" }, { "type": "MemoryPressure", "status": "False", "lastHeartbeatTime": "2023-02-27T08:15:15Z", <--- HERE "lastTransitionTime": "2023-02-10T20:38:18Z", "reason": "KubeletHasSufficientMemory", "message": "kubelet has sufficient memory available" }, { "type": "DiskPressure", "status": "False", "lastHeartbeatTime": "2023-02-27T08:15:15Z", <--- HERE "lastTransitionTime": "2023-02-10T20:38:18Z", "reason": "KubeletHasNoDiskPressure", "message": "kubelet has no disk pressure" }, { "type": "PIDPressure", "status": "False", "lastHeartbeatTime": "2023-02-27T08:15:15Z", <--- HERE "lastTransitionTime": "2023-02-10T20:38:18Z", "reason": "KubeletHasSufficientPID", "message": "kubelet has sufficient PID available" }, { "type": "Ready", "status": "True", "lastHeartbeatTime": "2023-02-27T08:15:15Z", <--- HERE "lastTransitionTime": "2023-02-10T20:38:18Z", "reason": "KubeletReady", "message": "kubelet is posting ready status. AppArmor enabled" } ], "addresses": [ ... ], "daemonEndpoints": { ... }, "nodeInfo": { ... }, "images": [ ... ], "volumesInUse": [ ... ], "volumesAttached": [ ... ] } } } ```

As Node metadata is not modified, the entire configuration reload process that follows is unnecessary.

Here's an excerpt from Filebeat debug log that shows that every 10s a sequence of 4 stop and 4 start events is emitted, just for a single Pod (there are many Pods on a single Node, so this is multiplied):

2023-02-14T12:03:02.406Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.406Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.407Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.407Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.604Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.604Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.605Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.607Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.091Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.092Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.093Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.093Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.099Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.100Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.101Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.102Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.415Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.415Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.416Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.417Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.715Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.716Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.717Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.718Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984

With the above template configuration most of these start/stop should be ignored, but not events for Kibana Pods. Here is an interesting debug log excerpt which shows a sequence of stop and start events that lead to a runner error potentially caused by log input race condition (see https://github.com/elastic/beats/issues/34388#issuecomment-1439816785). Once such runner error occurs Autodiscover worker initiates a configuration reload on every received event (src).

2023-02-14T12:03:02.406Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.406Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.406Z                       cfgfile/list.go  64      Starting reload procedure, current runners: 3   
2023-02-14T12:03:02.406Z                       cfgfile/list.go  82      Start list: 0, Stop list: 1     
2023-02-14T12:03:02.407Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.407Z                       cfgfile/list.go  64      Starting reload procedure, current runners: 2   
2023-02-14T12:03:02.407Z                       cfgfile/list.go  82      Start list: 0, Stop list: 1     
2023-02-14T12:03:02.407Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.407Z                       cfgfile/list.go  64      Starting reload procedure, current runners: 1   
2023-02-14T12:03:02.407Z                       cfgfile/list.go  82      Start list: 0, Stop list: 1     
2023-02-14T12:03:02.604Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.604Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.605Z                       cfgfile/list.go  64      Starting reload procedure, current runners: 0   
2023-02-14T12:03:02.605Z                       cfgfile/list.go  82      Start list: 1, Stop list: 0     
2023-02-14T12:03:02.605Z                       cfgfile/list.go  107     Error creating runner from config: Can only start an input when all related states are finished: {Id: native::4927659-2049, Finished: false, Fileinfo: &{0.log 31895141 416 {696363713 63811972980 0xaaaad7612520} {2049 4927659 33184 1 0 0 0 0 31895141 4096 0 62304 {1676357648 659620958} {1676376180 696363713} {1676376180 696363713} [0 0]}}, Source: /var/log/pods/elastic_esdevazk8swe1-3-kb-84795bfcd6-cd984_340bb25a-a329-4d4b-b934-e8605c3d4ee8/kibana/0.log, Offset: 29901889, Timestamp: 2023-02-14 12:03:01.723335032 +0000 UTC m=+149.690897691, TTL: -1ns, Type: container, Meta: map[], FileStateOS: 4927659-2049}    
2023-02-14T12:03:02.605Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.606Z          autodiscover/autodiscover.go  156     Reloading existing autodiscover configs after error     
2023-02-14T12:03:02.606Z                       cfgfile/list.go  64      Starting reload procedure, current runners: 0   
2023-02-14T12:03:02.606Z                       cfgfile/list.go  82      Start list: 2, Stop list: 0     
2023-02-14T12:03:02.607Z                       cfgfile/list.go  107     Error creating runner from config: Can only start an input when all related states are finished: {Id: native::4927659-2049, Finished: false, Fileinfo: &{0.log 31895141 416 {696363713 63811972980 0xaaaad7612520} {2049 4927659 33184 1 0 0 0 0 31895141 4096 0 62304 {1676357648 659620958} {1676376180 696363713} {1676376180 696363713} [0 0]}}, Source: /var/log/pods/elastic_esdevazk8swe1-3-kb-84795bfcd6-cd984_340bb25a-a329-4d4b-b934-e8605c3d4ee8/kibana/0.log, Offset: 29901889, Timestamp: 2023-02-14 12:03:01.723335032 +0000 UTC m=+149.690897691, TTL: -1ns, Type: container, Meta: map[], FileStateOS: 4927659-2049}    
2023-02-14T12:03:02.607Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.608Z          autodiscover/autodiscover.go  156     Reloading existing autodiscover configs after error     
2023-02-14T12:03:02.608Z                       cfgfile/list.go  64      Starting reload procedure, current runners: 1   
2023-02-14T12:03:02.608Z                       cfgfile/list.go  82      Start list: 2, Stop list: 0     
2023-02-14T12:03:02.609Z                       cfgfile/list.go  107     Error creating runner from config: Can only start an input when all related states are finished: {Id: native::4927659-2049, Finished: false, Fileinfo: &{0.log 31895141 416 {696363713 63811972980 0xaaaad7612520} {2049 4927659 33184 1 0 0 0 0 31895141 4096 0 62304 {1676357648 659620958} {1676376180 696363713} {1676376180 696363713} [0 0]}}, Source: /var/log/pods/elastic_esdevazk8swe1-3-kb-84795bfcd6-cd984_340bb25a-a329-4d4b-b934-e8605c3d4ee8/kibana/0.log, Offset: 29901889, Timestamp: 2023-02-14 12:03:01.723335032 +0000 UTC m=+149.690897691, TTL: -1ns, Type: container, Meta: map[], FileStateOS: 4927659-2049}    
2023-02-14T12:03:02.609Z          autodiscover/autodiscover.go  267     Got a stop event.       aad-pod-identity-nmi-5jm82
2023-02-14T12:03:02.609Z          autodiscover/autodiscover.go  156     Reloading existing autodiscover configs after error <--- HERE 
2023-02-14T12:03:02.609Z                       cfgfile/list.go  64      Starting reload procedure, current runners: 2   
2023-02-14T12:03:02.609Z                       cfgfile/list.go  82      Start list: 1, Stop list: 0     
2023-02-14T12:03:02.610Z                       cfgfile/list.go  107     Error creating runner from config: Can only start an input when all related states are finished: {Id: native::4927659-2049, Finished: false, Fileinfo: &{0.log 31895141 416 {696363713 63811972980 0xaaaad7612520} {2049 4927659 33184 1 0 0 0 0 31895141 4096 0 62304 {1676357648 659620958} {1676376180 696363713} {1676376180 696363713} [0 0]}}, Source: /var/log/pods/elastic_esdevazk8swe1-3-kb-84795bfcd6-cd984_340bb25a-a329-4d4b-b934-e8605c3d4ee8/kibana/0.log, Offset: 29901889, Timestamp: 2023-02-14 12:03:01.723335032 +0000 UTC m=+149.690897691, TTL: -1ns, Type: container, Meta: map[], FileS
tateOS: 4927659-2049}    
2023-02-14T12:03:02.610Z          autodiscover/autodiscover.go  267     Got a stop event.       aad-pod-identity-nmi-5jm82
2023-02-14T12:03:02.610Z          autodiscover/autodiscover.go  156     Reloading existing autodiscover configs after error <--- HERE

Ultimately, the following frequency of configuration reloads was observed:

   1 2023-02-14T12:03:01
  66 2023-02-14T12:03:02
 113 2023-02-14T12:03:12 <--- HERE
  27 2023-02-14T12:03:13
   6 2023-02-14T12:03:22
   5 2023-02-14T12:03:32
  79 2023-02-14T12:03:33
   1 2023-02-14T12:03:42
  28 2023-02-14T12:03:43
   1 2023-02-14T12:03:52
  66 2023-02-14T12:03:53
...

Enhancement request

Autodiscover should filter Node and Namespace object modifications that are not changing its metadata.

Workarounds

If hints are not used, Node and Namespace watchers can be disabled entirely (src) with the following configuration:

filebeat:
  autodiscover:
    providers:
    - add_resource_metadata:
        namespace:
          enabled: false
        node:
          enabled: false
      hints.enabled: false
ChrsMark commented 1 year ago

FYI @gizas @rameshelastic @mlunadia

gizas commented 9 months ago

@gbanasiak closing this as completed after https://github.com/elastic/beats/issues/37338