While addressing https://github.com/elastic/beats/issues/23139 Node and Namespace watchers were added in Autodicover to make Node/Namespace metadata changes instantly available to the related Pods. Currently, when a Node modification is received via a watcher, all Pods located on this Node are retrieved from a watcher store (src) and stop and start Pod events are emitted (src). A single Pod can emit multiple start and stop events (src). This happens regardless of the nature of the change of the Node object.
In the environment specified in Setup section, it was observed that Node objects can be updated as frequently as every 10 seconds. Filebeat logs do not allow to determine what exactly was received via watch APIs, but this can be verified as follows:
Node objects were modified in .status.conditions[] array which is related to Node monitoring process. The only field modified was lastHeartbeatTime. Here's an example of message received via watch API:
```
{
"type": "MODIFIED",
"object": {
"kind": "Node",
"apiVersion": "v1",
"metadata": {
"name": "", ...
},
"spec": { ...
},
"status": {
"capacity": { ...
},
"allocatable": { ...
},
"conditions": [
{
"type": "FilesystemCorruptionProblem",
"status": "False",
"lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
"lastTransitionTime": "2023-02-09T07:41:54Z",
"reason": "FilesystemIsOK",
"message": "Filesystem is healthy"
},
{
"type": "ContainerRuntimeProblem",
"status": "False",
"lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
"lastTransitionTime": "2023-02-10T20:37:25Z",
"reason": "ContainerRuntimeIsUp",
"message": "container runtime service is up"
},
{
"type": "KubeletProblem",
"status": "False",
"lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
"lastTransitionTime": "2023-02-10T20:37:25Z",
"reason": "KubeletIsUp",
"message": "kubelet service is up"
},
{
"type": "FrequentDockerRestart",
"status": "False",
"lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
"lastTransitionTime": "2023-02-09T07:41:54Z",
"reason": "NoFrequentDockerRestart",
"message": "docker is functioning properly"
},
{
"type": "FrequentUnregisterNetDevice",
"status": "False",
"lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
"lastTransitionTime": "2023-02-09T07:41:54Z",
"reason": "NoFrequentUnregisterNetDevice",
"message": "node is functioning properly"
},
{
"type": "VMEventScheduled",
"status": "False",
"lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
"lastTransitionTime": "2023-02-10T20:37:45Z",
"reason": "NoVMEventScheduled",
"message": "VM has no scheduled event"
},
{
"type": "ReadonlyFilesystem",
"status": "False",
"lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
"lastTransitionTime": "2023-02-09T07:41:54Z",
"reason": "FilesystemIsNotReadOnly",
"message": "Filesystem is not read-only"
},
{
"type": "FrequentContainerdRestart",
"status": "False",
"lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
"lastTransitionTime": "2023-02-09T07:41:54Z",
"reason": "NoFrequentContainerdRestart",
"message": "containerd is functioning properly"
},
{
"type": "KernelDeadlock",
"status": "False",
"lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
"lastTransitionTime": "2023-02-09T07:41:54Z",
"reason": "KernelHasNoDeadlock",
"message": "kernel has no deadlock"
},
{
"type": "FrequentKubeletRestart",
"status": "False",
"lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
"lastTransitionTime": "2023-02-09T07:41:54Z",
"reason": "NoFrequentKubeletRestart",
"message": "kubelet is functioning properly"
},
{
"type": "MemoryPressure",
"status": "False",
"lastHeartbeatTime": "2023-02-27T08:15:15Z", <--- HERE
"lastTransitionTime": "2023-02-10T20:38:18Z",
"reason": "KubeletHasSufficientMemory",
"message": "kubelet has sufficient memory available"
},
{
"type": "DiskPressure",
"status": "False",
"lastHeartbeatTime": "2023-02-27T08:15:15Z", <--- HERE
"lastTransitionTime": "2023-02-10T20:38:18Z",
"reason": "KubeletHasNoDiskPressure",
"message": "kubelet has no disk pressure"
},
{
"type": "PIDPressure",
"status": "False",
"lastHeartbeatTime": "2023-02-27T08:15:15Z", <--- HERE
"lastTransitionTime": "2023-02-10T20:38:18Z",
"reason": "KubeletHasSufficientPID",
"message": "kubelet has sufficient PID available"
},
{
"type": "Ready",
"status": "True",
"lastHeartbeatTime": "2023-02-27T08:15:15Z", <--- HERE
"lastTransitionTime": "2023-02-10T20:38:18Z",
"reason": "KubeletReady",
"message": "kubelet is posting ready status. AppArmor enabled"
}
],
"addresses": [ ...
],
"daemonEndpoints": { ...
},
"nodeInfo": { ...
},
"images": [ ...
],
"volumesInUse": [ ...
],
"volumesAttached": [ ...
]
}
}
}
```
As Node metadata is not modified, the entire configuration reload process that follows is unnecessary.
Here's an excerpt from Filebeat debug log that shows that every 10s a sequence of 4 stop and 4 start events is emitted, just for a single Pod (there are many Pods on a single Node, so this is multiplied):
2023-02-14T12:03:02.406Z autodiscover/autodiscover.go 267 Got a stop event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.406Z autodiscover/autodiscover.go 267 Got a stop event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.407Z autodiscover/autodiscover.go 267 Got a stop event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.407Z autodiscover/autodiscover.go 267 Got a stop event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.604Z autodiscover/autodiscover.go 182 Got a start event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.604Z autodiscover/autodiscover.go 182 Got a start event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.605Z autodiscover/autodiscover.go 182 Got a start event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.607Z autodiscover/autodiscover.go 182 Got a start event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.091Z autodiscover/autodiscover.go 267 Got a stop event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.092Z autodiscover/autodiscover.go 267 Got a stop event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.093Z autodiscover/autodiscover.go 267 Got a stop event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.093Z autodiscover/autodiscover.go 267 Got a stop event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.099Z autodiscover/autodiscover.go 182 Got a start event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.100Z autodiscover/autodiscover.go 182 Got a start event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.101Z autodiscover/autodiscover.go 182 Got a start event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.102Z autodiscover/autodiscover.go 182 Got a start event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.415Z autodiscover/autodiscover.go 267 Got a stop event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.415Z autodiscover/autodiscover.go 267 Got a stop event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.416Z autodiscover/autodiscover.go 267 Got a stop event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.417Z autodiscover/autodiscover.go 267 Got a stop event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.715Z autodiscover/autodiscover.go 182 Got a start event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.716Z autodiscover/autodiscover.go 182 Got a start event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.717Z autodiscover/autodiscover.go 182 Got a start event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.718Z autodiscover/autodiscover.go 182 Got a start event. esdevazk8swe1-3-kb-84795bfcd6-cd984
With the above template configuration most of these start/stop should be ignored, but not events for Kibana Pods. Here is an interesting debug log excerpt which shows a sequence of stop and start events that lead to a runner error potentially caused by log input race condition (see https://github.com/elastic/beats/issues/34388#issuecomment-1439816785). Once such runner error occurs Autodiscover worker initiates a configuration reload on every received event (src).
2023-02-14T12:03:02.406Z autodiscover/autodiscover.go 267 Got a stop event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.406Z autodiscover/autodiscover.go 267 Got a stop event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.406Z cfgfile/list.go 64 Starting reload procedure, current runners: 3
2023-02-14T12:03:02.406Z cfgfile/list.go 82 Start list: 0, Stop list: 1
2023-02-14T12:03:02.407Z autodiscover/autodiscover.go 267 Got a stop event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.407Z cfgfile/list.go 64 Starting reload procedure, current runners: 2
2023-02-14T12:03:02.407Z cfgfile/list.go 82 Start list: 0, Stop list: 1
2023-02-14T12:03:02.407Z autodiscover/autodiscover.go 267 Got a stop event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.407Z cfgfile/list.go 64 Starting reload procedure, current runners: 1
2023-02-14T12:03:02.407Z cfgfile/list.go 82 Start list: 0, Stop list: 1
2023-02-14T12:03:02.604Z autodiscover/autodiscover.go 182 Got a start event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.604Z autodiscover/autodiscover.go 182 Got a start event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.605Z cfgfile/list.go 64 Starting reload procedure, current runners: 0
2023-02-14T12:03:02.605Z cfgfile/list.go 82 Start list: 1, Stop list: 0
2023-02-14T12:03:02.605Z cfgfile/list.go 107 Error creating runner from config: Can only start an input when all related states are finished: {Id: native::4927659-2049, Finished: false, Fileinfo: &{0.log 31895141 416 {696363713 63811972980 0xaaaad7612520} {2049 4927659 33184 1 0 0 0 0 31895141 4096 0 62304 {1676357648 659620958} {1676376180 696363713} {1676376180 696363713} [0 0]}}, Source: /var/log/pods/elastic_esdevazk8swe1-3-kb-84795bfcd6-cd984_340bb25a-a329-4d4b-b934-e8605c3d4ee8/kibana/0.log, Offset: 29901889, Timestamp: 2023-02-14 12:03:01.723335032 +0000 UTC m=+149.690897691, TTL: -1ns, Type: container, Meta: map[], FileStateOS: 4927659-2049}
2023-02-14T12:03:02.605Z autodiscover/autodiscover.go 182 Got a start event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.606Z autodiscover/autodiscover.go 156 Reloading existing autodiscover configs after error
2023-02-14T12:03:02.606Z cfgfile/list.go 64 Starting reload procedure, current runners: 0
2023-02-14T12:03:02.606Z cfgfile/list.go 82 Start list: 2, Stop list: 0
2023-02-14T12:03:02.607Z cfgfile/list.go 107 Error creating runner from config: Can only start an input when all related states are finished: {Id: native::4927659-2049, Finished: false, Fileinfo: &{0.log 31895141 416 {696363713 63811972980 0xaaaad7612520} {2049 4927659 33184 1 0 0 0 0 31895141 4096 0 62304 {1676357648 659620958} {1676376180 696363713} {1676376180 696363713} [0 0]}}, Source: /var/log/pods/elastic_esdevazk8swe1-3-kb-84795bfcd6-cd984_340bb25a-a329-4d4b-b934-e8605c3d4ee8/kibana/0.log, Offset: 29901889, Timestamp: 2023-02-14 12:03:01.723335032 +0000 UTC m=+149.690897691, TTL: -1ns, Type: container, Meta: map[], FileStateOS: 4927659-2049}
2023-02-14T12:03:02.607Z autodiscover/autodiscover.go 182 Got a start event. esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.608Z autodiscover/autodiscover.go 156 Reloading existing autodiscover configs after error
2023-02-14T12:03:02.608Z cfgfile/list.go 64 Starting reload procedure, current runners: 1
2023-02-14T12:03:02.608Z cfgfile/list.go 82 Start list: 2, Stop list: 0
2023-02-14T12:03:02.609Z cfgfile/list.go 107 Error creating runner from config: Can only start an input when all related states are finished: {Id: native::4927659-2049, Finished: false, Fileinfo: &{0.log 31895141 416 {696363713 63811972980 0xaaaad7612520} {2049 4927659 33184 1 0 0 0 0 31895141 4096 0 62304 {1676357648 659620958} {1676376180 696363713} {1676376180 696363713} [0 0]}}, Source: /var/log/pods/elastic_esdevazk8swe1-3-kb-84795bfcd6-cd984_340bb25a-a329-4d4b-b934-e8605c3d4ee8/kibana/0.log, Offset: 29901889, Timestamp: 2023-02-14 12:03:01.723335032 +0000 UTC m=+149.690897691, TTL: -1ns, Type: container, Meta: map[], FileStateOS: 4927659-2049}
2023-02-14T12:03:02.609Z autodiscover/autodiscover.go 267 Got a stop event. aad-pod-identity-nmi-5jm82
2023-02-14T12:03:02.609Z autodiscover/autodiscover.go 156 Reloading existing autodiscover configs after error <--- HERE
2023-02-14T12:03:02.609Z cfgfile/list.go 64 Starting reload procedure, current runners: 2
2023-02-14T12:03:02.609Z cfgfile/list.go 82 Start list: 1, Stop list: 0
2023-02-14T12:03:02.610Z cfgfile/list.go 107 Error creating runner from config: Can only start an input when all related states are finished: {Id: native::4927659-2049, Finished: false, Fileinfo: &{0.log 31895141 416 {696363713 63811972980 0xaaaad7612520} {2049 4927659 33184 1 0 0 0 0 31895141 4096 0 62304 {1676357648 659620958} {1676376180 696363713} {1676376180 696363713} [0 0]}}, Source: /var/log/pods/elastic_esdevazk8swe1-3-kb-84795bfcd6-cd984_340bb25a-a329-4d4b-b934-e8605c3d4ee8/kibana/0.log, Offset: 29901889, Timestamp: 2023-02-14 12:03:01.723335032 +0000 UTC m=+149.690897691, TTL: -1ns, Type: container, Meta: map[], FileS
tateOS: 4927659-2049}
2023-02-14T12:03:02.610Z autodiscover/autodiscover.go 267 Got a stop event. aad-pod-identity-nmi-5jm82
2023-02-14T12:03:02.610Z autodiscover/autodiscover.go 156 Reloading existing autodiscover configs after error <--- HERE
Ultimately, the following frequency of configuration reloads was observed:
Setup
Observed in Filebeat 8.5.3 running in Azure Kubernetes
v1.24.6
with the following autodiscover configuration:Context
While addressing https://github.com/elastic/beats/issues/23139 Node and Namespace watchers were added in Autodicover to make Node/Namespace metadata changes instantly available to the related Pods. Currently, when a Node modification is received via a watcher, all Pods located on this Node are retrieved from a watcher store (src) and stop and start Pod events are emitted (src). A single Pod can emit multiple start and stop events (src). This happens regardless of the nature of the change of the Node object.
In the environment specified in
Setup
section, it was observed that Node objects can be updated as frequently as every 10 seconds. Filebeat logs do not allow to determine what exactly was received via watch APIs, but this can be verified as follows:Node objects were modified in
.status.conditions[]
array which is related to Node monitoring process. The only field modified waslastHeartbeatTime
. Here's an example of message received via watch API:As Node metadata is not modified, the entire configuration reload process that follows is unnecessary.
Here's an excerpt from Filebeat debug log that shows that every 10s a sequence of 4 stop and 4 start events is emitted, just for a single Pod (there are many Pods on a single Node, so this is multiplied):
With the above template configuration most of these start/stop should be ignored, but not events for Kibana Pods. Here is an interesting debug log excerpt which shows a sequence of stop and start events that lead to a runner error potentially caused by
log
input race condition (see https://github.com/elastic/beats/issues/34388#issuecomment-1439816785). Once such runner error occurs Autodiscover worker initiates a configuration reload on every received event (src).Ultimately, the following frequency of configuration reloads was observed:
Enhancement request
Autodiscover should filter Node and Namespace object modifications that are not changing its metadata.
Workarounds
If
hints
are not used, Node and Namespace watchers can be disabled entirely (src) with the following configuration: