Closed HaveFun83 closed 1 month ago
@ryanohnemus can this be related to the newer watcher?
@edsiper I would not expect the getaddrinfo error to be from the streaming changes. The error there is on plugin init, which is well before the kubernetes api watch occurs. Were there any underlying network changes between 3.0.7 and 3.1.3 that could impact getaddrinfo?
@HaveFun83 from your additional context log that error is when the kube api disconnects your watch stream, which can be expected to align with k8s api max call time limits. Is it reconnecting and getting frozen (no new event updates) or are there any other errors?
I'll add some tests to see if I can recreate the issue. Any extra details would be appreciated!
I enabled debug logging and got this here:
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] Configuration:
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] flush time | 1.000000 seconds
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] grace | 5 seconds
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] daemon | 0
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] ___________
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] inputs:
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] kubernetes_events
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] ___________
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] filters:
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] nest.0
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] nest.1
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] lua.2
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] modify.3
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] grep.4
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] modify.5
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] ___________
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] outputs:
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] gelf.0
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] ___________
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] collectors:
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] [fluent bit] version=3.1.4, commit=c767acfe6c, pid=1
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] [storage] ver=1.5.2, type=memory, sync=normal, checksum=off, max_chunks_up=128
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] [cmetrics] version=0.9.1
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] [ctraces ] version=0.5.2
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] [input:kubernetes_events:kubernetes_events.0] initializing
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] [input:kubernetes_events:kubernetes_events.0] storage_strategy='memory' (memory only)
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] [input:kubernetes_events:kubernetes_events.0] API server: https://kubernetes.default.svc:443
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [debug] [input:kubernetes_events:kubernetes_events.0] [thread init] initialization OK
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] [input:kubernetes_events:kubernetes_events.0] thread instance initialized
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [debug] [kubernetes_events:kubernetes_events.0] created event channels: read=30 write=31
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [debug] [filter:modify:modify.3] Initialized modify filter with 0 conditions and 4 rules
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] [filter:grep:grep.4] OR mode
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [debug] [filter:modify:modify.5] Initialized modify filter with 0 conditions and 1 rules
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [debug] [gelf:gelf.0] created event channels: read=34 write=35
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] [sp] stream processor started
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:20] [debug] [input coro] destroy coro_id=1
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:20] [debug] [input coro] destroy coro_id=2
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:21] [debug] [input coro] destroy coro_id=3
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:21] [debug] [input coro] destroy coro_id=4
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:22] [debug] [input coro] destroy coro_id=5
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:22] [debug] [input coro] destroy coro_id=6
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:23] [debug] [input coro] destroy coro_id=7
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:23] [debug] [input coro] destroy coro_id=8
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:24] [debug] [input coro] destroy coro_id=9
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:24] [debug] [input coro] destroy coro_id=10
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:25] [debug] [input coro] destroy coro_id=11
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:25] [debug] [input coro] destroy coro_id=12
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:26] [debug] [input coro] destroy coro_id=13
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:26] [debug] [input coro] destroy coro_id=14
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:27] [debug] [input coro] destroy coro_id=15
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:27] [debug] [input coro] destroy coro_id=16
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:28] [debug] [input coro] destroy coro_id=17
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:28] [debug] [input coro] destroy coro_id=18
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:28] [ warn] [net] getaddrinfo(host='kubernetes.default.svc', err=12): Timeout while contacting DNS servers
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:28] [debug] [upstream] connection #-1 failed to kubernetes.default.svc:443
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:28] [error] [input:kubernetes_events:kubernetes_events.0] upstream connection initialization error
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:28] [debug] [input coro] destroy coro_id=0
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:29] [debug] [input coro] destroy coro_id=19
The following line repeats constantly with rising coro_id numbers:
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:29] [debug] [input coro] destroy coro_id=xxx
The application freezes and no events are forwarded.
I also looked into the kube-apiserver logs but I can't see anything suspicious.
@edsiper #9154 fixes a potential deadlock that prevents the kubernetes_events plugin from reconnecting the stream after a clean disconnect from the kubernetes api server or an error during connection initialization. Hopefully that can be merged as part of the 3.1.5 milestone.
That change ensures the kubernetes_events plugin keeps trying to connect to the kube api after an initialization error (which currently gets stuck behind the deadlock and, I believe, is the root cause of this issue).
The underlying problem triggering the deadlock could be a bad pod IP policy or transient DNS issues in the k8s cluster (I did not see any changes to flb_upstream_conn_get, which is called to get a socket connection to the api server and does the getaddrinfo resolution behind the scenes).
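As a quick way to check whether in-cluster DNS is the trigger, here is a small standalone diagnostic using the same POSIX `getaddrinfo` call that ultimately fails with `err=12` in the logs. `resolve_host` is a hypothetical helper name; run the program from a pod in the affected cluster with `kubernetes.default.svc` as the argument (the `localhost` default is only a safe fallback).

```c
#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <arpa/inet.h>
#include <sys/socket.h>

/* Resolve a host via getaddrinfo (the same libc call fluent-bit's
 * upstream connection path relies on) and print each address found. */
int resolve_host(const char *host)
{
    struct addrinfo hints, *res, *rp;
    char buf[INET6_ADDRSTRLEN];

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;       /* allow IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;

    int rc = getaddrinfo(host, "443", &hints, &res);
    if (rc != 0) {
        /* gai_strerror maps codes like EAI_AGAIN to readable text */
        fprintf(stderr, "getaddrinfo(%s): %s\n", host, gai_strerror(rc));
        return -1;
    }
    for (rp = res; rp != NULL; rp = rp->ai_next) {
        void *addr = (rp->ai_family == AF_INET)
            ? (void *)&((struct sockaddr_in *)rp->ai_addr)->sin_addr
            : (void *)&((struct sockaddr_in6 *)rp->ai_addr)->sin6_addr;
        inet_ntop(rp->ai_family, addr, buf, sizeof(buf));
        printf("%s -> %s\n", host, buf);
    }
    freeaddrinfo(res);
    return 0;
}

int main(int argc, char **argv)
{
    const char *host = (argc > 1) ? argv[1] : "localhost";
    return resolve_host(host) == 0 ? 0 : 1;
}
```

If this intermittently fails from inside the cluster, the problem is in the DNS path (CoreDNS, network policy, conntrack) rather than in fluent-bit itself.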
thanks @ryanohnemus, #9154 has been merged and will be part of 3.1.5
thanks a lot for the quick support :heart:
Bug Report
Describe the bug
Since fluent-bit v3.1.x we see getaddrinfo errors in conjunction with the kubernetes_events input plugin. The pod stays stuck in "upstream connection initialization error" and no k8s events are processed.
To Reproduce
Expected behavior
Your Environment
Configuration:
fluent-bit-events-7d9c5f9dc8-nvz62 fluent-bit [2024/07/30 11:11:04] [error] [http_client] broken connection to kubernetes.default.svc:443 ?
fluent-bit-events-7d9c5f9dc8-nvz62 fluent-bit [2024/07/30 11:11:04] [ warn] [input:kubernetes_events:kubernetes_events.0] kubernetes chunked stream error.
fluent-bit-events-7d9c5f9dc8-nvz62 fluent-bit [2024/07/30 11:11:04] [ info] [input:kubernetes_events:kubernetes_events.0] Requesting /api/v1/events?watch=1&resourceVersion=970938198
fluent-bit-events-7d9c5f9dc8-nvz62 fluent-bit [2024/07/30 11:44:53] [error] [http_client] broken connection to kubernetes.default.svc:443 ?
fluent-bit-events-7d9c5f9dc8-nvz62 fluent-bit [2024/07/30 11:44:53] [ warn] [input:kubernetes_events:kubernetes_events.0] kubernetes chunked stream error.
fluent-bit-events-7d9c5f9dc8-nvz62 fluent-bit [2024/07/30 11:44:54] [ info] [input:kubernetes_events:kubernetes_events.0] Requesting /api/v1/events?watch=1&resourceVersion=970980507