fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0
5.73k stars · 1.56k forks

[3.1.x] input:kubernetes_events getaddrinfo errors #9144

Closed HaveFun83 closed 1 month ago

HaveFun83 commented 1 month ago

Bug Report

Describe the bug

Since fluent-bit v3.1.x we see getaddrinfo errors in conjunction with the kubernetes_events input plugin. The pod gets stuck on "upstream connection initialization error" and no k8s events are processed.

To Reproduce

Expected behavior

Your Environment

* Environment name and version (e.g. Kubernetes? What version?): kubernetes
* Server type and version: v1.28.10
* Operating System and version:  Debian GNU/Linux 11
* Filters and plugins: see config

**Additional context**

We rolled back to fluent-bit v3.0.7 and k8s events are forwarded again.
The log shows some minor errors:

fluent-bit-events-7d9c5f9dc8-nvz62 fluent-bit [2024/07/30 11:11:04] [error] [http_client] broken connection to kubernetes.default.svc:443 ?
fluent-bit-events-7d9c5f9dc8-nvz62 fluent-bit [2024/07/30 11:11:04] [ warn] [input:kubernetes_events:kubernetes_events.0] kubernetes chunked stream error.
fluent-bit-events-7d9c5f9dc8-nvz62 fluent-bit [2024/07/30 11:11:04] [ info] [input:kubernetes_events:kubernetes_events.0] Requesting /api/v1/events?watch=1&resourceVersion=970938198
fluent-bit-events-7d9c5f9dc8-nvz62 fluent-bit [2024/07/30 11:44:53] [error] [http_client] broken connection to kubernetes.default.svc:443 ?
fluent-bit-events-7d9c5f9dc8-nvz62 fluent-bit [2024/07/30 11:44:53] [ warn] [input:kubernetes_events:kubernetes_events.0] kubernetes chunked stream error.
fluent-bit-events-7d9c5f9dc8-nvz62 fluent-bit [2024/07/30 11:44:54] [ info] [input:kubernetes_events:kubernetes_events.0] Requesting /api/v1/events?watch=1&resourceVersion=970980507

edsiper commented 1 month ago

@ryanohnemus can this be related to the newer watcher?

ryanohnemus commented 1 month ago

@edsiper I would not expect the getaddrinfo error to be from the streaming changes. The error there is on plugin init, which is well before the kubernetes api watch occurs. Were there any underlying network changes between 3.0.7 and 3.1.3 that could impact getaddrinfo?

@HaveFun83 from your additional context log that error is when the kube api disconnects your watch stream, which can be expected to align with k8s api max call time limits. Is it reconnecting and getting frozen (no new event updates) or are there any other errors?

I'll add some tests to see if I can recreate the issue. Any extra details would be appreciated!

HaveFun83 commented 1 month ago

I enabled debug logging and got the following:

fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] Configuration:
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info]  flush time     | 1.000000 seconds
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info]  grace          | 5 seconds
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info]  daemon         | 0
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] ___________
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info]  inputs:
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info]      kubernetes_events
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] ___________
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info]  filters:
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info]      nest.0
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info]      nest.1
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info]      lua.2
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info]      modify.3
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info]      grep.4
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info]      modify.5
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] ___________
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info]  outputs:
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info]      gelf.0
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] ___________
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info]  collectors:
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] [fluent bit] version=3.1.4, commit=c767acfe6c, pid=1
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] [storage] ver=1.5.2, type=memory, sync=normal, checksum=off, max_chunks_up=128
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] [cmetrics] version=0.9.1
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] [ctraces ] version=0.5.2
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] [input:kubernetes_events:kubernetes_events.0] initializing
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] [input:kubernetes_events:kubernetes_events.0] storage_strategy='memory' (memory only)
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] [input:kubernetes_events:kubernetes_events.0] API server: https://kubernetes.default.svc:443
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [debug] [input:kubernetes_events:kubernetes_events.0] [thread init] initialization OK
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] [input:kubernetes_events:kubernetes_events.0] thread instance initialized
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [debug] [kubernetes_events:kubernetes_events.0] created event channels: read=30 write=31
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [debug] [filter:modify:modify.3] Initialized modify filter with 0 conditions and 4 rules
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] [filter:grep:grep.4] OR mode
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [debug] [filter:modify:modify.5] Initialized modify filter with 0 conditions and 1 rules
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [debug] [gelf:gelf.0] created event channels: read=34 write=35
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:19] [ info] [sp] stream processor started
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:20] [debug] [input coro] destroy coro_id=1
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:20] [debug] [input coro] destroy coro_id=2
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:21] [debug] [input coro] destroy coro_id=3
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:21] [debug] [input coro] destroy coro_id=4
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:22] [debug] [input coro] destroy coro_id=5
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:22] [debug] [input coro] destroy coro_id=6
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:23] [debug] [input coro] destroy coro_id=7
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:23] [debug] [input coro] destroy coro_id=8
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:24] [debug] [input coro] destroy coro_id=9
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:24] [debug] [input coro] destroy coro_id=10
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:25] [debug] [input coro] destroy coro_id=11
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:25] [debug] [input coro] destroy coro_id=12
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:26] [debug] [input coro] destroy coro_id=13
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:26] [debug] [input coro] destroy coro_id=14
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:27] [debug] [input coro] destroy coro_id=15
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:27] [debug] [input coro] destroy coro_id=16
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:28] [debug] [input coro] destroy coro_id=17
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:28] [debug] [input coro] destroy coro_id=18
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:28] [ warn] [net] getaddrinfo(host='kubernetes.default.svc', err=12): Timeout while contacting DNS servers
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:28] [debug] [upstream] connection #-1 failed to kubernetes.default.svc:443
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:28] [error] [input:kubernetes_events:kubernetes_events.0] upstream connection initialization error
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:28] [debug] [input coro] destroy coro_id=0
fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:29] [debug] [input coro] destroy coro_id=19

The following line repeats constantly with rising coro_id numbers:

fluent-bit-events-855db89f5-ntbh8 fluent-bit [2024/08/01 17:28:29] [debug] [input coro] destroy coro_id=xxx

The application freezes and no events are forwarded.

I also looked into the kube-apiserver logs, but I can't see anything suspicious.

ryanohnemus commented 1 month ago

@edsiper #9154 fixes a potential deadlock that prevents the kubernetes_events plugin from reconnecting the stream after a clean disconnect from the kubernetes api server or after an error during connection initialization. Hopefully that can be merged as part of the 3.1.5 milestone.

That change will ensure that the kubernetes_events plugin will continue to try to connect to the kube api when there's been an initialization error (which gets stuck behind the deadlock and I believe is the root cause of this issue).

The underlying problem triggering the deadlock could be a bad pod IP policy or transient DNS issues in the k8s cluster (I did not see any changes to flb_upstream_conn_get, which is called to get a socket connection to the api server and performs the getaddrinfo resolution behind the scenes).
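For anyone debugging similar DNS timeouts, a rough standalone resolution check run from inside the pod can help separate cluster DNS problems from fluent-bit itself. This is a hypothetical diagnostic, not fluent-bit code, and note that fluent-bit uses its own async resolver, so behavior may differ from the system getaddrinfo:

```c
/* Hypothetical diagnostic: resolve a hostname with getaddrinfo and
 * print the addresses, to check cluster DNS from inside the pod. */
#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <sys/socket.h>
#include <arpa/inet.h>

int main(int argc, char **argv)
{
    const char *host = argc > 1 ? argv[1] : "kubernetes.default.svc";
    struct addrinfo hints, *res, *rp;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;      /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;

    int rc = getaddrinfo(host, "443", &hints, &res);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo(%s) failed: %s\n",
                host, gai_strerror(rc));
        return 1;
    }
    for (rp = res; rp != NULL; rp = rp->ai_next) {
        char buf[INET6_ADDRSTRLEN];
        void *addr = (rp->ai_family == AF_INET)
            ? (void *)&((struct sockaddr_in *)rp->ai_addr)->sin_addr
            : (void *)&((struct sockaddr_in6 *)rp->ai_addr)->sin6_addr;
        inet_ntop(rp->ai_family, addr, buf, sizeof(buf));
        printf("%s -> %s\n", host, buf);
    }
    freeaddrinfo(res);
    return 0;
}
```

If this times out inside the pod while fluent-bit v3.0.x worked, the problem is likely in cluster DNS (e.g. CoreDNS load or a network policy) rather than in the plugin.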

edsiper commented 1 month ago

thanks @ryanohnemus , #9154 has been merged and will be part of 3.1.5

HaveFun83 commented 1 month ago

thanks a lot for the quick support :heart:

edsiper commented 1 month ago

fixed by https://github.com/fluent/fluent-bit/pull/9154