fluent / fluent-operator

Operate Fluent Bit and Fluentd in the Kubernetes way - Previously known as FluentBit Operator
Apache License 2.0

bug: "Too many open files" reported by all fluentbit pod #854

Open marshtompsxd opened 1 year ago

marshtompsxd commented 1 year ago

Describe the issue

Hi FluentBit developers, thanks for building this awesome operator!

I was following the guide to set up fluentbit and stream logs to Kafka. The operator brings up the fluentbit DaemonSet, but each fluentbit process keeps restarting and reporting:

...
level=info msg="backoff timer done" actual=1m4.0014963s expected=1m4s
level=info msg="Fluent bit started"
Fluent Bit v1.8.11
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2023/07/28 16:19:05] [ info] [engine] started (pid=64)
[2023/07/28 16:19:05] [ info] [storage] version=1.1.5, initializing...
[2023/07/28 16:19:05] [ info] [storage] in-memory
[2023/07/28 16:19:05] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2023/07/28 16:19:05] [ info] [cmetrics] version=0.2.2
[2023/07/28 16:19:05] [error] [lib] backend failed
[2023/07/28 16:19:05] [error] [plugins/in_tail/tail_fs_inotify.c:305 errno=24] Too many open files
[2023/07/28 16:19:05] [error] Failed initialize input tail.0
level=error msg="Fluent bit exited" error="exit status 255"
level=info msg=backoff delay=2m8s

When I check Kafka, I find no logs have been streamed to it.
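A minimal way to verify this on the Kafka side looks roughly like the sketch below. The broker address and topic are the ones the kafka output is configured with (they also appear in the logs further down); the consumer image tag is a hypothetical example, not the exact command used here.

```shell
# Hypothetical check of the Kafka side: run a throwaway consumer against the topic
# Fluent Bit writes to and see whether any records arrive.
kubectl -n kafka run kafka-consumer -it --rm --restart=Never \
  --image=quay.io/strimzi/kafka:0.36.1-kafka-3.5.1 -- \
  bin/kafka-console-consumer.sh \
  --bootstrap-server my-cluster-kafka-brokers.kafka.svc:9092 \
  --topic fluent-log --from-beginning
```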

To Reproduce

Follow the guide.

Expected behavior

All fluentbit processes should successfully stream logs to Kafka.

Your Environment

- Fluent Operator version: `kubesphere/fluent-operator:v1.0.2`
- Container Runtime: `containerd://1.6.19-46-g941215f49`
- Operating system: `Ubuntu 22.04.2 LTS`
- Kernel version: `5.10.76-linuxkit`

My environment is a kind cluster with one control-plane node and three worker nodes running on a MacBook (a reconstructed config sketch follows the node list):

NAME                 STATUS   ROLES           AGE   VERSION
kind-control-plane   Ready    control-plane   17m   v1.26.3
kind-worker          Ready    <none>          16m   v1.26.3
kind-worker2         Ready    <none>          16m   v1.26.3
kind-worker3         Ready    <none>          16m   v1.26.3
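
A kind config along these lines reproduces that node layout; this is a reconstruction, not the exact config used.

```shell
# Sketch of a kind cluster matching the node list above (reconstruction, not the exact
# config used in this report). kindest/node:v1.26.3 matches the reported k8s version.
cat <<EOF | kind create cluster --image kindest/node:v1.26.3 --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker
EOF
```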


### How did you install fluent operator?

By running https://github.com/kubesphere-sigs/fluent-operator-walkthrough/blob/master/deploy-fluent-operator.sh
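
Roughly, that amounts to the following sketch; the repo and script name come from the link above.

```shell
# Sketch of the install path above: clone the walkthrough repo and run its deploy script.
git clone https://github.com/kubesphere-sigs/fluent-operator-walkthrough.git
cd fluent-operator-walkthrough
./deploy-fluent-operator.sh
```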

### Additional context

_No response_
benjaminhuo commented 1 year ago

@marshtompsxd The version in the walkthrough is old. I suggest installing the latest fluent operator and the latest fluentbit: https://github.com/fluent/fluent-operator#deploy-fluent-operator-with-helm
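
For reference, the Helm-based install in that README looks roughly like the sketch below. The release URL, chart version, and namespace are assumptions following the README's pattern; check the linked section for the exact current command.

```shell
# Rough sketch of the Helm install referenced above (URL, version, and namespace are
# assumptions; follow the linked README for the current command). containerRuntime
# matches the reporter's containerd runtime.
helm install fluent-operator --create-namespace -n fluent \
  --set containerRuntime=containerd \
  https://github.com/fluent/fluent-operator/releases/download/v2.4.0/fluent-operator.tgz
```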

marshtompsxd commented 1 year ago

Hi @benjaminhuo, thanks for your reply. I tried installing the v2.4.0 operator with v2.1.7 fluentbit, but the same error happens:

level=info time=2023-07-28T17:47:18Z msg="backoff timer done" actual=2m8.0001472s expected=2m8s
level=info time=2023-07-28T17:47:18Z msg="Fluent bit started"
Fluent Bit v2.1.7
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2023/07/28 17:47:18] [ info] [fluent bit] version=2.1.7, commit=e70a93cfdb, pid=45
[2023/07/28 17:47:18] [ info] [storage] ver=1.4.0, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2023/07/28 17:47:18] [ info] [cmetrics] version=0.6.3
[2023/07/28 17:47:18] [ info] [ctraces ] version=0.3.1
[2023/07/28 17:47:18] [ info] [input:tail:tail.0] initializing
[2023/07/28 17:47:18] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2023/07/28 17:47:18] [error] [/src/fluent-bit/plugins/in_tail/tail_fs_inotify.c:347 errno=24] Too many open files
[2023/07/28 17:47:18] [error] [lib] backend failed
[2023/07/28 17:47:18] [error] failed initialize input tail.0
[2023/07/28 17:47:18] [error] [engine] input initialization failed
level=error time=2023-07-28T17:47:18Z msg="Fluent bit exited" error="exit status 255"
level=info time=2023-07-28T17:47:18Z msg=backoff delay=4m16s
marshtompsxd commented 1 year ago

Interestingly, when I run the same setup in a kind cluster with a single node

kind-control-plane   Ready    control-plane   6m13s   v1.26.3

the error disappears:

[2023/07/28 18:01:27] [ info] [fluent bit] version=2.1.7, commit=e70a93cfdb, pid=17
[2023/07/28 18:01:27] [ info] [storage] ver=1.4.0, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2023/07/28 18:01:27] [ info] [cmetrics] version=0.6.3
[2023/07/28 18:01:27] [ info] [ctraces ] version=0.3.1
[2023/07/28 18:01:27] [ info] [input:tail:tail.0] initializing
[2023/07/28 18:01:27] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2023/07/28 18:01:27] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2023/07/28 18:01:27] [ info] [filter:kubernetes:kubernetes.0]  token updated
[2023/07/28 18:01:27] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2023/07/28 18:01:27] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2023/07/28 18:01:27] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
[2023/07/28 18:01:27] [ info] [output:kafka:kafka.0] brokers='my-cluster-kafka-brokers.kafka.svc:9092' topics='fluent-log'
[2023/07/28 18:01:27] [ info] [sp] stream processor started
[2023/07/28 18:01:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=920028 watch_fd=1 name=/var/log/containers/coredns-787d4945fb-8hl87_kube-system_coredns-2f5c52d0e736ad1f9da7d5b42c006fdf42ad43ff23d6c54dac26d94d35fe8e84.log
[2023/07/28 18:01:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=920033 watch_fd=2 name=/var/log/containers/coredns-787d4945fb-wvmfs_kube-system_coredns-327111b674292a519c3a18dc1defc7a8a74a228155d23db24e378f4ca79aaed7.log
[2023/07/28 18:01:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=919770 watch_fd=3 name=/var/log/containers/etcd-kind-control-plane_kube-system_etcd-c8bd0c4ddc5b8a372502eae3c28a7d8dc94a8c3524323e3d323957f9fb56a9d0.log
[2023/07/28 18:01:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=921126 watch_fd=4 name=/var/log/containers/fluent-operator-8596459ddc-jpm8s_fluent_fluent-operator-d1ec1e6e86b828ce7c7f87c433c4ee51ec4a41d779c6841cc38892d8feb4f4be.log
[2023/07/28 18:01:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=921107 watch_fd=5 name=/var/log/containers/fluent-operator-8596459ddc-jpm8s_fluent_setenv-6bd602680aec42f6cc7efab9b66702f0e4323f935d1a80dcf10a23e21fb51f49.log
[2023/07/28 18:01:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=919870 watch_fd=6 name=/var/log/containers/kindnet-wmlcz_kube-system_kindnet-cni-e96962875f9d2655c6698da8e7012dc167a4ccdd307bf3699474fef9326155f9.log
[2023/07/28 18:01:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=919764 watch_fd=7 name=/var/log/containers/kube-apiserver-kind-control-plane_kube-system_kube-apiserver-d955f1e1c81f368f9f759dd2c87169f25b34e89f731e10246662f50aad9dcb7e.log
[2023/07/28 18:01:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=919757 watch_fd=8 name=/var/log/containers/kube-controller-manager-kind-control-plane_kube-system_kube-controller-manager-84c85d37e1a90518865a22fc71645d60a1be6b6b1de41f8b160e58e60a018c41.log
[2023/07/28 18:01:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=919855 watch_fd=9 name=/var/log/containers/kube-proxy-ktms7_kube-system_kube-proxy-c8822e0450c61420cad5288f0ca24ba763f42605a4a425481a3a2a9c1016f3ae.log
[2023/07/28 18:01:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=919753 watch_fd=10 name=/var/log/containers/kube-scheduler-kind-control-plane_kube-system_kube-scheduler-ade9bed26b4457f935bb65b54239a76383083c0a2d14a1c3dccba6c21f266287.log
[2023/07/28 18:01:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=920051 watch_fd=11 name=/var/log/containers/local-path-provisioner-75f5b54ffd-q4ddg_local-path-storage_local-path-provisioner-1d95f9639add8e5cdc23c920e3aae7f7752f5a810edc99be7fff77b7a820986c.log
[2023/07/28 18:01:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=920528 watch_fd=12 name=/var/log/containers/my-cluster-entity-operator-544bdfcc95-9w9bx_kafka_tls-sidecar-405cef2fbd7b29fed36b036381276536a056df9e22b2db92c282a7dc3dc4a33e.log
[2023/07/28 18:01:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=920486 watch_fd=13 name=/var/log/containers/my-cluster-entity-operator-544bdfcc95-9w9bx_kafka_topic-operator-f956926cfc40401b40c5b3382674a39cfac32a449f33047ee293a1cac39e102a.log
[2023/07/28 18:01:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=920503 watch_fd=14 name=/var/log/containers/my-cluster-entity-operator-544bdfcc95-9w9bx_kafka_user-operator-b5980641ac0788282b12bdfbc33d2172d812434365786da409ff092a8828182a.log
[2023/07/28 18:01:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=920351 watch_fd=15 name=/var/log/containers/my-cluster-kafka-0_kafka_kafka-4ffdf9e7ec2f6813abd37ffa330b4f7318808d88def3f28b1844ab9aafdaef2d.log
[2023/07/28 18:01:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=920230 watch_fd=16 name=/var/log/containers/my-cluster-zookeeper-0_kafka_zookeeper-77b44fe475300c16485accf07735c2c8084f580110dbdc4b69868f7943cd439b.log
[2023/07/28 18:01:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=920128 watch_fd=17 name=/var/log/containers/strimzi-cluster-operator-588fc86bcc-w7zdq_kafka_strimzi-cluster-operator-e8bd5d11b2267fd3ad0c255fa57d2ba08a72791be277c4d26a9a2caad522f96a.log
[2023/07/28 18:01:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=921193 watch_fd=18 name=/var/log/containers/fluent-bit-wn5zj_fluent_fluent-bit-29207ef999768ef58f7166393a9ecfbc0095b21276ae750c8101d64ea47d6870.log
benjaminhuo commented 1 year ago

> I tried installing the v2.4.0 operator with v2.1.7 fluentbit, but the same error happens

@marshtompsxd Would you give more info about the cluster in which you see this error, such as the node OS version, k8s version, and node type? We'll take a look. @wenchajun @wanjunlei

marshtompsxd commented 1 year ago

Thanks!

The node OS version is Ubuntu 22.04.2 LTS (kernel 5.10.76-linuxkit). The k8s version is 1.26.3. The cluster is a kind cluster running on my local MacBook.

wanjunlei commented 1 year ago

@marshtompsxd You can try modifying the nofile parameter. I haven't thought of a quick way to change it; it can only be achieved by rebuilding the image.

Add `RUN ulimit -n 65535` to the Dockerfile, then build the image with `make build-fb`.
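
A rough sketch of that rebuild, run from a fluent-operator checkout; the Dockerfile path is an assumption about the repo layout, while `make build-fb` is the Makefile target named above.

```shell
# Sketch of the workaround above: append the ulimit line to the Fluent Bit watcher
# Dockerfile (path is an assumption) and rebuild the image.
printf '\nRUN ulimit -n 65535\n' >> cmd/fluent-watcher/fluentbit/Dockerfile
make build-fb
```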