fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Fluent-bit hangs on remote (kubernetes) call and stops collecting logs #6501

Closed gillarda closed 1 year ago

gillarda commented 1 year ago

Bug Report

Describe the bug

We use the Banzai logging operator for Kubernetes to aggregate the logs of our clusters. With the Kubernetes filter, fluent-bit queries the kube-apiserver for additional metadata about the pods. However, we noticed that:

To Reproduce

  1. Deploy a K8S cluster with a single controlplane/etcd node and a single worker node
  2. Set up log collection of K8S pods with fluent-bit (I used the Banzai Cloud operator with a file ClusterOutput for fluentd to reproduce)
  3. Deploy a pod that generates logs
  4. tail the fluentd buffers : logs are being collected
  5. kill -STOP PID on the controlplane node (where PID is the PID of the kube-apiserver) to make the kube-apiserver hang
  6. Wait for fluent-bit to make a request to the apiserver or create a static pod to force that
  7. fluent-bit will hang and the tailing of the fluentd buffers will stop
  8. kill -CONT PID on the controlplane node (where PID is the PID of the kube-apiserver)
  9. log collection resumes
  10. kill -STOP PID on the controlplane node (where PID is the PID of the kube-apiserver)
  11. log collection stops
  12. Hard reset the controlplane node
  13. Wait for the kube-apiserver to come back up
  14. log collection never resumes unless fluent-bit is restarted
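The failure mode in steps 5 to 7 can be approximated without a cluster. The sketch below is not fluent-bit code: it is a minimal Python illustration, assuming that a `kill -STOP`ped server behaves like a listening socket whose owner never calls `accept()` (the kernel still completes the TCP handshake, but no reply is ever sent). The names `probe` and `demo` and the `/healthz` request line are illustrative assumptions. Without a read timeout the client would block indefinitely, which matches the hang described above; with one, the call returns.

```python
import socket


def probe(host, port, timeout_s):
    """Send a tiny request and report whether the peer answered in time.

    Returns "ok" if any bytes came back, "closed" if the peer closed the
    connection, and "timeout" if nothing arrived within timeout_s seconds.
    """
    try:
        # The timeout applies to connect() and to later send/recv calls.
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(b"GET /healthz HTTP/1.0\r\n\r\n")  # illustrative request
            data = conn.recv(1024)
            return "ok" if data else "closed"
    except socket.timeout:
        return "timeout"


def demo():
    """Stand-in for a stopped server: it listens but never accepts or replies."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    try:
        port = server.getsockname()[1]
        return probe("127.0.0.1", port, timeout_s=0.5)
    finally:
        server.close()


if __name__ == "__main__":
    print(demo())
```

Whether fluent-bit's Kubernetes filter applies a comparable read timeout to its apiserver connection is exactly what this report questions; the sketch only shows why a missing timeout turns a frozen peer into an indefinite hang.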

Expected behavior

Your Environment

[INPUT]
    Name tail
    DB /tail-db/tail-containers-state.db
    Mem_Buf_Limit 5MB
    Parser docker
    Path /var/log/containers/.log
    Refresh_Interval 5
    Skip_Long_Lines On
    Tag kubernetes.

[FILTER]
    Name kubernetes
    Buffer_Size 0
    Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Tag_Prefix kubernetes.var.log.containers
    Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
    Kube_URL https://kubernetes.default.svc:443
    Match kubernetes.*
    Merge_Log On
    Use_Kubelet Off

[OUTPUT]
    Name forward
    Match *
    Host rancher-logging-root-fluentd.cattle-logging-system.svc.cluster.local
    Port 24240
    Retry_Limit False

* Environment name and version: K8S 1.23.14 
* Operating System and version: Experienced on both CentOS 7 and Rocky 8 nodes
* Filters and plugins: cf. configuration

**Additional context**

We experienced lost logs and were able to link the loss to resource contention on the controlplane nodes that made the kube-apiserver unavailable.
We identified fluent-bit because both `fluentbit_input_bytes_total` and `fluentbit_output_proc_bytes_total` metrics stopped increasing.
No error was found in the logs. Error counters did not increase. Fluent-bit just hung.
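The detection described above (both byte counters flat, no errors logged) can be automated by comparing two scrapes of the metrics endpoint. The following is a rough sketch, assuming the metrics are available in Prometheus text exposition format; the helper names `sum_counters` and `stalled` are hypothetical, and fetching the scrapes is left out.

```python
WATCHED = (
    "fluentbit_input_bytes_total",
    "fluentbit_output_proc_bytes_total",
)


def sum_counters(exposition, names=WATCHED):
    """Sum each watched counter across all label sets in one scrape."""
    totals = dict.fromkeys(names, 0.0)
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # name{label="value",...} value [timestamp]
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            # name value [timestamp]
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if name in totals and fields:
            totals[name] += float(fields[0])
    return totals


def stalled(before, after, names=WATCHED):
    """Return the watched counters that did not increase between two scrapes."""
    old = sum_counters(before, names)
    new = sum_counters(after, names)
    return [name for name in names if new[name] <= old[name]]
```

If both names come back from `stalled` while pods are known to be writing logs, that matches the symptom reported here. A single flat counter on an idle node is not conclusive by itself.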

Below are the backtraces of the various fluent-bit threads while it hangs.

    /# /fluent-bit/bin/fluent-bit --version
    Fluent Bit v2.0.6
    Git commit: 211f841fe2190cf7decc6183bfe697873ec22ff6

    /# gdb /fluent-bit/bin/fluent-bit 1
    GNU gdb (Debian 10.1-1.7) 10.1.90.20210103-git
    Copyright (C) 2021 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.
    Type "show copying" and "show warranty" for details.
    This GDB was configured as "x86_64-linux-gnu".
    Type "show configuration" for configuration details.
    For bug reporting instructions, please see:
    https://www.gnu.org/software/gdb/bugs/.
    Find the GDB manual and other documentation resources online at:
    http://www.gnu.org/software/gdb/documentation/.

    For help, type "help".
    Type "apropos word" to search for commands related to "word"...
    Reading symbols from /fluent-bit/bin/fluent-bit...
    Attaching to program: /fluent-bit/bin/fluent-bit, process 1
    [New LWP 7]
    [New LWP 8]
    [New LWP 9]
    [New LWP 10]
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
    0x00007f01183e3561 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7ffefa0ba090, rem=rem@entry=0x7ffefa0ba090) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
    48      ../sysdeps/unix/sysv/linux/clock_nanosleep.c: No such file or directory.
    (gdb) bt

    #0  0x00007f01183e3561 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7ffefa0ba090, rem=rem@entry=0x7ffefa0ba090)
        at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
    #1  0x00007f01183e8d43 in __GI___nanosleep (requested_time=requested_time@entry=0x7ffefa0ba090, remaining=remaining@entry=0x7ffefa0ba090) at nanosleep.c:27
    #2  0x00007f01183e8c7a in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
    #3  0x0000562b77343999 in flb_main (argc=3, argv=0x7ffefa0ba248) at /src/fluent-bit/src/fluent-bit.c:1245
    #4  0x0000562b77343a20 in main (argc=3, argv=0x7ffefa0ba248) at /src/fluent-bit/src/fluent-bit.c:1264

    (gdb) info threads
      Id   Target Id   Frame

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] commented 1 year ago

This issue was closed because it has been stalled for 5 days with no activity.