fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Fluent-bit stops sending data #5485

Open merveyilmaz-netrd opened 2 years ago

merveyilmaz-netrd commented 2 years ago

Hi @merveyilmaz-netrd and @icecreamhead, We've released a couple of fixes that handle connection loss and timeout scenarios in 1.8.15 and 1.9.1. I'm closing this issue now, but if you still see the problem, feel free to reopen it or open a new one. We'll gladly assist you further once you provide a repro scenario.

I encountered the same problem in 1.9.3. I am reopening the issue...

Originally posted by @merveyilmaz-netrd in https://github.com/fluent/fluent-bit/issues/4497#issuecomment-1134600222

merveyilmaz-netrd commented 2 years ago

@edsiper @lecaros Any suggestions?

merveyilmaz-netrd commented 2 years ago

I am sharing my reproduction steps:

  1. I configured the fluent-bit service as described here: https://github.com/fluent/fluent-bit/issues/4497#issuecomment-1134600222. (You can use any dummy log generator script; my script generates 95 MB per minute. See the sketch after this list.)
  2. Stop Splunk after a while (for example, 10 minutes later). Since we are using filesystem buffering, data chunks will accumulate on the system. In my case I stopped the Splunk daemon for 2.5 hours and 20 GB of data accumulated on the system.
  3. After 20 GB had accumulated on the system, I started the Splunk daemon and saw that chunks get stuck after a while.
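For anyone trying to reproduce this, a minimal sketch of such a dummy generator (a hypothetical script, not the reporter's actual one; the path matches the tail input in the configuration below, and the rate is approximate):

#!/bin/bash
# Hypothetical log generator: appends ~1 KB synthetic lines to the tailed file.
# Roughly 1600 lines of ~1 KB per second is about 95 MB per minute.
LOGFILE=/var/log/test.log
PAYLOAD=$(head -c 1024 /dev/zero | tr '\0' 'x')
while true; do
    for _ in $(seq 1 1600); do
        echo "$(date '+%Y-%m-%d %H:%M:%S') test-event ${PAYLOAD}" >> "${LOGFILE}"
    done
    sleep 1
done
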
lecaros commented 2 years ago

Hi @merveyilmaz-netrd, can you please share your configuration? It seems you wanted to refer to a comment from issue 4497, but the link is broken. Were you able to reproduce by stopping the Splunk daemon for less time?

merveyilmaz-netrd commented 2 years ago

Here is my configuration:

[SERVICE]
    Flush                     1
    Daemon                    off
    Log_Level                 info
    storage.path              /opt/fbit
    storage.sync              normal
    storage.backlog.mem_limit 250MB
    storage.max_chunks_up     512

[INPUT]
    Name             tail
    Path             /var/log/test.log
    Tag              test2
    storage.type     filesystem
    Buffer_Max_Size  256KB
    Mem_Buf_Limit    512MB
    DB               /var/log/db.file
    DB.sync          normal

[OUTPUT]
    Name             splunk
    Match            test2
    Host             XX.XX.XX.XXX
    Port             8088
    splunk_token     XXXXXXXXXXXXXXXXXXXX
    splunk_send_raw  on
    event_index      my-index
    TLS              On
    TLS.Verify       Off
    Retry_Limit      false

I stopped the Splunk daemon for 1 hour, but the result is the same.

merveyilmaz-netrd commented 2 years ago

@lecaros BTW I keep getting the following logs:

May 26 17:20:44 test-vm-1 fluent-bit: [2022/05/26 17:20:44] [ info] [task] re-schedule retry=0x7fc121374410 2012 in the next 453 seconds
May 26 17:20:44 test-vm-1 fluent-bit: [2022/05/26 17:20:44] [ info] [task] re-schedule retry=0x7fc121318ea0 1841 in the next 313 seconds
May 26 17:20:44 test-vm1 fluent-bit: [2022/05/26 17:20:44] [ info] [task] re-schedule retry=0x7fc12131b7e0 1973 in the next 407 seconds
May 26 17:20:44 test-vm-1 fluent-bit: [2022/05/26 17:20:44] [ info] [task] re-schedule retry=0x7fc121318310 1804 in the next 59 seconds
May 26 17:20:47 test-vm-1 fluent-bit: [2022/05/26 17:20:47] [ info] [task] re-schedule retry=0x7fc12130bf60 1741 in the next 69 seconds
[...]

Fluent-bit could not send the logs buffered on the filesystem. The log size on the filesystem kept increasing, so I restarted the Splunk daemon. After restarting the Splunk daemon, I observed that the following logs were immediately generated by fluent-bit, and then fluent-bit stopped and did not process any logs. (The log size on the filesystem did not increase or decrease.)

May 26 17:53:28 test-vm-1 fluent-bit: [2022/05/26 17:53:28] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/4307-1653575045.7315446.flb
May 26 17:53:28 test-vm-1 fluent-bit: [2022/05/26 17:53:28] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/4307-1653575046.48151237.flb
May 26 17:53:28 test-vm-1 fluent-bit: [2022/05/26 17:53:28] [ info] [input:storage_backlog:storage_backlog.1] register

lecaros commented 2 years ago

Can you share the full log of Fluent Bit? Also the output of ls -lh /opt/fbit and of kill -SIGCONT <fluent-bit-pid>?
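For anyone following along, a minimal sketch of running those commands (assuming fluent-bit runs as a systemd service named fluent-bit and pidof is available; as the reply below shows, SIGCONT makes Fluent Bit print an internal dump of its inputs, tasks, and chunk counters):

# Inspect the on-disk chunk directory used by storage.path
ls -lh /opt/fbit
# Trigger the internal "Fluent Bit Dump" and read it from the service log
kill -SIGCONT "$(pidof fluent-bit)"
journalctl -u fluent-bit --since "1 minute ago"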

merveyilmaz-netrd commented 2 years ago
[root@test-vm testuser]# ls -lh /opt/fbit/
total 768K
drwxr-xr-x 2 root root 764K May 27 08:46 tail.0

kill -SIGCONT pid output:

May 27 12:15:13 test-vm fluent-bit: [2022/05/27 12:15:13] [engine] caught signal (SIGCONT)
May 27 12:15:13 test-vm fluent-bit: [2022/05/27 12:15:13] Fluent Bit Dump
May 27 12:15:13 test-vm fluent-bit: ===== Input =====
May 27 12:15:13 test-vm fluent-bit: tail.0 (tail)
May 27 12:15:13 test-vm fluent-bit: │
May 27 12:15:13 test-vm fluent-bit: ├─ status
May 27 12:15:13 test-vm fluent-bit: │ └─ overlimit : no
May 27 12:15:13 test-vm fluent-bit: │ ├─ mem size : 0b (0 bytes)
May 27 12:15:13 test-vm fluent-bit: │ └─ mem limit : 488.3M (512000000 bytes)
May 27 12:15:13 test-vm fluent-bit: │
May 27 12:15:13 test-vm fluent-bit: ├─ tasks
May 27 12:15:13 test-vm fluent-bit: │ ├─ total tasks : 0
May 27 12:15:13 test-vm fluent-bit: │ ├─ new : 0
May 27 12:15:13 test-vm fluent-bit: │ ├─ running : 0
May 27 12:15:13 test-vm fluent-bit: │ └─ size : 0b (0 bytes)
May 27 12:15:13 test-vm fluent-bit: │
May 27 12:15:13 test-vm fluent-bit: └─ chunks
May 27 12:15:13 test-vm fluent-bit: └─ total chunks : 0
May 27 12:15:13 test-vm fluent-bit: ├─ up chunks : 0
May 27 12:15:13 test-vm fluent-bit: ├─ down chunks: 0
May 27 12:15:13 test-vm fluent-bit: └─ busy chunks: 0
May 27 12:15:13 test-vm fluent-bit: ├─ size : 0b (0 bytes)
May 27 12:15:13 test-vm fluent-bit: └─ size err: 0
May 27 12:15:13 test-vm fluent-bit: storage_backlog.1 (storage_backlog)
May 27 12:15:13 test-vm fluent-bit: │
May 27 12:15:13 test-vm fluent-bit: ├─ status
May 27 12:15:13 test-vm fluent-bit: │ └─ overlimit : no
May 27 12:15:13 test-vm fluent-bit: │ ├─ mem size : 0b (0 bytes)
May 27 12:15:13 test-vm fluent-bit: │ └─ mem limit : 0b (0 bytes)
May 27 12:15:13 test-vm fluent-bit: │
May 27 12:15:13 test-vm fluent-bit: ├─ tasks
May 27 12:15:13 test-vm fluent-bit: │ ├─ total tasks : 0
May 27 12:15:13 test-vm fluent-bit: │ ├─ new : 0
May 27 12:15:13 test-vm fluent-bit: │ ├─ running : 0
May 27 12:15:13 test-vm fluent-bit: │ └─ size : 0b (0 bytes)
May 27 12:15:13 test-vm fluent-bit: │
May 27 12:15:13 test-vm fluent-bit: └─ chunks
May 27 12:15:13 test-vm fluent-bit: └─ total chunks : 0
May 27 12:15:13 test-vm fluent-bit: ├─ up chunks : 0
May 27 12:15:13 test-vm fluent-bit: ├─ down chunks: 0
May 27 12:15:13 test-vm fluent-bit: └─ busy chunks: 0
May 27 12:15:13 test-vm fluent-bit: ├─ size : 0b (0 bytes)
May 27 12:15:13 test-vm fluent-bit: └─ size err: 0
May 27 12:15:13 test-vm fluent-bit: ===== Storage Layer =====
May 27 12:15:13 test-vm fluent-bit: total chunks : 12271
May 27 12:15:13 test-vm fluent-bit: ├─ mem chunks : 0
May 27 12:15:13 test-vm fluent-bit: └─ fs chunks : 12271
May 27 12:15:13 test-vm fluent-bit: ├─ up : 86
May 27 12:15:13 test-vm fluent-bit: └─ down : 12185

merveyilmaz-netrd commented 2 years ago

fluent-bit-log-file.zip

And above is my fluent-bit log file.

merveyilmaz-netrd commented 2 years ago

Were you able to reproduce the scenario? @lecaros

lecaros commented 2 years ago

Hi, I've managed to reproduce the behavior (Fluent Bit not sending data after Splunk was stopped); however, I see some differences in the SIGCONT output. When did you run this? When you sent the SIGCONT signal, were you still reading/producing logs in the monitored files? That could explain the differences. Now I'll try to find a faster way to reproduce this behavior and then look for the root cause.

Thanks!

merveyilmaz-netrd commented 2 years ago

Hi, I've managed to reproduce the behavior (Fluent Bit not sending data after Splunk was stopped); however, I see some differences in the SIGCONT output. When did you run this? When you sent the SIGCONT signal, were you still reading/producing logs in the monitored files? That could explain the differences. Now I'll try to find a faster way to reproduce this behavior and then look for the root cause.

Thanks!

I stopped the Splunk daemon and restarted it after 2.5 hours. The next day I sent the SIGCONT signal. I had continued to generate logs in the monitored files for a while, but by the time I sent the SIGCONT signal I was no longer producing logs, since it was the next day.

nkinkade commented 1 year ago

We are seeing something similar, though we are using the Stackdriver output plugin rather than Splunk. The issue happens fairly regularly, and usually (though not always) the fluent-bit logs indicate some sort of network connectivity issue. In one particular case that we are experiencing right now, fluent-bit's final log message is from ~6 days ago:

[...]
[2022/08/24 23:21:16] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=33197 watch_fd=73
[2022/08/24 23:21:21] [ info] [input:tail:tail.0] inotify_fs_add(): inode=33099 watch_fd=74 name=/var/log/containers/neubot-jpjws_default_packet-headers-2b74d047e650441aa957533814deae7a97c0f7a2f99e6145e4b3fa99e661f824.log
[2022/08/24 23:33:04] [ warn] [net] getaddrinfo(host='kubernetes.default.svc', err=-2): Name or service not known
[2022/08/24 23:33:04] [error] [filter:kubernetes:kubernetes.0] kubelet upstream connection error
[2022/08/25 00:05:52] [ info] [oauth2] HTTP Status=200
[2022/08/25 00:05:52] [ info] [oauth2] access token from 'oauth2.googleapis.com:443' retrieved

There is the Name or service not known error, then a log message about a kubelet upstream connection error, then two normal log lines, then nothing else. Logs are not being pushed to Stackdriver, nor are logs being buffered on the local filesystem:

root@fluentbit-snrpx:/# ls -l /var/lib/fluent-bit/tail.0 
total 336
-rw------- 1 root root 36864 Aug 25 00:28 1-1661387272.398485676.flb
-rw------- 1 root root 36864 Aug 25 00:27 1-1661387273.29687021.flb
-rw------- 1 root root 36864 Aug 25 00:28 1-1661387273.29959922.flb
-rw------- 1 root root 69632 Aug 25 00:28 1-1661387273.69224462.flb
-rw------- 1 root root  4096 Aug 25 00:28 1-1661387274.286094189.flb
-rw------- 1 root root 36864 Aug 25 00:28 1-1661387274.286348922.flb
-rw------- 1 root root  4096 Aug 25 00:28 1-1661387274.322520535.flb
-rw------- 1 root root 36864 Aug 25 00:27 1-1661387274.616071575.flb
-rw------- 1 root root 36864 Aug 25 00:27 1-1661387275.182560142.flb
-rw------- 1 root root 36864 Aug 25 00:27 1-1661387278.915490389.flb
-rw------- 1 root root  4096 Aug 25 00:27 1-1661387279.215381844.flb
-rw------- 1 root root  4096 Aug 25 00:28 1-1661387282.16952810.flb

When I run strace against the fluent-bit process, this system call appears to run roughly every 1s, forever:

# strace -p 446487
strace: Process 446487 attached
restart_syscall(<... resuming interrupted read ...>) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7fff3b18f710) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7fff3b18f710) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7fff3b18f710) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7fff3b18f710) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7fff3b18f710) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7fff3b18f710) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=1, tv_nsec=0}, 0x7fff3b18f710) = 0
[...]

These circumstances are a bit different from some of the others we see, but the ultimate symptom is the same: fluent-bit stops sending logs to Stackdriver, and there are generally no useful log messages from fluent-bit other than ones indicating some sort of network disruption. The only way we have found to recover is to kill the pod.
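For reference, a hedged sketch of that recovery step (assuming the pod is managed by a DaemonSet or Deployment that recreates it; the namespace is a placeholder):

# Delete the stuck pod so its controller recreates it with a fresh fluent-bit
kubectl delete pod fluentbit-snrpx -n <namespace>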

Our fluent-bit configuration can be found on GitHub.

merveyilmaz-netrd commented 1 year ago

Hi @lecaros, were you able to find the root cause of the issue?

kingnarmer commented 1 year ago

Same issue here; I see clock_nanosleep calls in strace. I'd appreciate help on how to find the root cause of this issue.

JonZeolla commented 1 year ago

I'm having the same issue and seeing the same strace output... My output is loki, but that doesn't seem to be relevant.

I'm running fluent-bit as a secondary daemon process in a container, since my container runs in hosted pipelines (GitHub Actions, Bitbucket, etc.) and fluent-bit is used to get the logs out. When the primary process ends, the container shuts down and fluent-bit doesn't dequeue in time. I am looking for a way to ensure it dequeues entirely; that works properly when the log size is small, but not when a single log is a few KB.
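A hypothetical entrypoint sketch of one way to give fluent-bit a bounded window to drain before the container exits (the command names are assumptions; this only addresses the shutdown race, not the stuck-chunk behavior discussed above):

#!/bin/bash
# Run the primary workload, then wait (up to 30 s) for the background
# fluent-bit process to exit after asking it to shut down gracefully.
primary-process "$@"                            # placeholder for the pipeline's real main command
status=$?
kill -TERM "$(pidof fluent-bit)" 2>/dev/null    # ask fluent-bit to flush and stop
waited=0
while pidof fluent-bit >/dev/null && [ "$waited" -lt 30 ]; do
    sleep 1
    waited=$((waited + 1))
done
exit "$status"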

JonZeolla commented 1 year ago

It appears that fluent-bit is multi-threaded, so if you want to see everything it's doing with strace you need to add -f.
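For example (a hedged sketch; generic strace/pidof usage, not a command taken from this thread):

# -f follows all of fluent-bit's threads, not just the main loop shown earlier
strace -f -p "$(pidof fluent-bit)"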

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

yackushevas commented 1 year ago

Not stale.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

kingnarmer commented 1 year ago

I am still running into this. Appreciate help.

dtoni commented 1 year ago

This seems to be a very old issue that has been reported multiple times in different forms, and it still occurs even in the latest 2.1.7 version. I'm running the same configuration on both AKS and OCP (filesystem buffering with multiple ES outputs), and the issue is only present on OCP, so maybe my observation can help the investigation. If you look at the network traffic once the retries start, for some reason the TCP connections on OCP are no longer reopened once closed, which makes Fluent Bit stop sending logs to ES until the pod is restarted.
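A hedged sketch of how that can be observed from inside the pod (generic ss/tcpdump usage with an assumed Elasticsearch port of 9200; not commands taken from this thread):

# Current TCP connections owned by the fluent-bit process
ss -tnp | grep fluent-bit
# New outbound connection attempts (SYNs) toward the ES output
tcpdump -ni any 'tcp port 9200 and tcp[tcpflags] & tcp-syn != 0'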

kingnarmer commented 1 year ago

I am still running into this. Appreciate help.

dud225 commented 11 months ago

Hello, I'm also facing this issue on version 1.9.9. As noted in another issue, running strace -f mysteriously had the side effect of resuming Fluent Bit :unamused:

carlosrmendes commented 7 months ago

Facing the same with v2.2.1, any feedback?

lecaros commented 2 months ago

Hi there, can you confirm this is solved? @carlosrmendes @dud225 @kingnarmer @dtoni @yackushevas @JonZeolla @merveyilmaz-netrd