fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Fluent-bit got stuck when it lost its connection to fluentd, and still did not respond after fluentd resumed. #1022

Closed · hoperays closed this 2 years ago

hoperays commented 5 years ago

**Bug Report**

**Describe the bug**
Fluent-bit got stuck when it lost its connection to fluentd, and it still did not respond after fluentd resumed.

**To Reproduce**
No method of reproduction for now.

**Expected behavior**
Fluent-bit does not get stuck.

**Screenshots**
The fluent-bit log stopped at 2019/01/05 21:57:41.

```
[root@node-1 ~]# kubectl get pod -n openstack -o wide | grep fluentbit | grep node-1
fluentbit-rvrm8                                  1/1       Running   1          4d        10.10.1.3       node-1
[root@node-1 ~]# kubectl logs -n openstack fluentbit-rvrm8 --tail=10 -f
[2019/01/05 21:57:11] [error] [io] TCP connection failed: fluentd-logging:24224 (No route to host)
[2019/01/05 21:57:11] [error] [out_fw] no upstream connections available
[2019/01/05 21:57:11] [error] [io] TCP connection failed: fluentd-logging:24224 (No route to host)
[2019/01/05 21:57:11] [error] [out_fw] no upstream connections available
[2019/01/05 21:57:11] [error] [io] TCP connection failed: fluentd-logging:24224 (No route to host)
[2019/01/05 21:57:11] [error] [out_fw] no upstream connections available
[2019/01/05 21:57:21] [ warn] net_tcp_fd_connect: getaddrinfo(host='fluentd-logging'): Name or service not known
[2019/01/05 21:57:21] [error] [out_fw] no upstream connections available
[2019/01/05 21:57:41] [ warn] net_tcp_fd_connect: getaddrinfo(host='fluentd-logging'): Temporary failure in name resolution
[2019/01/05 21:57:41] [error] [out_fw] no upstream connections available
```

The kernel stack of the fluent-bit process was as follows; note that it is blocked inside pipe_write, i.e., waiting to write to a pipe that is full or has no reader:

```
[root@node-1 ~]# ps -ef | grep fluent
root     15799  9283  0 11:07 pts/8    00:00:00 grep --color=auto fluent
root     16211 16193  0 Jan05 ?       00:00:06 /fluent-bit/bin/fluent-bit -c /fluent-bit/etc/fluent-bit.conf
[root@node-1 ~]# cat /proc/16211/stack
[<ffffffff81209a40>] pipe_wait+0x70/0xc0
[<ffffffff81209ce9>] pipe_write+0x1f9/0x530
[<ffffffff811fffdd>] do_sync_write+0x8d/0xd0
[<ffffffff81200a9d>] vfs_write+0xbd/0x1e0
[<ffffffff812018af>] SyS_write+0x7f/0xe0
[<ffffffff816b50c9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
```

**Your Environment**

```
[INPUT]
    Buffer_Max_Size 2MB
    DB              /var/log/flb_kube.db
    DB.Sync         OFF
    Mem_Buf_Limit   5MB
    Name            tail
    Parser          docker
    Path            /var/log/containers/*.log
    Tag             ${HOSTNAME}.kube.*

[FILTER]
    Match          ${HOSTNAME}.kube.*
    Merge_JSON_Log true
    Name           kubernetes

[FILTER]
    Match  *
    Name   record_modifier
    Record hostname ${HOSTNAME}

[OUTPUT]
    Host        ${FLUENTD_HOST}
    Match       *
    Name        forward
    Port        ${FLUENTD_PORT}
    Retry_Limit False
```



* Environment name and version:
> - Kubernetes v1.9.8
* Operating System and version:
> - CentOS 7.4.1708
* Filters and plugins:
> - Input Plugin: Tail
> - Output Plugin: Forward

**Additional context**
For more information on the status at that time, please refer to the attached core file.
[core.16211.gz](https://github.com/fluent/fluent-bit/files/2749339/core.16211.gz)

I have restored the fluent-bit service by restarting it, but I would like to know the root cause of the fluent-bit process getting stuck. Sorry to bother you.
sergeyg-earnin commented 5 years ago

The same issue after we redeployed fluentd in our k8s cluster.

zhulinwei commented 4 years ago

The same issue after we redeployed fluentd in our k8s cluster...

zhulinwei commented 4 years ago

I found an interesting situation.

If I use `kubectl delete pod fluentd-pod`, fluent-bit will sometimes get stuck and lose its connection to fluentd, even after fluentd resumes.

But if I use `kubectl rollout restart deploy fluentd`, the problem does not happen.
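For reference, the two commands above behave differently because of how endpoints disappear; a minimal sketch with descriptive comments (the pod and deployment names are taken from the comment and may differ in your cluster):

```
# Abrupt deletion: the pod and its Service endpoint vanish at once,
# which is the case where fluent-bit was observed to get stuck.
kubectl delete pod fluentd-pod

# Rolling restart: pods are replaced one at a time, so a reachable
# endpoint remains available and the problem was not observed.
kubectl rollout restart deploy fluentd
```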

abhishek-sehgal954 commented 4 years ago

Hey, I am facing the same problem. I am dealing with some critical data. Has anyone found a workaround for this situation?

joezwlin commented 4 years ago

Hi, I got the same problem when one of the load balancer's hosts was temporarily unavailable. I'm trying to adjust `Retry_Limit` to see if it can be resolved.
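For context, `Retry_Limit` is a standard fluent-bit output option: an integer caps the number of retries per failed chunk, while `False` (as in the original report's config) means retry indefinitely. A minimal sketch of capping it, with placeholder host and port:

```
[OUTPUT]
    Name        forward
    Match       *
    Host        fluentd-logging
    Port        24224
    # Cap retries per failed chunk; "False" would disable the cap
    # and retry forever.
    Retry_Limit 5
```

Note that this trades endless retries for dropped chunks once the limit is hit, so it is a mitigation rather than a fix.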

tirelibirefe commented 4 years ago

Does your Fluentd have a working service listening on port 24224?

asaushkin commented 3 years ago

I ran into the same problem. Fluent-bit and Fluentd are running on EC2 instances. Fluent-bit couldn't recover after fluentd became temporarily unavailable.

LinkMaq commented 3 years ago

The same problem on fluent-bit 1.7.2. Fluent-bit and fluentd are deployed on Kubernetes, and fluent-bit forwards logs to fluentd through a headless service. This problem occurs very frequently.

```
[2021/04/02 08:50:29] [error] [net] TCP connection failed: cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240 (Connection timed out)
[2021/04/02 08:50:29] [error] [net] cannot connect to cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240
[2021/04/02 08:50:29] [error] [output:forward:forward.0] no upstream connections available
[2021/04/02 08:50:29] [ warn] [engine] failed to flush chunk '1-1617353299.120165002.flb', retry in 8 seconds: task_id=0, input=tail.0 > output=forward.0 (out_id=0)
[2021/04/02 08:51:04] [error] [net] TCP connection failed: cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240 (Connection timed out)
[2021/04/02 08:51:04] [error] [net] cannot connect to cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240
[2021/04/02 08:51:04] [error] [output:forward:forward.0] no upstream connections available
[2021/04/02 08:51:04] [ warn] [engine] failed to flush chunk '1-1617353334.496987689.flb', retry in 6 seconds: task_id=1, input=tail.0 > output=forward.0 (out_id=0)
[2021/04/02 08:51:44] [error] [net] TCP connection failed: cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240 (Connection timed out)
[2021/04/02 08:51:44] [error] [net] cannot connect to cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240
[2021/04/02 08:51:44] [error] [output:forward:forward.0] no upstream connections available
[2021/04/02 08:51:44] [ warn] [engine] failed to flush chunk '1-1617353374.267709871.flb', retry in 8 seconds: task_id=2, input=tail.0 > output=forward.0 (out_id=0)
[2021/04/02 08:52:44] [error] [net] TCP connection failed: cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240 (Connection timed out)
[2021/04/02 08:52:44] [error] [net] cannot connect to cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240
[2021/04/02 08:52:44] [error] [output:forward:forward.0] no upstream connections available
[2021/04/02 08:52:44] [ warn] [engine] failed to flush chunk '1-1617353434.367870814.flb', retry in 10 seconds: task_id=3, input=tail.0 > output=forward.0 (out_id=0)
[2021/04/02 08:53:15] [error] [net] TCP connection failed: cluster-fluentd-5.cluster-fluentd-headless.logging.svc.cluster.local:24240 (Connection timed out)
```
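Not a confirmed fix, but the 1.7 series also exposes per-output networking options that control connect timeouts and keepalive connection recycling, which are the mechanisms involved in the errors above. A sketch with illustrative values (host and port are placeholders based on the logs):

```
[OUTPUT]
    Name                       forward
    Match                      *
    Host                       cluster-fluentd-headless.logging.svc.cluster.local
    Port                       24240
    # Give up on a connection attempt after 10 seconds.
    net.connect_timeout        10
    # Keep connections alive between flushes, but drop idle ones so a
    # dead peer is not reused.
    net.keepalive              on
    net.keepalive_idle_timeout 30
```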
loguido commented 3 years ago

The same happens to me; the only way to solve it for now is to restart fluent-bit.

LynnTh commented 3 years ago

Same issue.

VincentQiu2018 commented 2 years ago

Same issue

leonardo-albertovich commented 2 years ago

If anyone here has a reliable reproduction and is able to perform some tests with me, contact me in the Fluent Slack and we'll find out the root of the issue.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

lecaros commented 2 years ago

Hi everyone, we've released a couple of fixes that handle connection loss and timeout scenarios in 1.8.15 and 1.9.1. I'm closing this issue now, but if you still see the problem, feel free to reopen it or open a new one. We'll gladly assist you further once you provide a repro scenario.
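For anyone on an affected version, the practical takeaway is to upgrade to 1.8.15 / 1.9.1 or later. A hedged example for the DaemonSet setup described earlier in this thread (the namespace, DaemonSet, and container names are assumptions based on the original report):

```
# Roll the fluent-bit DaemonSet to a release that contains the
# connection-loss and timeout fixes.
kubectl -n openstack set image daemonset/fluentbit fluent-bit=fluent/fluent-bit:1.9.1
```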