fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Fluent-Bit Upstream configuration Issues #8841

Open jganeshpai1994 opened 2 months ago

jganeshpai1994 commented 2 months ago

Fluent-Bit Upstream Server Issue

Fluent Bit is not reliable when used with upstream servers. If one of the nodes goes down, Fluent Bit does not retry the failed chunk against the other nodes that are still live. The documentation says that upstream nodes are used in round-robin fashion, but chunks are retried against the same dead host. Because of this, we had to put a load balancer in front of the nodes and point the forward plugin at it, which mitigated the problem.
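A minimal sketch of the load-balancer workaround described above, assuming a single load-balancer endpoint in front of the Fluentd nodes. The host name `fluentd-lb.example.internal` is a hypothetical placeholder, the port is assumed to match the one the nodes listen on, and TLS settings are omitted. The `Upstream` key is replaced by `Host` and `Port`, so Fluent Bit sees one endpoint and the load balancer picks the node.

```
# Sketch only: the host name is a placeholder and the port is an assumption.
[OUTPUT]
    Name        forward
    Match       *
    Host        fluentd-lb.example.internal
    Port        5043
    Retry_Limit False
```

With this layout, skipping a dead node becomes the load balancer's job rather than Fluent Bit's.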

To Reproduce

    [INPUT]
        Name            tail
        path            /log/access_log.json
        tag             acesss-log
        Key             message
        Read_from_Head  true
        Path_Key        log.file.path
        DB              /var/db/offset
        Mem_Buf_Limit   5MB
        storage.type    filesystem
        Buffer_Max_Size 128k
        ....

    [OUTPUT]
        Name        forward
        Match       *
        Upstream    upstream.conf
        Retry_Limit False

- Upstream Configuration

    [UPSTREAM]
        name forward-balancing

    [NODE]
        name                     node-1
        host                     node1
        port                     5043
        tls                      on
        tls.verify               off
        tls.ca_file              /etc/td-agent-bit/certs_dev/root-ca.pem
        tls.crt_file             /etc/td-agent-bit/certs_dev/fluent-bit.crt
        tls.key_file             /etc/td-agent-bit/certs_dev/fluent-bit.key
        Retry_Limit              False
        storage.total_limit_size 1G

    [NODE]
        name                     node-2
        host                     node2
        port                     5043
        tls                      on
        tls.verify               off
        tls.ca_file              /etc/td-agent-bit/certs_dev/root-ca.pem
        tls.crt_file             /etc/td-agent-bit/certs_dev/fluent-bit.crt
        tls.key_file             /etc/td-agent-bit/certs_dev/fluent-bit.key
        Retry_Limit              False
        storage.total_limit_size 1G

    [NODE]
        name                     node-3
        host                     node3
        port                     5043
        tls                      on
        tls.verify               off
        tls.ca_file              /etc/td-agent-bit/certs_dev/root-ca.pem
        tls.crt_file             /etc/td-agent-bit/certs_dev/fluent-bit.crt
        tls.key_file             /etc/td-agent-bit/certs_dev/fluent-bit.key
        Retry_Limit              False
        storage.total_limit_size 1G


- Steps to reproduce the problem:
Use the above config, generate some logs into /log/access_log.json with a script, and check that the data is available in the UI. Then take one of the upstream nodes down and watch where the retries go.
In our use case we send data to Fluentd, hosted in Docker with its ports published on the hosts, and Fluentd routes it to OpenSearch.
Route - Fluent Bit (Linux machine) -> Fluentd (Docker) -> OpenSearch (Docker)

**Expected behavior**
The behaviour is described in the docs - [Fluent Bit Upstream server](https://docs.fluentbit.io/manual/administration/configuring-fluent-bit/classic-mode/upstream-servers).
The data is sent in round-robin fashion.
If a node is down, the chunk is retried when Retry_Limit is set to False or no_limits. The retries are happening, but against the server that is down rather than against the servers that are up.

**Screenshots**
<img width="1235" alt="image" src="https://github.com/fluent/fluent-bit/assets/29899440/bcda1117-bc46-4d60-8e08-d54c574c590d">
The above image shows td-agent-bit going into a retry loop against the dead node without trying the other nodes.

[fluent-bit.log](https://github.com/fluent/fluent-bit/files/15388493/fluent-bit.log)
Fluent Bit Log File

**Your Environment**
* Version used: 3.0.3
* Configuration: Mentioned above 
* Environment name and version (e.g. Kubernetes? What version?): OEL 7.9
* Server type and version: Oracle Linux 7.9
* Operating System and version: Oracle Linux 
* Filters and plugins: Basic Input File plugin and Forward Output Plugin mentioned above

**Additional context**
- This issue has prevented us from sending data.
- The only way we could handle it was with an external network load balancer.

jganeshpai1994 commented 2 months ago

@edsiper Let me know if you need more details on the above issue.