rossbishop opened this issue 1 year ago (status: Open)
Did I understand you correctly that even though you are getting those errors data is flowing? If that's the case then I wonder if that's due to time slice shifting. Could you try these two things individually and then combined?
- threaded on (added to the input plugin, forward)
- workers 1 (added to the output plugin, gelf)

That should greatly alleviate the pressure on the main thread and could give us some valuable insight. (A sketch of both settings follows below.)
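For reference, a minimal sketch of where those two settings would sit on the receiver side, reusing the forward input and gelf output from this report; the host and port values are placeholders, not taken from the issue:

[INPUT]
    name     forward
    listen   0.0.0.0
    port     24224
    threaded on

[OUTPUT]
    name     gelf
    match    *
    # placeholder host/port
    host     graylog.example.com
    port     12212
    mode     tls
    workers  1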
Hi Leonardo,
Thanks for the prompt reply, apologies for the delay I had a long weekend off!
So, I tried:
- threaded on and workers 1 set in the input and output plugins respectively
- threaded on set in the input plugin
- workers 1 set in the output plugin

I only tried this on the server/receiver side; I'm still experiencing the same errors.
In that case, the only thing that comes to mind is using kubeshark to capture the traffic, which would let us know whether those connection attempts are being aborted by the remote host due to a delayed handshake attempt, or what exactly is going on.
If you decide to capture the traffic, you can share those pcaps with me in private on Slack. I'll look at them, give you some feedback, and try to come up with the next step.
Was there any resolution on this? I'm seeing the same thing for fluent-bit running in a Nomad environment. We're presently running version 2.1.4.
I turned off all outputs to minimize the configuration.
Here is the INPUT configuration:
[INPUT]
Name forward
Listen 0.0.0.0
port 24224
threaded on
tls on
tls.debug 4
tls.verify off
tls.ca_file /fluent-bit/etc/ca.cert.pem
tls.crt_file /fluent-bit/etc/devl.cert.pem
tls.key_file /fluent-bit/etc/devl.key.pem
Here is a sample of the log output:
[2023/06/07 21:44:35] [error] [tls] error: unexpected EOF
[2023/06/07 21:44:35] [debug] [downstream] connection #55 failed
[2023/06/07 21:44:35] [error] [input:forward:forward.0] could not accept new connection
Disregard my issue. I found that my local Nomad logger was also logging to Fluent Bit, and it does not support TLS. That was the source of my errors. Once I added a non-TLS port for that traffic, the errors cleared up.
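In case it helps someone else, a minimal sketch of that workaround: keep the TLS listener as configured above and add a second, plain-TCP forward input on a separate port for the client that cannot do TLS. The second port number is just an example:

[INPUT]
    Name         forward
    Listen       0.0.0.0
    port         24224
    threaded     on
    tls          on
    tls.verify   off
    tls.ca_file  /fluent-bit/etc/ca.cert.pem
    tls.crt_file /fluent-bit/etc/devl.cert.pem
    tls.key_file /fluent-bit/etc/devl.key.pem

# separate non-TLS listener, e.g. for the local Nomad logger
[INPUT]
    Name   forward
    Listen 0.0.0.0
    port   24225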
I saw the same issue. It seems Fluent Bit's throughput simply drops when TLS is enabled on the forward input. As a result, the forward output creates many new connections for newer chunks because the existing connections are still in use, and the forward input then starts refusing new connections with the could not accept new connection error.
To prevent creating a large number of connections, set net.max_worker_connections to 20 or so on the forward output (the option was introduced in 2.1.6), though it might then cause a no upstream connections available error instead.
https://docs.fluentbit.io/manual/administration/networking#max-connections-per-worker
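For reference, a hedged sketch of where that option goes; the linked docs describe it as an upstream networking property, so it sits on the sending side's forward output (host and value are illustrative):

[OUTPUT]
    Name                       forward
    Match                      *
    # placeholder receiver address
    Host                       receiver.example.com
    Port                       24224
    tls                        on
    net.max_worker_connections 20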
Running into similar issues on 2.1.8 with TLS. Using default docker container, fluentd forwarding to fluentbit.
Config
[SERVICE]
log_level debug
[INPUT]
Name forward
Listen 0.0.0.0
Port 24002
Buffer_Chunk_Size 1M
Buffer_Max_Size 6M
tls on
tls.verify off
tls.crt_file /fluent-bit/etc/self_signed.crt
tls.key_file /fluent-bit/etc/self_signed.key
# [OUTPUT]
# Name stdout
# Match *
[OUTPUT]
Name kafka
Match *
Brokers kafka-1:9091,kafka-2:9092,kafka-3:9093
Topics kubernetes-main-ingress
Timestamp_Format iso8601
[2023/08/06 10:05:51] [debug] [out flush] cb_destroy coro_id=5
[2023/08/06 10:05:51] [debug] [task] destroy task=0x7fe6a2039aa0 (task_id=0)
[2023/08/06 10:05:52] [debug] [socket] could not validate socket status for #41 (don't worry)
[2023/08/06 10:05:53] [debug] [socket] could not validate socket status for #43 (don't worry)
[2023/08/06 10:05:55] [debug] [socket] could not validate socket status for #41 (don't worry)
[2023/08/06 10:05:56] [debug] [socket] could not validate socket status for #40 (don't worry)
[2023/08/06 10:06:01] [debug] [socket] could not validate socket status for #43 (don't worry)
[2023/08/06 10:06:01] [debug] [socket] could not validate socket status for #44 (don't worry)
[2023/08/06 10:06:02] [debug] [input chunk] update output instances with new chunk size diff=34495, records=28, input=forward.0
[2023/08/06 10:06:02] [debug] [socket] could not validate socket status for #43 (don't worry)
[2023/08/06 10:06:02] [debug] [task] created task=0x7fe6a2039780 id=0 OK
[2023/08/06 10:06:03] [debug] [socket] could not validate socket status for #44 (don't worry)
{"stream"=>"[2023/08/06 10:06:03] [debug] in produce_message
With the log level set to error:
Fluent Bit v2.1.8
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2023/08/06 10:29:47] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/06 10:29:47] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/06 10:29:48] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/06 10:29:48] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/06 10:29:51] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/06 10:29:51] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
Any updates on this? This is killing my production environment performance :)
@ict-one-nl could you paste the fluentd config, mainly the match statement?
I asked around, is this what you were asking for?
<label @xxxxx>
<match kubernetes.**>
@type tag_normaliser
@id flow:xxxxx:xxxxx:0
format ${namespace_name}.${pod_name}.${container_name}
</match>
<filter **>
@type parser
@id flow:xxxx:xxxxx:1
key_name message
remove_key_name_field true
reserve_data true
<parse>
@type json
</parse>
</filter>
<match **>
@type forward
@id flow:xxxx:xxxx:output:xxxx:xxxxx-logging
tls_allow_self_signed_cert true
tls_insecure_mode true
transport tls
<buffer tag,time>
@type file
chunk_limit_size 8MB
path /buffers/flow:xxx:xxx:output:xxx:xxxxxx.*.buffer
retry_forever true
timekey 10m
timekey_wait 1m
</buffer>
<server>
host xxxxxxxx.nl
port 24002
</server>
</match>
</label>
Thanks @ict-one-nl, I'm wondering if the buffer size needs to be larger on the Fluent Bit side to match the 8MB chunk limit you have there. You may want to try lowering that on the Fluentd side as well.
I have tried the larger buffer size:
[SERVICE]
log_level error
[INPUT]
Name forward
Listen 0.0.0.0
Port 24002
Buffer_Chunk_Size 8M
Buffer_Max_Size 128M
tls on
tls.verify off
tls.crt_file /fluent-bit/etc/self_signed.crt
tls.key_file /fluent-bit/etc/self_signed.key
[OUTPUT]
Name kafka
Match *
Brokers kafka-1:9091,kafka-2:9092,kafka-3:9093
Topics kubernetes-main-ingress
Timestamp_Format iso8601
# [OUTPUT]
# Name stdout
# Match *
[2023/08/23 14:52:31] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:52:34] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:52:34] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:52:36] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:52:36] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:52:38] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:52:38] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:52:44] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:52:44] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
Same result. Will try the lower chunk size as well
FluentD config:
<label @3131968136fc14962a3d7a781ba6abe4>
<match kubernetes.**>
@type tag_normaliser
@id flow:nginx-ingress:mxxx:0
format "${namespace_name}.${pod_name}.${container_name}"
</match>
<filter **>
@type parser
@id flow:nginx-ingress:mxxx:1
key_name "message"
remove_key_name_field true
reserve_data true
<parse>
@type "json"
</parse>
</filter>
<match **>
@type forward
@id flow:nginx-ingress:mxxx:output:nginx-ingress:xxxlogging
tls_allow_self_signed_cert true
tls_insecure_mode true
transport tls
<buffer tag,time>
@type "file"
chunk_limit_size 1MB
path "/buffers/flow:nginx-ingress:mxxx:output:nginx-ingress:xxxlogging.*.buffer"
retry_forever true
timekey 10m
timekey_wait 1m
</buffer>
<server>
host "xxxx"
port 24002
</server>
</match>
</label>
Fluent-bit config
[SERVICE]
log_level error
[INPUT]
Name forward
Listen 0.0.0.0
Port 24002
Buffer_Chunk_Size 1M
Buffer_Max_Size 128M
tls on
tls.verify off
tls.crt_file /fluent-bit/etc/self_signed.crt
tls.key_file /fluent-bit/etc/self_signed.key
[OUTPUT]
Name kafka
Match *
Brokers kafka-1:9091,kafka-2:9092,kafka-3:9093
Topics kubernetes-main-ingress
Timestamp_Format iso8601
# [OUTPUT]
# Name stdout
# Match *
[2023/08/23 14:55:44] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:55:44] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:55:50] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:55:50] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:56:03] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:56:03] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:56:06] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:56:06] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:56:06] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:56:06] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:56:15] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:56:15] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
Does seem to lower the error rate a bit, but no solution.
I observe the same issue with v2.1.8. How is your instance deployed? In my case it runs directly on a VM (Ubuntu 20.04).
This is the default fluent-bit container hosted in Docker on RHEL8
In my case, the setup is Fluent-bit_1 (on external k8s; Forward output plugin) -> Fluent-bit_2 (on Azure VM; Forward input plugin + Kafka output plugin) -> Kafka... Some (?) logs are flowing, although the Fluent-bit_2 instance shows repetitive errors:
[2023/09/11 09:06:50] [error] [/tmp/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/09/11 09:06:50] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
with occasional:
[2023/09/12 18:00:14] [error] [tls] error: unexpected EOF
[2023/09/12 18:00:14] [error] [input:forward:forward.1] could not accept new connection
Not much happening here, I see... :( If I can provide any more info that would be useful for understanding the TLS issue, please advise. Updating to 2.2.0 didn't help.
I'm sorry to say, but we have moved away from fluentbit for most use cases because of this and because solving it is taking quite long.
I'm sorry to say the same. I deployed nginx servers as a reverse proxy to terminate TLS instead. It has been very stable so far.
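For anyone weighing the same workaround, a minimal sketch of TLS termination with the nginx stream module in front of a plain-TCP forward input; the ports and certificate paths are illustrative:

stream {
    server {
        listen 24224 ssl;
        ssl_certificate     /etc/nginx/certs/server.crt;
        ssl_certificate_key /etc/nginx/certs/server.key;

        # plain-TCP forward input on the local Fluent Bit instance
        proxy_pass 127.0.0.1:24225;
    }
}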
Solved in my case by lowering net.keepalive_idle_timeout (in my case to 30 sec).
My guess is that Fluent Bit assumed the connections were still alive, while the server side had already discarded them.
Well, 30 is supposed to be the default, isn't it? https://docs.fluentbit.io/manual/administration/networking
My bad... 30 sec was the original timeout; I lowered it to 10 sec.
Anyway, I'm still not sure whether the error and net.keepalive_idle_timeout are related or not...
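For context, a sketch of where that setting lives; it is a per-output networking property on the sending side, and the values here just mirror what was tried above (the host is a placeholder):

[OUTPUT]
    Name                       forward
    Match                      *
    # placeholder receiver address
    Host                       receiver.example.com
    Port                       24224
    tls                        on
    net.keepalive              on
    net.keepalive_idle_timeout 10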
I faced the same issue recently. My Fluent Bit pods were running behind a Kubernetes load balancer which was sending health probes. These health probes were causing the "[error] [tls] error: unexpected EOF" errors. To fix this, I set externalTrafficPolicy to Local and updated the healthCheckNodePort, which makes the Kubernetes LB send its health probes to a separate port. Refer to this for configuration: https://kubernetes.io/docs/tasks/access-application-cluster/create-external-load-balancer/#preserving-the-client-source-ip
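A hedged sketch of the Service change described above; the names, ports, and node port are illustrative, not taken from this issue:

apiVersion: v1
kind: Service
metadata:
  name: fluent-bit-forward
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # keep LB health probes off the TLS forward port
  healthCheckNodePort: 30224     # dedicated node port for the health checks
  selector:
    app: fluent-bit
  ports:
    - name: forward
      protocol: TCP
      port: 24224
      targetPort: 24224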
This is happening to us too. In logs I see:
[2024/01/17 12:39:26] [error] [/home/vagrant/source/fluent-bit/fluent-bit-2.2.0/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2024/01/17 12:39:26] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2024/01/17 12:52:20] [error] [/home/vagrant/source/fluent-bit/fluent-bit-2.2.0/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2024/01/17 12:52:20] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2024/01/17 12:54:22] [error] [/home/vagrant/source/fluent-bit/fluent-bit-2.2.0/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2024/01/17 12:54:22] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2024/01/17 13:09:25] [error] [/home/vagrant/source/fluent-bit/fluent-bit-2.2.0/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2024/01/17 13:09:25] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2024/01/17 13:09:25] [error] [/home/vagrant/source/fluent-bit/fluent-bit-2.2.0/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2024/01/17 13:09:25] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
We have connections from both fluent-bit and fluentd; however, I am currently not able to say which one this originates from.
Well, I have just noticed that a lot of data is in fact missing.
If I add require_ack_response true on the fluentd side, the data starts flowing and the error disappears from the Fluent Bit logs. However, CPU usage on the fluentd side rises a lot, probably because it does not get the ack response and has to resend the messages that were not accepted (just a guess). So that suggests there really is something wrong on Fluent Bit's side.
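For reference, a sketch of where that flag sits in a fluentd forward output, modelled on the <match> blocks shared earlier in this thread (the host is a placeholder):

<match **>
  @type forward
  require_ack_response true
  transport tls
  tls_insecure_mode true
  <server>
    host receiver.example.com
    port 24002
  </server>
</match>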
Could someone please look into this? It seems like quite a problem; in our case it affects all metrics coming from fluentd.
We had to use the http plugins instead of forward for fluentd -> fluentbit communication; that works. I would recommend the same, because one or the other is doing something wrong, and given that it involves two separate projects and how long this issue has been open, it doesn't seem likely to be solved soon. However, fluentbit -> fluentbit works in our case with both the forward and http plugins. For fluentd -> fluentbit, when swapping the forward plugin for the http plugin, it is necessary to update the fluentd conf like this:
<format>
@type json
</format>
json_array true
and to prepend the logs with a filter similar to this:
<filter **>
@type record_transformer
<record>
tag my_server_pretag.${tag}
</record>
</filter>
Then on the fluentbit side you just add this to the config:
tag_key tag
Don't forget to set the endpoint address with httpS, which I overlooked at first and which is hard to debug.
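A minimal sketch of what the Fluent Bit side of this http workaround could look like; the tag_key line matches the hint above, while the port and certificate paths are illustrative:

[INPUT]
    name         http
    listen       0.0.0.0
    port         9880
    tag_key      tag
    tls          on
    tls.verify   off
    tls.crt_file /fluent-bit/etc/self_signed.crt
    tls.key_file /fluent-bit/etc/self_signed.key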
We came across this error when playing around with the keep_alive settings. We purposely increased it beyond the Load Balancer keep-alive, and we got that error when, I believe, we went past the 60 sec LB timeout and it closed the connection.
As a slight update to the OP: we're still trying to track down the source of the error, but we managed to clear a whole lot of the entries using ksniff (kubeshark was a bit too invasive for our taste). We then identified that our Prometheus tags/annotations for the fluent-bit server instance were misconfigured and Prometheus was trying to scrape that endpoint. That cleared a huge chunk of the errors for us, but we're still trying to figure out the source of the few remaining entries.
We've gone ahead and enabled metrics and have been monitoring our setup. We got some new insights:

- Adjusting scheduler.base, the Output net.connect_timeout, and Workers lessened the amount of retries, and so far we've yet to spot any dropped messages. However, we're still witnessing TLS errors in the same timeframe as these retries (the sketch below shows where these settings live).

Based on the above it seems there's something amiss when the two fluent-bit instances end up terminating the connection and go for a retry. Any suggestions on what we could do next to try and help address the issue?
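For context, a sketch of where the three knobs mentioned above live; the values are illustrative, not the ones used in this report:

[SERVICE]
    # base of the exponential backoff used by the retry scheduler
    scheduler.base 5

[OUTPUT]
    Name                forward
    Match               *
    # placeholder receiver address
    Host                receiver.example.com
    Port                24224
    tls                 on
    Workers             2
    net.connect_timeout 30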
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
still open
Facing a similar issue. The sender app (Teleport) is using the fluentd library with the following config:
ca = "/opt/event-handler/ca.crt"
cert = "/opt/event-handler/client.crt"
key = "/opt/event-handler/client.key"
url = "https://localhost:8888/test.log"
session-url = "https://localhost:8888/session"
However, the http input plugin on fluent-bit is unable to accept the connection:
[2024/09/09 11:01:00] [error] [tls] error: unexpected EOF
[2024/09/09 11:01:00] [debug] [downstream] connection #51 failed
Even when sending sample data via curl, the data is accepted, but the logs show:
[2024/09/09 11:12:55] [error] [/tmp/fluent-bit/src/tls/openssl.c:551 errno=0] Success
[2024/09/09 11:12:55] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2024/09/09 11:12:55] [debug] [socket] could not validate socket status for #51 (don't worry)
[2024/09/09 11:12:55] [debug] [task] created task=0x7fba6d036640 id=0 OK
[2024/09/09 11:12:55] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] test.log: [[1725880375.021842822, {}], {"json"=>"{"foo":"bar"}"}]
[2024/09/09 11:12:55] [debug] [out flush] cb_destroy coro_id=1
[2024/09/09 11:12:55] [debug] [task] destroy task=0x7fba6d036640 (task_id=0)
fluent-bit input config:
[INPUT]
name http
listen 0.0.0.0
port 8888
threaded On
tls On
tls.verify Off
tls.debug 4
tls.ca_file /opt/event-handler/ca.crt
tls.crt_file /opt/event-handler/server.crt
tls.key_file /opt/event-handler/server.key
tls.key_passwd xxxxxx
Any pointers to solve this will be helpful.
@LukasJerabek I'm aiming for a similar setup to yours: fluentd -> fluent-bit over http with mTLS.
I'm also considering another use case where fluentbit is a better fit than Vector (wineventlog). But the fact that this still hasn't been fixed is holding us back and is worrisome. There has been a whole new stable release in the meantime, and this is not some small bug; HTTP over TLS is a very common scenario for forwarding logs.
I am also seeing this. I have a .NET application posting high volumes of data to the HTTP endpoint with TLS enabled.
We see dropped data and slow processing from fluentbit, with timeouts on the client side.
Bug Report
Describe the bug: Fluent-bit produces a large number of TLS/connection errors in its logs when TLS is enabled with the forward input plugin.
The use case is one instance of fluent-bit running inside EC2 outputting logs to a receiver fluent-bit instance running inside a kube cluster to securely forward messages into graylog.
Observations:
- [debug] [downstream] connection #51 failed
- [debug] [socket] could not validate socket status for #52 (don't worry)
To Reproduce: Example log messages:
Occasionally:
Expected behavior: Fluent-bit doesn't spew TLS errors.
Your Environment
Receiver (Fluent Bit in the kube cluster):

[INPUT]
name forward
listen 0.0.0.0
port 24224
tls on
tls.debug 4
tls.verify on
tls.crt_file /etc/tls/fluent-bit-ingress-tls/tls.crt
tls.key_file /etc/tls/fluent-bit-ingress-tls/tls.key
storage.type filesystem

[OUTPUT]
Name gelf
Match *
Host ~URL omitted~
Port 12212
Mode tls
tls On
tls.verify Off
tls.ca_file /fluent-bit/etc/ca.crt
tls.vhost ~URL omitted~
Gelf_Short_Message_Key message
Gelf_Host_Key container_name
storage.total_limit_size 256MB

Sender (Fluent Bit on EC2):

[SERVICE]
parsers_file /fluent-bit/etc/parsers.conf

[INPUT]
name forward
listen 0.0.0.0
port 24224

[OUTPUT]
Name stdout
Format json_lines
Match OUTPUT

[OUTPUT]
Name forward
Match OUTPUT
Host ~URL omitted~
Port 24224
tls on
tls.verify on
tls.ca_file /etc/fluent-bit/ca.crt