rossbishop opened this issue 1 year ago (status: Open)
Did I understand you correctly that even though you are getting those errors data is flowing? If that's the case then I wonder if that's due to time slice shifting. Could you try these two things individually and then combined?
- threaded on (added to the input plugin, forward)
- workers 1 (added to the output plugin, gelf)

That should greatly alleviate the pressure on the main thread and could give us some valuable insight. (A sketch of both settings follows below.)
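For reference, a minimal sketch of where those two settings would sit on the receiver side, reusing the forward input and gelf output from this report; the host and port values are placeholders, not taken from the issue:

[INPUT]
    name     forward
    listen   0.0.0.0
    port     24224
    threaded on

[OUTPUT]
    name     gelf
    match    *
    # placeholder host/port
    host     graylog.example.com
    port     12212
    mode     tls
    workers  1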
Hi Leonardo,
Thanks for the prompt reply, apologies for the delay I had a long weekend off!
So, I tried:
- threaded on and workers 1 set in the input and output plugins respectively
- threaded on set in the input plugin
- workers 1 set in the output plugin

I only tried this on the server/receiver side; I'm still experiencing the same errors.
In that case, the only thing that comes to mind is using kubeshark to capture the traffic, which would let us know whether those connection attempts are being aborted by the remote host due to a delayed handshake attempt, or what exactly is going on.
If you decide to capture the traffic, you can share those pcaps with me in private on Slack. I'll look at them, give you some feedback, and try to come up with the next step.
Was there any resolution on this? I'm seeing the same thing for fluent-bit running in a Nomad environment. We're presently running version 2.1.4.
I turned off all outputs to minimize the configuration.
Here is the INPUT configuration:
[INPUT]
Name forward
Listen 0.0.0.0
port 24224
threaded on
tls on
tls.debug 4
tls.verify off
tls.ca_file /fluent-bit/etc/ca.cert.pem
tls.crt_file /fluent-bit/etc/devl.cert.pem
tls.key_file /fluent-bit/etc/devl.key.pem
Here is a sample of the log output:
[2023/06/07 21:44:35] [error] [tls] error: unexpected EOF
[2023/06/07 21:44:35] [debug] [downstream] connection #55 failed
[2023/06/07 21:44:35] [error] [input:forward:forward.0] could not accept new connection
Disregard my issue. I found that my local Nomad logger was also logging to Fluent Bit, and it does not support TLS. That was the source of my errors. Once I added a non-TLS port for that traffic, the errors cleared up.
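In case it helps someone else, a minimal sketch of that workaround: keep the TLS listener as configured above and add a second, plain-TCP forward input on a separate port for the client that cannot do TLS. The second port number is just an example:

[INPUT]
    Name         forward
    Listen       0.0.0.0
    port         24224
    threaded     on
    tls          on
    tls.verify   off
    tls.ca_file  /fluent-bit/etc/ca.cert.pem
    tls.crt_file /fluent-bit/etc/devl.cert.pem
    tls.key_file /fluent-bit/etc/devl.key.pem

# separate non-TLS listener, e.g. for the local Nomad logger
[INPUT]
    Name   forward
    Listen 0.0.0.0
    port   24225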
I saw the same issue. It seems Fluent Bit's throughput simply drops when TLS is enabled on the forward input. As a result, the forward output creates many new connections for newer chunks because the existing connections are still in use, and the forward input then starts refusing new connections with the could not accept new connection error.
To prevent creating a large number of connections, set net.max_worker_connections to 20 or so on the forward output (the option was introduced in 2.1.6), though it might then cause a no upstream connections available error instead.
https://docs.fluentbit.io/manual/administration/networking#max-connections-per-worker
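For reference, a hedged sketch of where that option goes; the linked docs describe it as an upstream networking property, so it sits on the sending side's forward output (host and value are illustrative):

[OUTPUT]
    Name                       forward
    Match                      *
    # placeholder receiver address
    Host                       receiver.example.com
    Port                       24224
    tls                        on
    net.max_worker_connections 20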
Running into similar issues on 2.1.8 with TLS. Using default docker container, fluentd forwarding to fluentbit.
Config
[SERVICE]
log_level debug
[INPUT]
Name forward
Listen 0.0.0.0
Port 24002
Buffer_Chunk_Size 1M
Buffer_Max_Size 6M
tls on
tls.verify off
tls.crt_file /fluent-bit/etc/self_signed.crt
tls.key_file /fluent-bit/etc/self_signed.key
# [OUTPUT]
# Name stdout
# Match *
[OUTPUT]
Name kafka
Match *
Brokers kafka-1:9091,kafka-2:9092,kafka-3:9093
Topics kubernetes-main-ingress
Timestamp_Format iso8601
[2023/08/06 10:05:51] [debug] [out flush] cb_destroy coro_id=5
[2023/08/06 10:05:51] [debug] [task] destroy task=0x7fe6a2039aa0 (task_id=0)
[2023/08/06 10:05:52] [debug] [socket] could not validate socket status for #41 (don't worry)
[2023/08/06 10:05:53] [debug] [socket] could not validate socket status for #43 (don't worry)
[2023/08/06 10:05:55] [debug] [socket] could not validate socket status for #41 (don't worry)
[2023/08/06 10:05:56] [debug] [socket] could not validate socket status for #40 (don't worry)
[2023/08/06 10:06:01] [debug] [socket] could not validate socket status for #43 (don't worry)
[2023/08/06 10:06:01] [debug] [socket] could not validate socket status for #44 (don't worry)
[2023/08/06 10:06:02] [debug] [input chunk] update output instances with new chunk size diff=34495, records=28, input=forward.0
[2023/08/06 10:06:02] [debug] [socket] could not validate socket status for #43 (don't worry)
[2023/08/06 10:06:02] [debug] [task] created task=0x7fe6a2039780 id=0 OK
[2023/08/06 10:06:03] [debug] [socket] could not validate socket status for #44 (don't worry)
{"stream"=>"[2023/08/06 10:06:03] [debug] in produce_message
With the log level set to error:
Fluent Bit v2.1.8
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2023/08/06 10:29:47] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/06 10:29:47] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/06 10:29:48] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/06 10:29:48] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/06 10:29:51] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/06 10:29:51] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
Any updates on this? This is killing my production environment performance :)
@ict-one-nl could you paste the fluentd config, mainly the match statement?
I asked around, is this what you were asking for?
<label @xxxxx>
<match kubernetes.**>
@type tag_normaliser
@id flow:xxxxx:xxxxx:0
format ${namespace_name}.${pod_name}.${container_name}
</match>
<filter **>
@type parser
@id flow:xxxx:xxxxx:1
key_name message
remove_key_name_field true
reserve_data true
<parse>
@type json
</parse>
</filter>
<match **>
@type forward
@id flow:xxxx:xxxx:output:xxxx:xxxxx-logging
tls_allow_self_signed_cert true
tls_insecure_mode true
transport tls
<buffer tag,time>
@type file
chunk_limit_size 8MB
path /buffers/flow:xxx:xxx:output:xxx:xxxxxx.*.buffer
retry_forever true
timekey 10m
timekey_wait 1m
</buffer>
<server>
host xxxxxxxx.nl
port 24002
</server>
</match>
</label>
Thanks @ict-one-nl, I'm wondering if the buffer size needs to be larger on the Fluent Bit side to match the 8MB chunk limit you have there. You may want to try lowering that on the Fluentd side as well.
I have tried the larger buffer size:
[SERVICE]
log_level error
[INPUT]
Name forward
Listen 0.0.0.0
Port 24002
Buffer_Chunk_Size 8M
Buffer_Max_Size 128M
tls on
tls.verify off
tls.crt_file /fluent-bit/etc/self_signed.crt
tls.key_file /fluent-bit/etc/self_signed.key
[OUTPUT]
Name kafka
Match *
Brokers kafka-1:9091,kafka-2:9092,kafka-3:9093
Topics kubernetes-main-ingress
Timestamp_Format iso8601
# [OUTPUT]
# Name stdout
# Match *
[2023/08/23 14:52:31] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:52:34] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:52:34] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:52:36] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:52:36] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:52:38] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:52:38] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:52:44] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:52:44] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
Same result. Will try the lower chunk size as well
FluentD config:
<label @3131968136fc14962a3d7a781ba6abe4>
<match kubernetes.**>
@type tag_normaliser
@id flow:nginx-ingress:mxxx:0
format "${namespace_name}.${pod_name}.${container_name}"
</match>
<filter **>
@type parser
@id flow:nginx-ingress:mxxx:1
key_name "message"
remove_key_name_field true
reserve_data true
<parse>
@type "json"
</parse>
</filter>
<match **>
@type forward
@id flow:nginx-ingress:mxxx:output:nginx-ingress:xxxlogging
tls_allow_self_signed_cert true
tls_insecure_mode true
transport tls
<buffer tag,time>
@type "file"
chunk_limit_size 1MB
path "/buffers/flow:nginx-ingress:mxxx:output:nginx-ingress:xxxlogging.*.buffer"
retry_forever true
timekey 10m
timekey_wait 1m
</buffer>
<server>
host "xxxx"
port 24002
</server>
</match>
</label>
Fluent-bit config
[SERVICE]
log_level error
[INPUT]
Name forward
Listen 0.0.0.0
Port 24002
Buffer_Chunk_Size 1M
Buffer_Max_Size 128M
tls on
tls.verify off
tls.crt_file /fluent-bit/etc/self_signed.crt
tls.key_file /fluent-bit/etc/self_signed.key
[OUTPUT]
Name kafka
Match *
Brokers kafka-1:9091,kafka-2:9092,kafka-3:9093
Topics kubernetes-main-ingress
Timestamp_Format iso8601
# [OUTPUT]
# Name stdout
# Match *
[2023/08/23 14:55:44] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:55:44] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:55:50] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:55:50] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:56:03] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:56:03] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:56:06] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:56:06] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:56:06] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:56:06] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2023/08/23 14:56:15] [error] [/src/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/08/23 14:56:15] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
Does seem to lower the error rate a bit, but no solution.
I observe the same issue with v2.1.8. How is your instance deployed? In my case it runs directly on a VM (Ubuntu 20.04).
This is the default fluent-bit container hosted in Docker on RHEL8
In my case, the setup is Fluent-bit_1 (on external k8s; Forward output plugin) -> Fluent-bit_2 (on Azure VM; Forward input plugin + Kafka output plugin) -> Kafka... Some (?) logs are flowing, although the Fluent-bit_2 instance shows repetitive errors:
[2023/09/11 09:06:50] [error] [/tmp/fluent-bit/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2023/09/11 09:06:50] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
with occasional:
[2023/09/12 18:00:14] [error] [tls] error: unexpected EOF
[2023/09/12 18:00:14] [error] [input:forward:forward.1] could not accept new connection
Not much happening here, I see... :( If I can provide any more info that would be useful for understanding the TLS issue, please advise. Updating to 2.2.0 didn't help.
I'm sorry to say, but we have moved away from fluentbit for most use cases because of this and because solving it is taking quite long.
I'm sorry to say the same. I deployed nginx servers as a reverse proxy to terminate TLS instead. It has been very stable so far.
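For anyone weighing the same workaround, a minimal sketch of TLS termination with the nginx stream module in front of a plain-TCP forward input; the ports and certificate paths are illustrative:

stream {
    server {
        listen 24224 ssl;
        ssl_certificate     /etc/nginx/certs/server.crt;
        ssl_certificate_key /etc/nginx/certs/server.key;

        # plain-TCP forward input on the local Fluent Bit instance
        proxy_pass 127.0.0.1:24225;
    }
}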
Solved in my case by lowering net.keepalive_idle_timeout (in my case to 30 sec).
My guess is that Fluent Bit assumed the connections were still alive, while the server side had already discarded them.
Well, 30 is supposed to be the default, isn't it? https://docs.fluentbit.io/manual/administration/networking
My bad... 30 sec was the original timeout; I lowered it to 10 sec.
Anyway, I'm still not sure whether the error and net.keepalive_idle_timeout are related or not...
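For context, a sketch of where that setting lives; it is a per-output networking property on the sending side, and the values here just mirror what was tried above (the host is a placeholder):

[OUTPUT]
    Name                       forward
    Match                      *
    # placeholder receiver address
    Host                       receiver.example.com
    Port                       24224
    tls                        on
    net.keepalive              on
    net.keepalive_idle_timeout 10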
I faced the same issue recently. My Fluent Bit pods were running behind a Kubernetes load balancer which was sending health probes. These health probes were causing the "[error] [tls] error: unexpected EOF" errors. To fix this, I set externalTrafficPolicy to Local and updated the healthCheckNodePort, which makes the Kubernetes LB send its health probes to a separate port. Refer to this for configuration: https://kubernetes.io/docs/tasks/access-application-cluster/create-external-load-balancer/#preserving-the-client-source-ip
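A hedged sketch of the Service change described above; the names, ports, and node port are illustrative, not taken from this issue:

apiVersion: v1
kind: Service
metadata:
  name: fluent-bit-forward
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # keep LB health probes off the TLS forward port
  healthCheckNodePort: 30224     # dedicated node port for the health checks
  selector:
    app: fluent-bit
  ports:
    - name: forward
      protocol: TCP
      port: 24224
      targetPort: 24224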
This is happening to us too. In logs I see:
[2024/01/17 12:39:26] [error] [/home/vagrant/source/fluent-bit/fluent-bit-2.2.0/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2024/01/17 12:39:26] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2024/01/17 12:52:20] [error] [/home/vagrant/source/fluent-bit/fluent-bit-2.2.0/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2024/01/17 12:52:20] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2024/01/17 12:54:22] [error] [/home/vagrant/source/fluent-bit/fluent-bit-2.2.0/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2024/01/17 12:54:22] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2024/01/17 13:09:25] [error] [/home/vagrant/source/fluent-bit/fluent-bit-2.2.0/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2024/01/17 13:09:25] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2024/01/17 13:09:25] [error] [/home/vagrant/source/fluent-bit/fluent-bit-2.2.0/src/tls/openssl.c:433 errno=104] Connection reset by peer
[2024/01/17 13:09:25] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
We have connections from both fluent-bit and fluentd; however, I am currently not able to say which one this originates from.
Well, I have just noticed that a lot of data is in fact missing.
If I add require_ack_response true on the fluentd side, the data starts flowing and the error disappears from the Fluent Bit logs. However, CPU usage on the fluentd side rises a lot, probably because it does not get the ack response and has to resend the messages that were not accepted (just a guess). So that suggests there really is something wrong on Fluent Bit's side.
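For reference, a sketch of where that flag sits in a fluentd forward output, modelled on the <match> blocks shared earlier in this thread (the host is a placeholder):

<match **>
  @type forward
  require_ack_response true
  transport tls
  tls_insecure_mode true
  <server>
    host receiver.example.com
    port 24002
  </server>
</match>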
Could someone please look into this? It seems like quite a problem; in our case it affects all metrics coming from fluentd.
We had to use the http plugins instead of forward for fluentd -> fluentbit communication; that works. I would recommend the same, because one or the other is doing something wrong, and given that it involves two separate projects and how long this issue has been open, it doesn't seem likely to be solved soon. However, fluentbit -> fluentbit works in our case with both the forward and http plugins. For fluentd -> fluentbit, when swapping the forward plugin for the http plugin, it is necessary to update the fluentd conf like this:
<format>
@type json
</format>
json_array true
and to prepend the logs with a filter similar to this:
<filter **>
@type record_transformer
<record>
tag my_server_pretag.${tag}
</record>
</filter>
Then on the fluentbit side you just add this to the config:
tag_key tag
Don't forget to set the endpoint address with httpS, which I overlooked at first and which is hard to debug.
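A minimal sketch of what the Fluent Bit side of this http workaround could look like; the tag_key line matches the hint above, while the port and certificate paths are illustrative:

[INPUT]
    name         http
    listen       0.0.0.0
    port         9880
    tag_key      tag
    tls          on
    tls.verify   off
    tls.crt_file /fluent-bit/etc/self_signed.crt
    tls.key_file /fluent-bit/etc/self_signed.key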
We came across this error when playing around with the keep_alive settings. We purposely increased it beyond the Load Balancer keep-alive, and we got that error when, I believe, we went past the 60 sec LB timeout and it closed the connection.
As a slight update to the OP: we're still trying to track down the source of the error, but we managed to clear a whole lot of the entries using ksniff (kubeshark was a bit too invasive for our taste). We then identified that our Prometheus tags/annotations for the fluent-bit server instance were misconfigured and Prometheus was trying to scrape that endpoint. That cleared a huge chunk of the errors for us, but we're still trying to figure out the source of the few remaining entries.
We've gone ahead and enabled metrics and have been monitoring our setup. We got some new insights:

- Adjusting scheduler.base, the Output net.connect_timeout, and Workers lessened the amount of retries, and so far we've yet to spot any dropped messages. However, we're still witnessing TLS errors in the same timeframe as these retries (the sketch below shows where these settings live).

Based on the above it seems there's something amiss when the two fluent-bit instances end up terminating the connection and go for a retry. Any suggestions on what we could do next to try and help address the issue?
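For context, a sketch of where the three knobs mentioned above live; the values are illustrative, not the ones used in this report:

[SERVICE]
    # base of the exponential backoff used by the retry scheduler
    scheduler.base 5

[OUTPUT]
    Name                forward
    Match               *
    # placeholder receiver address
    Host                receiver.example.com
    Port                24224
    tls                 on
    Workers             2
    net.connect_timeout 30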
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
still open
Facing a similar issue. The sender app (Teleport) is using the fluentd library with the following config:
ca = "/opt/event-handler/ca.crt"
cert = "/opt/event-handler/client.crt"
key = "/opt/event-handler/client.key"
url = "https://localhost:8888/test.log"
session-url = "https://localhost:8888/session"
However, the http input plugin on fluent-bit is unable to accept the connection:
[2024/09/09 11:01:00] [error] [tls] error: unexpected EOF
[2024/09/09 11:01:00] [debug] [downstream] connection #51 failed
Even when sending sample data via curl, the data is accepted, but the logs show:
[2024/09/09 11:12:55] [error] [/tmp/fluent-bit/src/tls/openssl.c:551 errno=0] Success
[2024/09/09 11:12:55] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2024/09/09 11:12:55] [debug] [socket] could not validate socket status for #51 (don't worry)
[2024/09/09 11:12:55] [debug] [task] created task=0x7fba6d036640 id=0 OK
[2024/09/09 11:12:55] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] test.log: [[1725880375.021842822, {}], {"json"=>"{"foo":"bar"}"}]
[2024/09/09 11:12:55] [debug] [out flush] cb_destroy coro_id=1
[2024/09/09 11:12:55] [debug] [task] destroy task=0x7fba6d036640 (task_id=0)
fluent-bit input config:
[INPUT]
name http
listen 0.0.0.0
port 8888
threaded On
tls On
tls.verify Off
tls.debug 4
tls.ca_file /opt/event-handler/ca.crt
tls.crt_file /opt/event-handler/server.crt
tls.key_file /opt/event-handler/server.key
tls.key_passwd xxxxxx
Any pointers to solve this will be helpful.
@LukasJerabek I'm aiming for a similar setup to yours: fluentd -> fluent-bit over http with mTLS.
I'm also considering another use case where fluentbit is a better fit than Vector (wineventlog). But the fact that this still hasn't been fixed is holding us back and is worrisome. There has been a whole new stable release in the meantime, and this is not some small bug; HTTP over TLS is a very common scenario for forwarding logs.
I am also seeing this. I have a .NET application posting high volumes of data to the HTTP endpoint with TLS enabled.
We see dropped data and slow processing from fluentbit, with timeouts on the client side.
Bug Report
Describe the bug: Fluent-bit produces a large number of TLS/connection errors in its logs when TLS is enabled with the forward input plugin.
The use case is one instance of fluent-bit running inside EC2 outputting logs to a receiver fluent-bit instance running inside a kube cluster to securely forward messages into graylog.
Observations:
- [debug] [downstream] connection #51 failed
- [debug] [socket] could not validate socket status for #52 (don't worry)
To Reproduce: Example log messages:
Occasionally:
Expected behavior: Fluent-bit doesn't spew TLS errors.
Your Environment
Receiver (Fluent Bit in the kube cluster):

[INPUT]
name forward
listen 0.0.0.0
port 24224
tls on
tls.debug 4
tls.verify on
tls.crt_file /etc/tls/fluent-bit-ingress-tls/tls.crt
tls.key_file /etc/tls/fluent-bit-ingress-tls/tls.key
storage.type filesystem

[OUTPUT]
Name gelf
Match *
Host ~URL omitted~
Port 12212
Mode tls
tls On
tls.verify Off
tls.ca_file /fluent-bit/etc/ca.crt
tls.vhost ~URL omitted~
Gelf_Short_Message_Key message
Gelf_Host_Key container_name
storage.total_limit_size 256MB

Sender (Fluent Bit on EC2):

[SERVICE]
parsers_file /fluent-bit/etc/parsers.conf

[INPUT]
name forward
listen 0.0.0.0
port 24224

[OUTPUT]
Name stdout
Format json_lines
Match OUTPUT

[OUTPUT]
Name forward
Match OUTPUT
Host ~URL omitted~
Port 24224
tls on
tls.verify on
tls.ca_file /etc/fluent-bit/ca.crt