LucasHantz opened this issue 2 years ago
@LucasHantz Can you please attach your task def and any custom Fluent Bit config?
Have you seen this crash consistently? Does it only happen after Fluent Bit receives a SIGTERM?
Also:
we regularly have our socket cut "End-of-file reached, probably we got disconnected (sent 0 of 294)"
I don't see this in the log output you shared, can you share full log output with this?
Hello @PettitWesley The socket being cut happens even more frequently than the SIGSEGV, which we also see. Find attached the configuration: firelens.zip
The log is from our internal application, which forwards the logs to tcp://127.0.0.1:5170. It's a PHP application using the logging library: https://github.com/Seldaek/monolog/blob/main/src/Monolog/Handler/SocketHandler.php
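For context, here is a minimal sketch of the kind of TCP input this setup implies on the Fluent Bit side; the values are assumed for illustration, and the actual definition is whatever is in the attached firelens.zip:

```
# Assumed sketch only; the real input lives in the attached firelens.zip
[INPUT]
    Name    tcp
    Listen  0.0.0.0
    Port    5170
    Format  json
    Tag     php.app
```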
@LucasHantz Interesting. Recently I worked on testing similar reports of failures with the TCP input; I forgot to post the results of my investigation publicly, I will do that ASAP. What is your approximate throughput emitted to the TCP input?
My tests were with the log4j tcp appender: https://github.com/aws/aws-for-fluent-bit/tree/mainline/troubleshooting/tools/log4j-tcp-app
My only finding was that adding workers to ALL outputs significantly improves the throughput that the TCP input can accept. This is likely because without workers, the entirety of Fluent Bit is one thread, with both inputs and outputs fighting for time: https://github.com/fluent/fluent-bit/blob/master/DEVELOPER_GUIDE.md#concurrency
However, once you add workers, the outputs get their own separate threads, freeing up the main thread to focus exclusively on inputs. I still think Fluent Bit needs workers for inputs to truly scale, but I didn't see workers for your outputs in your config, so please try adding that.
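For illustration, a minimal sketch of what adding workers to an output stanza looks like; the plugin, region, and stream name below are placeholders, and `Workers` is the relevant key:

```
# Illustrative only: give the output its own worker threads so the main
# event loop can focus on inputs like the TCP listener
[OUTPUT]
    Name             kinesis_firehose
    Match            *
    region           eu-west-1            # placeholder
    delivery_stream  my-delivery-stream   # placeholder
    Workers          2
```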
Is there some way to get that stat of throughput from FluentBit? It's quite difficult to get from the application level, as PHP-FPM forks multiple processes which can handle traffic, and each one has a socket open to the same TCP input.
All right, I'll try it with a worker on each output and see how it goes, but does that explain the SIGSEGV as well? Also, do you see any particular reason this would happen only on the new version? (This is not occurring with v2.23.3 / Fluent Bit v1.8.15 under the same traffic.)
does that explain the SIGSEGV as well?
The SIGSEGV must be a different issue, which I will try my best to repro.
Also, do you see any particular reason this would happen only on the new version? (This is not occurring with v2.23.3 / Fluent Bit v1.8.15 under the same traffic.)
To be clear here, are you talking about the SIGSEGV issue or the TCP input issue? I think there are two separate issues here.
@LucasHantz Also wait... I realized from your config that you are using the older, lower-performance Go plugin; you need to migrate in addition to enabling workers: https://github.com/aws/amazon-kinesis-firehose-for-fluent-bit#new-higher-performance-core-fluent-bit-plugin
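Roughly, the migration is a plugin rename in the output stanza; the values below are placeholders, and per the linked README most option names carry over:

```
# Old Go plugin
[OUTPUT]
    Name             firehose
    Match            *
    region           eu-west-1            # placeholder
    delivery_stream  my-delivery-stream   # placeholder

# New core C plugin (then add Workers as shown above)
[OUTPUT]
    Name             kinesis_firehose
    Match            *
    region           eu-west-1            # placeholder
    delivery_stream  my-delivery-stream   # placeholder
    Workers          2
```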
Also, actually now that I look at your logs again I am confused:
time="2022-05-12T09:28:54Z" level=info msg="[cloudwatch 0] Created log stream xxx in group xxx" [2022/05/12 09:41:19] [engine] caught signal (SIGTERM) [2022/05/12 09:41:19] [ info] [input] pausing forward.0 [2022/05/12 09:41:19] [ info] [input] pausing forward.1 [2022/05/12 09:41:19] [ info] [input] pausing logs [2022/05/12 09:41:19] [ info] [input] pausing metrics [2022/05/12 09:41:19] [ info] [input] pausing io
I see here you are also using the low-performance CloudWatch plugin; that needs to be migrated as well: https://github.com/aws/amazon-cloudwatch-logs-for-fluent-bit#new-higher-performance-core-fluent-bit-plugin
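Same idea for CloudWatch: rename the output plugin in the stanza. The values below are placeholders; see the linked README for the full option mapping:

```
# Old Go plugin
[OUTPUT]
    Name               cloudwatch
    Match              *
    region             eu-west-1        # placeholder
    log_group_name     my-log-group     # placeholder
    log_stream_prefix  app-             # placeholder

# New core C plugin
[OUTPUT]
    Name               cloudwatch_logs
    Match              *
    region             eu-west-1        # placeholder
    log_group_name     my-log-group     # placeholder
    log_stream_prefix  app-             # placeholder
    auto_create_group  true
    Workers            2
```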
Also, I am confused here because the logs don't match up with the config you shared; I don't see the CW plugin used in your config, and I don't see inputs with the aliases logs, metrics, or io. Are the logs and config from the same test run?
Is there some way to get that stat of throughput from FluentBit?
You can try the monitoring interface and divide by uptime: https://docs.fluentbit.io/manual/administration/monitoring#health-check-for-fluent-bit
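A rough sketch of what that looks like (port and bind address are illustrative): enable the built-in HTTP server in the [SERVICE] section, then read the per-plugin record counters and divide by uptime.

```
[SERVICE]
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020

# Then, for a rough records-per-second estimate:
#   curl -s http://127.0.0.1:2020/api/v1/metrics   # per-plugin record/byte counters
#   curl -s http://127.0.0.1:2020/api/v1/uptime    # uptime in seconds
```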
@LucasHantz Please see the results of my TCP input testing here: https://github.com/aws/aws-for-fluent-bit/issues/294
FTR, today I upgraded to 2.26.0 on 2 clusters:
[engine] caught signal (SIGSEGV)
"Luckily" it crashed only on the staging system 😅
Going back to 2.25.1 worked for me. I realize we're talking about different versions here; in my case I suspect it might be related to fluent-bit 1.9.4 (=2.26.0) vs. fluent-bit 1.9.3 (=2.25.1).
SIGSEGV is also generated after receiving SIGTERM in v2.24.0. Fixed by downgrading to v2.23.4.
We believe this crash report is likely the same as is described here: https://github.com/fluent/fluent-bit/issues/5753#issuecomment-1241174476
The fix will be released in 2.28.1: https://github.com/aws/aws-for-fluent-bit/pull/418
Hi all, we are facing the same error message
[engine] caught signal (SIGSEGV)
which coincides with our Fluent Bit pods restarting, and it's kind of annoying. We are using the aws-for-fluent-bit:2.32.0 image, which packages Fluent Bit 1.9.10. Is there any way to get this issue fixed without downgrading?
Hi all, we are facing the same error message [engine] caught signal (SIGSEGV)
We are using the aws-for-fluent-bit:2.32.4 image. Any suggestions on how to fix this issue?
This is happening across multiple tasks in our ECS clusters.
When trying to upgrade to the latest version of FireLens (v2.25.0, packaged with Fluent Bit v1.9.3), we regularly have our socket cut: "End-of-file reached, probably we got disconnected (sent 0 of 294)"
Fluent Bit v1.9.3
Going back to v2.23.3 (packaged with Fluent Bit v1.8.15) resolves the issue.