aws / aws-for-fluent-bit

The source of the amazon/aws-for-fluent-bit container image
Apache License 2.0

[engine] caught signal (SIGSEGV) #351

Open LucasHantz opened 2 years ago

LucasHantz commented 2 years ago

When trying to upgrade to the latest version of FireLens (v2.25.0, packaged with Fluent Bit v1.9.3), we regularly have our socket cut: "End-of-file reached, probably we got disconnected (sent 0 of 294)"

Fluent Bit v1.9.3

Going back to v2.23.3, packaged with Fluent Bit v1.8.15, resolves it for now.

PettitWesley commented 2 years ago

@LucasHantz Can you please attach your task def and any custom Fluent Bit config?

Have you seen this crash consistently? Does it only happen after Fluent Bit receives a SIGTERM?

Also:

we regularly have our socket cut: "End-of-file reached, probably we got disconnected (sent 0 of 294)"

I don't see this in the log output you shared, can you share full log output with this?

LucasHantz commented 2 years ago

Hello @PettitWesley. The issue with the socket being cut happens even more frequently than the SIGSEGV. Find the configuration attached: firelens.zip

The log is from our internal application, which forwards logs to tcp://127.0.0.1:5170. It's a PHP application using the logging library https://github.com/Seldaek/monolog/blob/main/src/Monolog/Handler/SocketHandler.php
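For context, a minimal sketch of that kind of Monolog setup (the persistent-connection and JSON-formatter choices here are assumptions for illustration, not our actual application code):

```php
<?php
// Sketch: forward application logs to a local Fluent Bit TCP input on
// 127.0.0.1:5170 using Monolog's SocketHandler.
require __DIR__ . '/vendor/autoload.php';

use Monolog\Logger;
use Monolog\Handler\SocketHandler;
use Monolog\Formatter\JsonFormatter;

$handler = new SocketHandler('tcp://127.0.0.1:5170');
$handler->setPersistent(true);               // keep the socket open across writes
$handler->setFormatter(new JsonFormatter()); // one JSON object per line

$logger = new Logger('app');
$logger->pushHandler($handler);
$logger->info('hello from a php-fpm worker');
```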

PettitWesley commented 2 years ago

@LucasHantz Interesting. Recently I worked on testing similar reports of failures with the TCP input; I forgot to post the results of my investigation publicly, and I will do that ASAP. What is your approximate throughput emitted to the TCP input?

My tests were with the log4j tcp appender: https://github.com/aws/aws-for-fluent-bit/tree/mainline/troubleshooting/tools/log4j-tcp-app

My only finding was that adding workers to ALL outputs significantly improves the throughput that the TCP input can accept. This is likely because without workers, the entirety of Fluent Bit is one thread, with both inputs and outputs fighting for time: https://github.com/fluent/fluent-bit/blob/master/DEVELOPER_GUIDE.md#concurrency

However, once you add workers, the outputs get their own separate threads, freeing up the main thread to focus exclusively on inputs. I still think Fluent Bit needs workers for inputs to truly scale, but I didn't see workers for your outputs in your config, so please try adding that.
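A sketch of what that looks like (the output plugin and values here are placeholders, not the attached config):

```
# Sketch only: give each [OUTPUT] its own worker threads so the main event
# loop can spend its time servicing inputs. Plugin and values are placeholders.
[OUTPUT]
    Name            kinesis_firehose
    Match           *
    region          us-east-1
    delivery_stream my-delivery-stream
    Workers         2
```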

LucasHantz commented 2 years ago

Is there some way to get throughput stats from Fluent Bit? It's quite difficult to get from the application level, as PHP-FPM forks multiple processes that can each handle traffic, and each one has its own socket open to the same TCP input.

All right, I'll try it with a worker on each output and see how it goes, but does that explain the SIGSEGV as well? Also, do you see any particular reason this would happen only on the new version, with the same traffic? (It is not occurring currently with v2.23.3 / Fluent Bit v1.8.15.)

PettitWesley commented 2 years ago

does that explain the SIGSEGV as well?

The SIGSEGV must be a different issue, which I will try my best to repro.

Also, do you see any particular reason this would happen only on the new version, with the same traffic? (It is not occurring currently with v2.23.3 / Fluent Bit v1.8.15.)

To be clear here, are you talking about the SIGSEGV issue or the TCP input issue? I think there are two separate issues here.

PettitWesley commented 2 years ago

@LucasHantz Also, wait... I realized from your config that you are using the older, lower-performance Go firehose plugin; in addition to enabling workers, you need to migrate to the core plugin: https://github.com/aws/amazon-kinesis-firehose-for-fluent-bit#new-higher-performance-core-fluent-bit-plugin
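As a rough sketch of that migration (region and stream name are placeholders, not the values from the attached config), the change is essentially the output Name, and the core plugin also accepts Workers:

```
# Before: older Go plugin
[OUTPUT]
    Name            firehose
    Match           *
    region          us-east-1
    delivery_stream my-delivery-stream

# After: core C plugin, same basic keys, plus workers
[OUTPUT]
    Name            kinesis_firehose
    Match           *
    region          us-east-1
    delivery_stream my-delivery-stream
    Workers         2
```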

Also, now that I look at your logs again, I am confused:

time="2022-05-12T09:28:54Z" level=info msg="[cloudwatch 0] Created log stream xxx in group xxx" [2022/05/12 09:41:19] [engine] caught signal (SIGTERM) [2022/05/12 09:41:19] [ info] [input] pausing forward.0 [2022/05/12 09:41:19] [ info] [input] pausing forward.1 [2022/05/12 09:41:19] [ info] [input] pausing logs [2022/05/12 09:41:19] [ info] [input] pausing metrics [2022/05/12 09:41:19] [ info] [input] pausing io

I see here that you are also using the low-performance CloudWatch Go plugin; that needs to be migrated as well: https://github.com/aws/amazon-cloudwatch-logs-for-fluent-bit#new-higher-performance-core-fluent-bit-plugin
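Sketch of the corresponding CloudWatch migration (group/stream names and region are placeholders):

```
# Before: older Go plugin
[OUTPUT]
    Name              cloudwatch
    Match             *
    region            us-east-1
    log_group_name    my-log-group
    log_stream_prefix app-

# After: core C plugin; a single worker is the commonly recommended
# setting for cloudwatch_logs
[OUTPUT]
    Name              cloudwatch_logs
    Match             *
    region            us-east-1
    log_group_name    my-log-group
    log_stream_prefix app-
    Workers           1
```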

Also, I am confused here because the logs don't match up with the config you shared: I don't see the CW plugin used in your config, and I don't see inputs with aliases logs, metrics, or io. Are the logs and config from the same test run?

PettitWesley commented 2 years ago

Is there some way to get that stat of throughput from FluentBit?

You can try the monitoring interface and divide the per-plugin counters by uptime: https://docs.fluentbit.io/manual/administration/monitoring#health-check-for-fluent-bit

Or see: https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/send-fb-internal-metrics-to-cw
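For example (a sketch of the plain Fluent Bit settings involved; on FireLens the [SERVICE] section is generated for you, so see the aws-samples link above for the ECS-specific setup):

```
# Sketch: enable Fluent Bit's built-in monitoring endpoint.
[SERVICE]
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020

# Then, from inside the task/container network:
#   curl -s http://127.0.0.1:2020/api/v1/metrics   # per-plugin record/byte counters
#   curl -s http://127.0.0.1:2020/api/v1/uptime    # uptime in seconds
# Dividing an input's records/bytes by uptime gives an approximate throughput.
```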

PettitWesley commented 2 years ago

@LucasHantz Please see the results of my TCP input testing here: https://github.com/aws/aws-for-fluent-bit/issues/294

mfn commented 2 years ago

FTR, today I upgraded to 2.26.0 on 2 clusters:

"Luckily" it crashed only on the staging system 😅

Going back to 2.25.1 worked for me. I realize we're talking about different versions here; in my case I suspect it might be related to fluent-bit 1.9.4 (= 2.26.0) vs. fluent-bit 1.9.3 (= 2.25.1).

tatsuo48 commented 2 years ago

SIGSEGV is also generated after receiving SIGTERM in v2.24.0. Fixed by downgrading to v2.23.4.

PettitWesley commented 2 years ago

We believe this crash report is likely the same as the one described here: https://github.com/fluent/fluent-bit/issues/5753#issuecomment-1241174476

The fix will be released in 2.28.1: https://github.com/aws/aws-for-fluent-bit/pull/418

maxflowers89 commented 9 months ago

Hi all, we are facing the same error message:

[engine] caught signal (SIGSEGV)

which coincides with restarts of our Fluent Bit pods, and it's kind of annoying. We are using the aws-for-fluent-bit:2.32.0 image, which packages Fluent Bit 1.9.10. Is there any way to get this issue fixed without downgrading?

prashanthzen commented 3 weeks ago

Hi all, we are facing the same error message: [engine] caught signal (SIGSEGV)

We are using the aws-for-fluent-bit:2.32.4 image. Any suggestions on how to fix this issue?

This is happening across multiple tasks in our ECS clusters.