LucasHantz opened this issue 2 years ago
@LucasHantz Can you please attach your task def and any custom Fluent Bit config?
Have you seen this crash consistently? Does it only happen after Fluent Bit receives a SIGTERM?
Also:
we regularly have our socket cut "End-of-file reached, probably we got disconnected (sent 0 of 294)"
I don't see this in the log output you shared, can you share full log output with this?
Hello @PettitWesley The socket being cut happens even more frequently than the SIGSEGV, which we also see. Find attached the configuration: firelens.zip
The log is from our internal application, which forwards the logs to tcp://127.0.0.1:5170. It's a PHP application using the logging library: https://github.com/Seldaek/monolog/blob/main/src/Monolog/Handler/SocketHandler.php
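For context, here is a minimal sketch of the kind of TCP input this setup implies on the Fluent Bit side; the values are assumed for illustration, and the actual definition is whatever is in the attached firelens.zip:

```
# Assumed sketch only; the real input lives in the attached firelens.zip
[INPUT]
    Name    tcp
    Listen  0.0.0.0
    Port    5170
    Format  json
    Tag     php.app
```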
@LucasHantz Interesting. Recently I worked on testing similar reports of failures with the TCP input; I forgot to post the results of my investigation publicly, I will do that ASAP. What is your approximate throughput emitted to the TCP input?
My tests were with the log4j tcp appender: https://github.com/aws/aws-for-fluent-bit/tree/mainline/troubleshooting/tools/log4j-tcp-app
My only finding was that adding workers to ALL outputs significantly improves the throughput that the TCP input can accept. This is likely because without workers, the entirety of Fluent Bit is one thread, with both inputs and outputs fighting for time: https://github.com/fluent/fluent-bit/blob/master/DEVELOPER_GUIDE.md#concurrency
However, once you add workers, the outputs get their own separate threads, freeing up the main thread to focus exclusively on inputs. I still think Fluent Bit needs workers for inputs to truly scale, but I didn't see workers for your outputs in your config, so please try adding that.
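For illustration, a minimal sketch of what adding workers to an output stanza looks like; the plugin, region, and stream name below are placeholders, and `Workers` is the relevant key:

```
# Illustrative only: give the output its own worker threads so the main
# event loop can focus on inputs like the TCP listener
[OUTPUT]
    Name             kinesis_firehose
    Match            *
    region           eu-west-1            # placeholder
    delivery_stream  my-delivery-stream   # placeholder
    Workers          2
```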
Is there some way to get that stat of throughput from FluentBit? It's quite difficult to get from the application level, as PHP-FPM forks multiple processes which can handle traffic, and each one has a socket open to the same TCP input.
All right, I'll try it with a worker on each output and see how it goes, but does that explain the SIGSEGV as well? Also, do you see any particular reason this would happen only on the new version? (This is not occurring with v2.23.3 / Fluent Bit v1.8.15 under the same traffic.)
does that explain the SIGSEGV as well?
The SIGSEGV must be a different issue, which I will try my best to repro.
Also, do you see any particular reason this would happen only on the new version? (This is not occurring with v2.23.3 / Fluent Bit v1.8.15 under the same traffic.)
To be clear here, are you talking about the SIGSEGV issue or the TCP input issue? I think there are two separate issues here.
@LucasHantz Also wait... I realized from your config that you are using the older, lower-performance Go plugin; you need to migrate in addition to enabling workers: https://github.com/aws/amazon-kinesis-firehose-for-fluent-bit#new-higher-performance-core-fluent-bit-plugin
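Roughly, the migration is a plugin rename in the output stanza; the values below are placeholders, and per the linked README most option names carry over:

```
# Old Go plugin
[OUTPUT]
    Name             firehose
    Match            *
    region           eu-west-1            # placeholder
    delivery_stream  my-delivery-stream   # placeholder

# New core C plugin (then add Workers as shown above)
[OUTPUT]
    Name             kinesis_firehose
    Match            *
    region           eu-west-1            # placeholder
    delivery_stream  my-delivery-stream   # placeholder
    Workers          2
```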
Also, actually now that I look at your logs again I am confused:
time="2022-05-12T09:28:54Z" level=info msg="[cloudwatch 0] Created log stream xxx in group xxx" [2022/05/12 09:41:19] [engine] caught signal (SIGTERM) [2022/05/12 09:41:19] [ info] [input] pausing forward.0 [2022/05/12 09:41:19] [ info] [input] pausing forward.1 [2022/05/12 09:41:19] [ info] [input] pausing logs [2022/05/12 09:41:19] [ info] [input] pausing metrics [2022/05/12 09:41:19] [ info] [input] pausing io
I see here you are also using the low-performance CloudWatch plugin; that needs to be migrated as well: https://github.com/aws/amazon-cloudwatch-logs-for-fluent-bit#new-higher-performance-core-fluent-bit-plugin
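Same idea for CloudWatch: rename the output plugin in the stanza. The values below are placeholders; see the linked README for the full option mapping:

```
# Old Go plugin
[OUTPUT]
    Name               cloudwatch
    Match              *
    region             eu-west-1        # placeholder
    log_group_name     my-log-group     # placeholder
    log_stream_prefix  app-             # placeholder

# New core C plugin
[OUTPUT]
    Name               cloudwatch_logs
    Match              *
    region             eu-west-1        # placeholder
    log_group_name     my-log-group     # placeholder
    log_stream_prefix  app-             # placeholder
    auto_create_group  true
    Workers            2
```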
Also, I am confused here because the logs don't match up with the config you shared; I don't see the CW plugin used in your config, and I don't see inputs with the aliases logs, metrics, or io. Are the logs and config from the same test run?
Is there some way to get that stat of throughput from FluentBit?
You can try the monitoring interface and divide by uptime: https://docs.fluentbit.io/manual/administration/monitoring#health-check-for-fluent-bit
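A rough sketch of what that looks like (port and bind address are illustrative): enable the built-in HTTP server in the [SERVICE] section, then read the per-plugin record counters and divide by uptime.

```
[SERVICE]
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020

# Then, for a rough records-per-second estimate:
#   curl -s http://127.0.0.1:2020/api/v1/metrics   # per-plugin record/byte counters
#   curl -s http://127.0.0.1:2020/api/v1/uptime    # uptime in seconds
```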
@LucasHantz Please see the results of my TCP input testing here: https://github.com/aws/aws-for-fluent-bit/issues/294
FTR, today I upgraded to 2.26.0 on 2 clusters:
[engine] caught signal (SIGSEGV)
"Luckily" it crashed only on the staging system 😅
Going back to 2.25.1 worked for me. I realize we're talking about different versions here; in my case I suspect it might be related to fluent-bit 1.9.4 (=2.26.0) vs. fluent-bit 1.9.3 (=2.25.1).
SIGSEGV is also generated after receiving SIGTERM in v2.24.0. Fixed by downgrading to v2.23.4.
We believe this crash report is likely the same as is described here: https://github.com/fluent/fluent-bit/issues/5753#issuecomment-1241174476
The fix will be released in 2.28.1: https://github.com/aws/aws-for-fluent-bit/pull/418
Hi all, we are facing the same error message
[engine] caught signal (SIGSEGV)
which coincides with our Fluent Bit pods restarting, and it's kind of annoying. We are using the aws-for-fluent-bit:2.32.0 image, which packages Fluent Bit 1.9.10. Is there any way to get this issue fixed without downgrading?
Hi all, we are facing the same error message [engine] caught signal (SIGSEGV)
We are using the aws-for-fluent-bit:2.32.4 image. Any suggestions on how to fix this issue?
This is happening across multiple tasks in our ECS clusters.
When trying to upgrade to the latest version of FireLens (v2.25.0, packaged with Fluent Bit v1.9.3), we regularly have our socket cut: "End-of-file reached, probably we got disconnected (sent 0 of 294)"
Fluent Bit v1.9.3
Going back to v2.23.3 (packaged with Fluent Bit v1.8.15) resolves the issue.