fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

stackdriver output plugin broken on arm32v7 docker images since v3.0.0 #8785

Open rmsaad opened 6 months ago

rmsaad commented 6 months ago

Bug Report

Describe the bug

The stackdriver output plugin has been broken for arm32v7 release builds (i.e. Docker images) since v3.0.0.

I have done some digging, and this does not seem to be caused by a recently introduced bug. Instead, it seems that before this commit: https://github.com/fluent/fluent-bit/commit/71746b35718e856a5f8615f95f35d450a142e8cd setting FLB_RELEASE=On did not produce a release binary unless FLB_DEBUG was also explicitly turned off, so the Docker images always shipped a debug build of fluent-bit until v3.0.0.
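The flag interaction described above can be written down as a toy truth table. This is only a model of the description in this report: `is_release_build` and its parameters are hypothetical names, and it is not the project's actual CMake logic.

```python
def is_release_build(flb_release: bool, flb_debug: bool, after_commit: bool) -> bool:
    """Toy model of the build selection described in this report.

    Hypothetical helper: it encodes the report's description only,
    not the real build scripts.
    """
    if after_commit:
        # Per the report: FLB_RELEASE=On now yields a release binary.
        return flb_release
    # Per the report: FLB_RELEASE=On had no effect unless FLB_DEBUG was off.
    return flb_release and not flb_debug


# Image build as the report implies: FLB_RELEASE=On, FLB_DEBUG not turned off.
print(is_release_build(True, True, after_commit=False))
print(is_release_build(True, True, after_commit=True))
```

If the model is right, the same image build flags give a debug binary before the commit and a release binary after it, which would explain why the crash first showed up in v3.0.0 images.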

To Reproduce

  1. Build a release build (FLB_RELEASE=On) for arm32v7 from any commit since: https://github.com/fluent/fluent-bit/commit/71746b35718e856a5f8615f95f35d450a142e8cd.
  2. Add stackdriver as an output in your .conf file.
  3. fluent-bit will crash with a SIGSEGV signal.
Fluent Bit v3.0.2
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

___________.__                        __    __________.__  __          ________
\_   _____/|  |  __ __   ____   _____/  |_  \______   \__|/  |_  ___  _\_____  \
 |    __)  |  | |  |  \_/ __ \ /    \   __\  |    |  _/  \   __\ \  \/ / _(__  <
 |     \   |  |_|  |  /\  ___/|   |  \  |    |    |   \  ||  |    \   / /       \
 \___  /   |____/____/  \___  >___|  /__|    |______  /__||__|     \_/ /______  /
     \/                     \/     \/               \/                        \/

[2024/05/02 15:51:25] [ info] [fluent bit] version=3.0.2, commit=33ce918351, pid=19704
[2024/05/02 15:51:25] [ info] [storage] ver=1.5.2, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2024/05/02 15:51:25] [ info] [cmetrics] version=0.7.3
[2024/05/02 15:51:25] [ info] [ctraces ] version=0.4.0
[2024/05/02 15:51:25] [ info] [input:cpu:cpu.0] initializing
[2024/05/02 15:51:25] [ info] [input:cpu:cpu.0] storage_strategy='memory' (memory only)
[2024/05/02 15:51:25] [ info] [output:stackdriver:stackdriver.0] metadata_server set to http://metadata.google.internal
[2024/05/02 15:51:25] [ warn] [output:stackdriver:stackdriver.0] GOOGLE_APPLICATION_CREDENTIALS and GOOGLE_SERVICE_CREDENTIALS are both defined. Defaulting to GOOGLE_APPLICATION_CREDENTIALS
[2024/05/02 15:51:25] [ info] [oauth2] HTTP Status=200
[2024/05/02 15:51:25] [ info] [oauth2] access token from 'oauth2.googleapis.com:443' retrieved
[2024/05/02 15:51:25] [ info] [sp] stream processor started
[2024/05/02 15:51:25] [ info] [output:stackdriver:stackdriver.0] worker #0 started
[2024/05/02 15:52:24] [engine] caught signal (SIGSEGV)
Aborted (core dumped)
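To check the reproduction from a script, the sketch below runs the binary and reports which signal ended it. The binary and config paths are placeholders, `describe_exit` and `run_fluent_bit` are hypothetical helpers, and the 120-second timeout is only a guess based on the roughly one minute between startup and crash in the log above. Because the log ends with "Aborted (core dumped)", the process will probably be reported as killed by SIGABRT rather than SIGSEGV.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a readable status.

    On POSIX, subprocess reports death by signal as a negative number.
    """
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


def run_fluent_bit(binary: str, conf: str, timeout_s: float = 120.0) -> str:
    """Run fluent-bit until it dies or the timeout expires."""
    try:
        # "-c <file>" is the same option used in the gdb session in this thread.
        proc = subprocess.run([binary, "-c", conf], timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "still running after timeout"
    return describe_exit(proc.returncode)


# Usage: python repro.py /path/to/fluent-bit /path/to/fluent-bit.conf
if __name__ == "__main__" and len(sys.argv) == 3:
    print(run_fluent_bit(sys.argv[1], sys.argv[2]))
```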

In the core dump backtrace below, the stack is corrupted, which causes dns_ctx to hold the illegal address 0x17cbb6dd.

#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
#1  0x76a5ba20 in __libc_signal_restore_set (set=0x749f937c) at ../sysdeps/unix/sysv/linux/internal-signals.h:86
#2  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:48
#3  0x76a4c322 in __GI_abort () at abort.c:79
#4  0x005362c6 in flb_signal_handler (signal=<optimized out>) at /src/fluent-bit/src/fluent-bit.c:602
#5  <signal handler called>
#6  0x0059efd8 in flb_net_dns_lookup_context_cleanup (dns_ctx=dns_ctx@entry=0x17cbb6dd) at /src/fluent-bit/src/flb_network.c:613
#7  0x00599720 in output_thread (data=0x7608fb80) at /src/fluent-bit/src/flb_output_thread.c:329
#8  0x005a8a0c in step_callback (data=0x7607f1c0) at /src/fluent-bit/src/flb_worker.c:43
#9  0x76f4999e in start_thread (arg=0x273f295a) at pthread_create.c:477
#10 0x76ad202c in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:73 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
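One quick sanity check on a backtrace like this is to pull out the pointer arguments and look at their alignment. The sketch below parses three frames copied from the backtrace above. The regular expressions and `pointer_args` are my own helpers and only cover the frame format shown here. Treating a non-word-aligned struct pointer as suspicious is a heuristic, not proof of corruption.

```python
import re

# Matches gdb frame lines such as:
#   #6  0x0059efd8 in func (arg=0x1234) at /path/file.c:613
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-f]+ in )?(\S+) \((.*?)\) at (\S+)")
# Matches "name=0x..." and "name=name@entry=0x..." arguments.
PTR_ARG_RE = re.compile(r"(\w+)=(?:\w+@entry=)?(0x[0-9a-f]+)")

SAMPLE = """\
#6  0x0059efd8 in flb_net_dns_lookup_context_cleanup (dns_ctx=dns_ctx@entry=0x17cbb6dd) at /src/fluent-bit/src/flb_network.c:613
#7  0x00599720 in output_thread (data=0x7608fb80) at /src/fluent-bit/src/flb_output_thread.c:329
#8  0x005a8a0c in step_callback (data=0x7607f1c0) at /src/fluent-bit/src/flb_worker.c:43
"""


def pointer_args(backtrace: str):
    """Yield (frame_no, function, arg_name, address) for hex-valued arguments."""
    for line in backtrace.splitlines():
        m = FRAME_RE.match(line.strip())
        if not m:
            continue  # e.g. "<signal handler called>" or trailing notes
        frame_no, func, args = int(m.group(1)), m.group(2), m.group(3)
        for name, value in PTR_ARG_RE.findall(args):
            yield frame_no, func, name, int(value, 16)


for frame_no, func, name, addr in pointer_args(SAMPLE):
    aligned = addr % 4 == 0
    print(f"#{frame_no} {func}: {name}=0x{addr:08x} word_aligned={aligned}")
```

Of the three pointers, only dns_ctx in frame #6 is not 4-byte aligned, which fits the idea that it is garbage rather than a real struct address.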

Expected behavior

The stackdriver output plugin should work on arm32v7 release builds, or at least the Docker images should work. I tested, and this is not a problem on arm64 or x86.

Your Environment

[INPUT]
    Name cpu
    Tag gateway_cpu
    Interval_Sec 20

[FILTER]
    Name modify
    Match *
    Add labels.gateway_env development

[FILTER]
    Name nest
    Match *
    Operation nest
    Wildcard labels.*
    Nest_under logging.googleapis.com/labels
    Remove_prefix labels.

[OUTPUT]
    Name stackdriver
    Match *
    resource generic_node
    namespace ${DEV_CODE}
    node_id ${DEV_ID}
    location northamerica-northeast1-c
    severity_key level
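For readers unfamiliar with the two filters, here is a simplified model of what they do to a record before it reaches the stackdriver output. It assumes the nest filter's wildcard is meant to be `labels.*`. `modify_add` and `nest` are hypothetical stand-ins and not Fluent Bit code, and the `cpu_p` value is made up for illustration.

```python
from fnmatch import fnmatchcase


def modify_add(record: dict, key: str, value: str) -> dict:
    """Model of the modify filter's Add rule: set key only if it is absent."""
    out = dict(record)
    out.setdefault(key, value)
    return out


def nest(record: dict, wildcard: str, nest_under: str, remove_prefix: str) -> dict:
    """Model of the nest filter with Operation nest.

    Keys matching the wildcard move into a map stored under nest_under,
    with remove_prefix stripped from their names.
    """
    nested, out = {}, {}
    for key, value in record.items():
        if fnmatchcase(key, wildcard):
            short = key[len(remove_prefix):] if key.startswith(remove_prefix) else key
            nested[short] = value
        else:
            out[key] = value
    if nested:
        out[nest_under] = nested
    return out


record = {"cpu_p": 1.5}  # illustrative value only
record = modify_add(record, "labels.gateway_env", "development")
record = nest(record, "labels.*", "logging.googleapis.com/labels", "labels.")
print(record)
```

Nothing in this pipeline is specific to ARM, so the configuration itself looks like an unlikely cause of the crash.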


* Operating System and version:
    Tested on:
      - Raspbian GNU/Linux 11 (bullseye) (Raspberry Pi 2B)
      - Debian (bullseye) (embedded Linux PC)
* Filters and plugins: See config above.

**Additional context**

I have attached valgrind output below:
[valgrind-out.txt](https://github.com/fluent/fluent-bit/files/15191368/valgrind-out.txt)

I will be falling back to the v2.2 Docker image for now.

braydonk commented 6 months ago

FYI @edsiper @leonardo-albertovich @nokute78 this likely affects any threaded output plugin on this platform, not just out_stackdriver. The segfault occurs in the generic output thread loop: https://github.com/fluent/fluent-bit/blob/07475e71ea4b9e8cdcd34154de5f89540b916171/src/flb_output_thread.c#L329

I haven't actually run it, but the only way a segfault makes sense in this stack trace is if &dns_ctx is a bad address.

rmsaad commented 6 months ago

(gdb) break /src/fluent-bit/src/flb_output_thread.c:329
Breakpoint 1 at 0xe9718: file /src/fluent-bit/src/flb_output_thread.c, line 329.
(gdb) define print_sp
Type commands for definition of "print_sp".
End with a line saying just "end".
>x/40x $sp
>print &dns_ctx
>step
>print dns_ctx
>continue
>end
(gdb) run -c /etc/fluent/fluent-bit.conf

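I defined a gdb command that prints the stack memory and &dns_ctx, then steps into flb_net_dns_lookup_context_cleanup() and prints dns_ctx. On the last hit before the segfault, the stack contents differ from the earlier hits, and the function receives dns_ctx=0x17cbc827 even though &dns_ctx is still 0x749f9994.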

Thread 4 "flb-out-stackdr" hit Breakpoint 1, output_thread (data=0x760f7b80) at /src/fluent-bit/src/flb_output_thread.c:329
329     /src/fluent-bit/src/flb_output_thread.c: No such file or directory.
(gdb)
0x749f9920:     0x00890664      0x76170200      0x00000000      0x00000000
0x749f9930:     0x00000000      0x00000000      0x00000000      0x00000008
0x749f9940:     0x00000000      0x00000008      0x00000000      0x755d0000
0x749f9950:     0x761701c0      0x00000000      0x00000000      0xdeadbeef
0x749f9960:     0x760f7bdc      0x00000000      0x00000000      0x00000000
0x749f9970:     0x00000000      0x00000000      0x00000000      0x00000000
0x749f9980:     0x00000000      0x00000000      0x00000000      0x00000000
0x749f9990:     0x00000000      0x749f9994      0x749f9994      0x749f999c
0x749f99a0:     0x749f999c      0x00000023      0x00008000      0x00000001
0x749f99b0:     0x00000002      0x00000000      0x00000000      0x00000000
$47 = (struct flb_net_dns *) 0x749f9994
flb_net_dns_lookup_context_cleanup (dns_ctx=0x749f9994) at /src/fluent-bit/src/flb_network.c:613
613     /src/fluent-bit/src/flb_network.c: No such file or directory.
$48 = (struct flb_net_dns *) 0x749f9994

Thread 4 "flb-out-stackdr" hit Breakpoint 1, output_thread (data=0x760f7b80) at /src/fluent-bit/src/flb_output_thread.c:329
329     /src/fluent-bit/src/flb_output_thread.c: No such file or directory.
(gdb)
0x749f9920:     0x00890664      0x76170200      0x00000000      0x00000000
0x749f9930:     0x00000000      0x00000000      0x00000000      0x00000008
0x749f9940:     0x00000000      0x00000008      0x00000000      0x755d0000
0x749f9950:     0x761701c0      0x00000000      0x00000000      0xdeadbeef
0x749f9960:     0x760f7bdc      0x00000000      0x00000000      0x00000000
0x749f9970:     0x00000000      0x00000000      0x00000000      0x00000000
0x749f9980:     0x00000000      0x00000000      0x00000000      0x00000000
0x749f9990:     0x00000000      0x749f9994      0x749f9994      0x749f999c
0x749f99a0:     0x749f999c      0x00000023      0x00008000      0x00000001
0x749f99b0:     0x00000002      0x00000000      0x00000000      0x00000000
$49 = (struct flb_net_dns *) 0x749f9994
flb_net_dns_lookup_context_cleanup (dns_ctx=0x749f9994) at /src/fluent-bit/src/flb_network.c:613
613     /src/fluent-bit/src/flb_network.c: No such file or directory.
$50 = (struct flb_net_dns *) 0x749f9994

Thread 4 "flb-out-stackdr" hit Breakpoint 1, output_thread (data=0x760f7b80) at /src/fluent-bit/src/flb_output_thread.c:329
329     /src/fluent-bit/src/flb_output_thread.c: No such file or directory.
(gdb)
0x749f9920:     0x00890664      0x76170200      0x00000000      0x00000000
0x749f9930:     0x00000000      0x00000000      0x00000000      0x00000008
0x749f9940:     0x00000000      0x00000008      0x00000000      0x755d0000
0x749f9950:     0x761701c0      0x00000000      0x00000000      0xdeadbeef
0x749f9960:     0x760f7bdc      0x00000000      0x00000000      0x00000000
0x749f9970:     0x00000000      0x00000000      0x00000000      0x00000000
0x749f9980:     0x00000000      0x00000000      0x00000000      0x00000000
0x749f9990:     0x00000000      0x749f9994      0x749f9994      0x749f999c
0x749f99a0:     0x749f999c      0x00000023      0x00008000      0x00000001
0x749f99b0:     0x00000002      0x00000000      0x00000000      0x00000000
$51 = (struct flb_net_dns *) 0x749f9994
flb_net_dns_lookup_context_cleanup (dns_ctx=0x749f9994) at /src/fluent-bit/src/flb_network.c:613
613     /src/fluent-bit/src/flb_network.c: No such file or directory.
$52 = (struct flb_net_dns *) 0x749f9994

Thread 4 "flb-out-stackdr" hit Breakpoint 1, output_thread (data=0x760f7b80) at /src/fluent-bit/src/flb_output_thread.c:329
329     /src/fluent-bit/src/flb_output_thread.c: No such file or directory.
(gdb)
0x749f9920:     0x00890664      0x76170200      0x00000000      0x00000000
0x749f9930:     0x00000000      0x00000000      0x00000000      0x7552f000
0x749f9940:     0x760c1560      0x761b6000      0x760f7bec      0x755d0000
0x749f9950:     0x761701c0      0x00000000      0x00000000      0xdeadbeef
0x749f9960:     0x760f7bdc      0x00000000      0x760c1560      0x00000000
0x749f9970:     0x00000000      0x00006100      0x00000000      0x00000000
0x749f9980:     0x00000000      0x00000000      0x00000000      0x00000000
0x749f9990:     0x00000000      0x754e7074      0x754e7074      0x749f999c
0x749f99a0:     0x749f999c      0x00000023      0x00008000      0x00000001
0x749f99b0:     0x00000002      0x00000000      0x00000000      0x00000000
$53 = (struct flb_net_dns *) 0x749f9994
flb_net_dns_lookup_context_cleanup (dns_ctx=0x17cbc827) at /src/fluent-bit/src/flb_network.c:613
613     /src/fluent-bit/src/flb_network.c: No such file or directory.
$54 = (struct flb_net_dns *) 0x17cbc827

Thread 4 "flb-out-stackdr" received signal SIGSEGV, Segmentation fault.
0x004eefd8 in flb_net_dns_lookup_context_cleanup (dns_ctx=0x17cbc827) at /src/fluent-bit/src/flb_network.c:613
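To make the difference between the dumps easier to see, here is a small sketch that parses `x/40x $sp` output and lists the words that changed between two breakpoint hits. `parse_dump` and `diff_dumps` are hypothetical helpers. The sample rows are the 0x749f9990 and 0x749f99a0 rows copied from the third and fourth hits above, and the parser assumes 32-bit words, four per row, as in this output.

```python
import re

# One row of gdb "x/Nx" output: "0xADDR:  0xWORD  0xWORD ..."
ROW_RE = re.compile(r"^(0x[0-9a-f]+):((?:\s+0x[0-9a-f]+)+)\s*$")

BEFORE = """\
0x749f9990:     0x00000000      0x749f9994      0x749f9994      0x749f999c
0x749f99a0:     0x749f999c      0x00000023      0x00008000      0x00000001
"""
AFTER = """\
0x749f9990:     0x00000000      0x754e7074      0x754e7074      0x749f999c
0x749f99a0:     0x749f999c      0x00000023      0x00008000      0x00000001
"""


def parse_dump(text: str) -> dict:
    """Map each 32-bit word address to its value; non-dump lines are skipped."""
    words = {}
    for line in text.splitlines():
        m = ROW_RE.match(line.strip())
        if not m:
            continue
        base = int(m.group(1), 16)
        for i, token in enumerate(m.group(2).split()):
            words[base + 4 * i] = int(token, 16)
    return words


def diff_dumps(before: dict, after: dict) -> dict:
    """Return {address: (old, new)} for words present in both dumps that differ."""
    common = sorted(before.keys() & after.keys())
    return {a: (before[a], after[a]) for a in common if before[a] != after[a]}


for addr, (old, new) in diff_dumps(parse_dump(BEFORE), parse_dump(AFTER)).items():
    print(f"0x{addr:08x}: 0x{old:08x} -> 0x{new:08x}")
```

In these two rows the only words that change are the one at &dns_ctx (0x749f9994) and the word right after it. The full dumps differ in a few more places further down the stack, which the same helpers would list if given all ten rows.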

chrisguidry commented 1 month ago

I believe I'm experiencing this same segfault on ARM with all 3.x versions, with both the http and file output plugins but, interestingly, not with the stdout output plugin. I agree that it's not specific to stackdriver, and that it seems to happen after the output plugins have done their work: I can add two file output plugins, and both files will be written before the process crashes with SIGSEGV.

I'm not as comfortable building debug versions of fluent-bit to get stacktraces here, but I can say that the issue exists on all 3.x versions and not on 2.2.3.