fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Windows Server 2019 fluent-bit.exe, version: 1.9.10.0 crash. #6607

Closed pavlov2000uk closed 1 year ago

pavlov2000uk commented 1 year ago

Bug Report

Describe the bug

The fluent-bit.exe process, version 1.9.10.0, crashes from time to time when under load.

Faulting application name: fluent-bit.exe, version: 1.9.10.0, time stamp: 0x6382da67
Faulting module name: fluent-bit.exe, version: 1.9.10.0, time stamp: 0x6382da67
Exception code: 0xc0000005
Fault offset: 0x00000000001a2de3
Faulting process id: 0x940
Faulting application start time: 0x01d9151d667819e2
Faulting application path: C:\Program Files\fluent-bit\bin\fluent-bit.exe
Faulting module path: C:\Program Files\fluent-bit\bin\fluent-bit.exe
Report Id: 90a4fd94-383a-4e20-8579-3de4edff1c07
Faulting package full name: 
Faulting package-relative application ID: 

Steps to reproduce the problem:

  1. Fluent-bit 1.9.10 installed on Server 2019 (vanilla setup). Configuration

    FLUENT-BIT 1.9.10

    myconf.conf

[SERVICE]
Flush 5
Log_Level Trace
Daemon false
Log_File C:\Program Files\fluent-bit\log\fluent-bit.log
Storage.path C:\Program Files\fluent-bit\buffer
Storage.sync normal
Storage.checksum false
Storage.backlog.mem_limit 5M

@INCLUDE C:\Program Files\fluent-bit\conf\include_dummy.conf
@INCLUDE C:\Program Files\fluent-bit\conf\include_os_metrics.conf

include_dummy.conf

[INPUT]
Name tail
Tag windows.dummy.${HOSTNAME}
# Insert the log name below
Path C:\Program Files\fluent-bit\log\access_log_20221221-111926.log
Read_from_Head false
Path_Key filename
Storage.type filesystem
DB winlog.sqlite
Mem_Buf_Limit 50MB

[OUTPUT]
Name forward
Match windows.dummy.${HOSTNAME}
Upstream C:\Program Files\fluent-bit\conf\upstream.conf
storage.total_limit_size 50M
alias windows_agents_dummy

include_os_metrics.conf

[INPUT]
Name windows_exporter_metrics
Tag windows.osmetrics.${HOSTNAME}
scrape_interval 2
alias windows_osmetrics

[OUTPUT]
Name forward
Match windows.osmetrics.${HOSTNAME}
Upstream C:\Program Files\fluent-bit\conf\upstream.conf
storage.total_limit_size 50M
alias windows_osmetrics

upstream.conf

[UPSTREAM]
name myupstream

[NODE]
name myupstreamnode1
host 192.168.1.2
port 24224

  2. Another Server 2019 machine configuration (ip = 192.168.1.2, vanilla setup)

    FLUENTD 4.4.2

    <source>
      @type forward
    </source>

    <match windows.**>
      @type stdout
    </match>

  3. The C:\Program Files\fluent-bit\log\access_log_20221221-111926.log file records are generated with a fake log generator; the command is apache-fake-log-gen.py -n 0 -o LOG (more details here: https://github.com/mihkels/fake-log-generator)

Expected behavior

Should not crash.

Screenshots

No screenshots.

Your Environment

Last messages from fluent-bit.log:

[2022/12/22 13:57:05] [debug] [input chunk] update output instances with new chunk size diff=69632
[2022/12/22 13:57:05] [debug] [input chunk] chunk 2368-1671710225.978788500.flb update plugin forward.0 fs_chunks_size by 69632 bytes, the current fs_chunks_size is 905091 bytes
[2022/12/22 13:57:05] [debug] [input:tail:tail.0] 0 new files found on path 'C:\Program Files\fluent-bit\log\access_log_20221221-111926.log'
[2022/12/22 13:57:05] [debug] [task] destroy task=000001F220F93DD0 (task_id=1)
[2022/12/22 13:57:05] [debug] [input chunk] remove chunk 2368-1671710217.599708800.flb with 7914 bytes from plugin forward.1, the updated fs_chunks_size is 3957 bytes
[2022/12/22 13:57:05] [debug] [input chunk] chunk 2368-1671710225.978788500.flb required 308 bytes and 49094909 bytes left in plugin forward.0
[2022/12/22 13:57:15] [debug] [input chunk] chunk 2368-1671710225.978788500.flb update plugin forward.0 fs_chunks_size by -23646 bytes, the current fs_chunks_size is 881445 bytes
[2022/12/22 13:57:15] [debug] [task] created task=000001F220F93ED0 id=1 OK
[2022/12/22 13:57:15] [debug] [output:forward:windows_agents_tanium] request 45923 bytes to flush
[2022/12/22 13:57:15] [debug] [output:forward:windows_osmetrics] request 3957 bytes to flush
[2022/12/22 13:57:15] [debug] [output:forward:windows_agents_tanium] task_id=1 assigned to thread #0
[2022/12/22 13:57:15] [debug] [upstream] KA connection #1752 to 192.168.50.149:24224 has been assigned (recycled)
[2022/12/22 13:57:15] [debug] [upstream] KA connection #1480 to 192.168.50.195:24224 has been assigned (recycled)
[2022/12/22 13:57:15] [debug] [task] created task=000001F21F3AAE40 id=2 OK
[2022/12/22 13:57:15] [debug] [upstream] KA connection #1752 to 192.168.50.149:24224 is now available
[2022/12/22 13:57:15] [debug] [out flush] cb_destroy coro_id=9609
[2022/12/22 13:57:15] [debug] [upstream] KA connection #1480 to 192.168.50.195:24224 is now available
[2022/12/22 13:57:15] [debug] [output:forward:windows_osmetrics] task_id=2 assigned to thread #1
[2022/12/22 13:57:15] [debug] [out flush] cb_destroy coro_id=9603
[2022/12/22 13:57:15] [debug] [input chunk] chunk 2368-1671710235.800341500.flb required 3957 bytes and 49996043 bytes left in plugin forward.1
[2022/12/22 13:57:15] [debug] [input chunk] update output instances with new chunk size diff=3957
[2022/12/22 13:57:15] [debug] [input chunk] chunk 2368-1671710235.800341500.flb update plugin forward.1 fs_chunks_size by 3957 bytes, the current fs_chunks_size is 7914 bytes
[2022/12/22 13:57:15] [debug] [input chunk] chunk 2368-1671710235.801076400.flb required 45558 bytes and 49118555 bytes left in plugin forward.0
[2022/12/22 13:57:15] [debug] [input chunk] update output instances with new chunk size diff=69632
[2022/12/22 13:57:15] [debug] [input chunk] chunk 2368-1671710235.801076400.flb update plugin forward.0 fs_chunks_size by 69632 bytes, the current fs_chunks_size is 951077 bytes
[2022/12/22 13:57:15] [debug] [task] destroy task=000001F21F3AAE40 (task_id=2)
[2022/12/22 13:57:15] [debug] [input chunk] remove chunk 2368-1671710225.977901900.flb with 3957 bytes from plugin forward.1, the updated fs_chunks_size is 3957 bytes

WinDbg output:

Microsoft (R) Windows Debugger Version 10.0.25200.1003 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.

Loading Dump File [D:\DO_NOT_DELETE\CASES\00804083_Fluent_Bit\CRASH\fluent-bit.exe_221222_135706.dmp] User Mini Dump File with Full Memory: Only application data is available

Comment: ' procdump64.exe -ma -e fluent-bit.exe Unhandled exception: C0000005.ACCESS_VIOLATION'

***** Path validation summary *****
Response  Time (ms)  Location
Deferred             srv
Symbol search path is: srv
Executable search path is:
Windows 10 Version 17763 MP (4 procs) Free x64
Product: Server, suite: TerminalServer DataCenter SingleUserTS
Edition build lab: 17763.1.amd64fre.rs5_release.180914-1434
Machine Name:
Debug session time: Thu Dec 22 13:57:08.000 2022 (UTC + 2:00)
System Uptime: 1 days 22:53:36.323
Process Uptime: 1 days 2:37:08.000
Loading unloaded module list
This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr.
(940.110): Access violation - code c0000005 (first/second chance not available)
For analysis of this file, run !analyze -v
fluent_bit!snprintf+0x27fe3:
00007ff6df492de3 4d0fbe10        movsx   r10,byte ptr [r8] ds:000001f2215c003f=??
0:004> !analyze -v


KEY_VALUES_STRING: 1

Key  : AV.Fault
Value: Read

Key  : Analysis.CPU.mSec
Value: 905

Key  : Analysis.DebugAnalysisManager
Value: Create

Key  : Analysis.Elapsed.mSec
Value: 15739

Key  : Analysis.IO.Other.Mb
Value: 14

Key  : Analysis.IO.Read.Mb
Value: 0

Key  : Analysis.IO.Write.Mb
Value: 19

Key  : Analysis.Init.CPU.mSec
Value: 171

Key  : Analysis.Init.Elapsed.mSec
Value: 4262

Key  : Analysis.Memory.CommitPeak.Mb
Value: 81

Key  : Timeline.OS.Boot.DeltaSec
Value: 168816

Key  : Timeline.Process.Start.DeltaSec
Value: 95828

Key  : WER.OS.Branch
Value: rs5_release

Key  : WER.OS.Timestamp
Value: 2018-09-14T14:34:00Z

Key  : WER.OS.Version
Value: 10.0.17763.1

Key  : WER.Process.Version
Value: 1.9.10.0

FILE_IN_CAB: fluent-bit.exe_221222_135706.dmp

COMMENT:
procdump64.exe -ma -e fluent-bit.exe Unhandled exception: C0000005.ACCESS_VIOLATION

NTGLOBALFLAG: 0

APPLICATION_VERIFIER_FLAGS: 0

CONTEXT: (.ecxr)
rax=00000000000cbf44 rbx=000000906baff9a0 rcx=000000906baff9a0
rdx=000000906baff950 rsi=000001f220dece30 rdi=000000906baff950
rip=00007ff6df492de3 rsp=000000906baff8f0 rbp=000000906baff920
 r8=000001f2215c003f  r9=0000000000000030 r10=00007ff6df2f0000
r11=00007ff6df2f0000 r12=0000000000000000 r13=00007ff6df3f4cb0
r14=000001f2215c003f r15=000001f220d841c0
iopl=0 nv up ei pl nz na pe nc
cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010200
fluent_bit!snprintf+0x27fe3:
00007ff6df492de3 4d0fbe10        movsx   r10,byte ptr [r8] ds:000001f2215c003f=??
Resetting default scope

EXCEPTION_RECORD: (.exr -1)
ExceptionAddress: 00007ff6df492de3 (fluent_bit!snprintf+0x0000000000027fe3)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 0000000000000000
Parameter[1]: 000001f2215c003f
Attempt to read from address 000001f2215c003f

PROCESS_NAME: fluent-bit.exe

READ_ADDRESS: 000001f2215c003f

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%p referenced memory at 0x%p. The memory could not be %s.

EXCEPTION_CODE_STR: c0000005

EXCEPTION_PARAMETER1: 0000000000000000

EXCEPTION_PARAMETER2: 000001f2215c003f

STACK_TEXT:
000000906baff8f0 00007ff6df48bf9b : 000001f220d841c0 000000906baff9a0 0000000000000000 0000000000000000 : fluent_bit!snprintf+0x27fe3
000000906baff930 00007ff6df39896a : 0000000000000000 0000044e847301fc 000000906baffa30 00007ff6df3f66fb : fluent_bit!snprintf+0x2119b
000000906baff980 00007ff6df3f3b06 : 000001f220f847b0 000001f220dece30 000001f220dece30 0000000000000000 : fluent_bit!flb_time_now+0x14aaa
000000906baffa00 00007ff6df3f2e09 : 00000000ffffffff 000000906baffb91 000001f220d84d30 00007ff6df902798 : fluent_bit!sprintf+0x1a996
000000906baffb10 00007ff6df3a4f81 : 0000000000000000 0000000000000000 000001f21f3d57e0 0000000000000001 : fluent_bit!sprintf+0x19c99
000000906baffbf0 00007ff6df81036e : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : fluent_bit!sscanf+0x5c91
000000906baffca0 00007fff872a1aa1 : 00007ff6df3a4ec0 0000000000000000 0000000000000000 0000000000000000 : fluent_bit!mk_main+0x343ee
000000906baffcd0 00007fff8afe03fa : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : KERNELBASE!BaseFiberStart+0x21
000000906baffd00 0000000000000000 : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : ntdll!RtlUserFiberStart+0x1a

STACK_COMMAND: ~4s; .ecxr ; kb

SYMBOL_NAME: fluent_bit+27fe3

MODULE_NAME: fluent_bit

IMAGE_NAME: fluent-bit.exe

FAILURE_BUCKET_ID: INVALID_POINTER_READ_c0000005_fluent-bit.exe!Unknown

OS_VERSION: 10.0.17763.1

BUILDLAB_STR: rs5_release

OSPLATFORM_TYPE: x64

OSNAME: Windows 10

IMAGE_VERSION: 1.9.10.0

FAILURE_ID_HASH: {c553cc7a-b095-dfc3-fbc1-9d5091316b7c}

Followup: MachineOwner

pavlov2000uk commented 1 year ago

Happy New Year! Any ideas? :)

douglasawh commented 1 year ago

Any additional information we should collect for this?

agup006 commented 1 year ago

@pavlov2000uk @douglasawh can you provide a PR with a fix? Note: can you also repro with the latest release (2.0.8 as of this comment)?

joyartoun commented 1 year ago

Hi, I am also getting this error with the latest Fluent Bit 2.0.8 on Windows Server 2019. Can you please provide a fix or at least a workaround for this?

avtar-mindgrep commented 1 year ago

I also face the same issue with Fluent Bit 2.0.6 and Windows Server 2016.

pavlov2000uk commented 1 year ago

@agup006 Just a short intermediate summary. Sorry, we are the ones asking for a fix, so I did not quite catch your question. We would like to get a fix, and people report that this is reproducible with 2.0.8 as well.

agup006 commented 1 year ago

@avtar-mindgrep @joyartoun are you able to provide a PR / fix?

agup006 commented 1 year ago

@pavlov2000uk got it, hoping the community can help out with this - most OSS folks are focused on OTel integration right now

pavlov2000uk commented 1 year ago

@agup006, thank you very much. I have seen it crashing in 1.9.10. People say they see it crashing in 2.0.8; I have not seen that myself yet and am still testing. Since this is open source I'm in no position to hurry you up. However, if there is a free slot within a "reasonable" timeframe, it would be very good if the team could invest some time into this issue.

pavlov2000uk commented 1 year ago

For example, fluent-bit 2.0.8, Server 2019, fluentd version 1.15.3 and this configuration:

myconf.conf - fluentbit

[SERVICE]
Flush 5
Log_Level Trace
Daemon false
Log_File C:\Program Files\fluent-bit\log\fluent-bit.log
Storage.path C:\Program Files\fluent-bit\buffer
Storage.sync normal
Storage.checksum false
Storage.backlog.mem_limit 5M

@INCLUDE C:\Program Files\fluent-bit\conf\include_dummy.conf
@INCLUDE C:\Program Files\fluent-bit\conf\include_os_metrics.conf

include_os_metrics.conf - fluentbit

[INPUT]
Name windows_exporter_metrics
Tag windows.osmetrics.${HOSTNAME}
scrape_interval 2

[OUTPUT]
Name forward
Match windows.osmetrics.${HOSTNAME}
Upstream C:\Program Files\fluent-bit\conf\upstream.conf
storage.total_limit_size 50M

upstream.conf - fluentbit

[UPSTREAM]
name myupstream

[NODE]
name myupstreamnode1
host 192.168.50.30
port 24224

td-agent.log - fluentd

<source>
  @type forward
</source>

<match windows.**>
  @type stdout
</match>

I'm getting the following error in td-agent.log:

2023-01-16 10:34:17 -0800 [warn]: #0 emit transaction failed: error_class=TypeError error="can't convert Hash into an exact number" location="C:/opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.15.3/lib/fluent/plugin/formatter_stdout.rb:42:in `at'" tag="windows.osmetrics.fluentbit208"
2023-01-16 10:34:17 -0800 [warn]: #0 suppressed same stacktrace
2023-01-16 10:34:17 -0800 [error]: #0 unexpected error on reading data host="192.168.50.122" port=49751 error_class=TypeError error="can't convert Hash into an exact number"
2023-01-16 10:34:17 -0800 [error]: #0 suppressed same stacktrace
2023-01-16 10:34:22 -0800 [warn]: #0 emit transaction failed: error_class=TypeError error="can't convert Hash into an exact number" location="C:/opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.15.3/lib/fluent/plugin/formatter_stdout.rb:42:in `at'" tag="windows.osmetrics.fluentbit208"
2023-01-16 10:34:22 -0800 [warn]: #0 suppressed same stacktrace
2023-01-16 10:34:22 -0800 [error]: #0 unexpected error on reading data host="192.168.50.122" port=49752 error_class=TypeError error="can't convert Hash into an exact number"
2023-01-16 10:34:22 -0800 [error]: #0 suppressed same stacktrace

While the same configuration works in 1.9.10.

Testing it further....

pavlov2000uk commented 1 year ago

It is crashing with 2.0.8.


kubotat commented 1 year ago

@pavlov2000uk There are two @INCLUDE files, include_dummy.conf and include_os_metrics.conf, in your base configuration. Could you remove one of them and run the repro again? That would help narrow down which input plugin causes the issue. Also, I suggest using "stdout" as the output in Fluent Bit; that is useful for isolating whether the issue is caused by the connection between Fluent Bit and Fluentd.

pavlov2000uk commented 1 year ago

@kubotat thank you very much for the proposal. This is how I see it.

"The configuration""

[SERVICE]

# Interval to flush output (seconds) / Every 5 seconds for a given output section

Flush 5
#Diagnostic level 
Log_Level Trace
#If true go to background on start
Daemon false
#Optional 'parsers' config file (can be multiple)
Parsers_File parsers.conf
#File to log diagnostic output
Log_File C:\Program Files\fluent-bit\log\fluent-bit.log
#optional location in the file system to store streams and chunks of data.
Storage.path C:\Program Files\fluent-bit\buffer
#the synchronization mode used to store the data into the file system. 
Storage.sync normal
#the data integrity check when writing and reading data from the filesystem. 
Storage.checksum false.
# storage.path + this option configure a hint of maximum value of memory to use when processing these records.
Storage.backlog.mem_limit 5M

@INCLUDE C:\Program Files\fluent-bit\conf\include_dummy.conf

@INCLUDE C:\Program Files\fluent-bit\conf\include_os_metrics.conf

With "the configuration" like above, i.e. when I do not collect windows_exporter_metrics (and anyway it is useless to collect it as Fluentd cannot process it), it may take much longer time to crash. In 1.9.10 (I need to retest one more time taking into consideration the findings below) it took me 24 hours to see the crash / in 2.0.8 it crashed after 4-5 hours. If I enable both of the includes in 2.0.8, then the crash happens immediately.

@INCLUDE C:\Program Files\fluent-bit\conf\include_dummy.conf
@INCLUDE C:\Program Files\fluent-bit\conf\include_os_metrics.conf

So the summary regarding fluent-bit 2.0.8 and the "plugin isolation" is:

  1. Collecting and forwarding windows_exporter_metrics data makes no sense, since there is an error in the fluentd logs:

     2023-01-19 01:31:34 -0800 [warn]: #0 emit transaction failed: error_class=TypeError error="can't convert Hash into an exact number" location="C:/opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.15.3/lib/fluent/plugin/formatter_stdout.rb:42:in `at'" tag="windows.osmetrics.fluentbit208"
     2023-01-19 01:31:34 -0800 [warn]: #0 suppressed same stacktrace
     2023-01-19 01:31:34 -0800 [error]: #0 unexpected error on reading data host="192.168.50.122" port=65095 error_class=TypeError error="can't convert Hash into an exact number"
     2023-01-19 01:31:34 -0800 [error]: #0 suppressed same stacktrace

  2. When I'm collecting and forwarding only the dummy logs (nginx format), the crash interval varies but the process crashes eventually:

     [2023/01/19 01:49:10] [ info] Configuration:
     [2023/01/19 01:49:10] [ info] flush time | 5.000000 seconds
     [2023/01/19 01:49:10] [ info] grace | 5 seconds
     [2023/01/19 01:49:10] [ info] daemon | 0
     [2023/01/19 01:49:10] [ info]
     [2023/01/19 01:49:10] [ info] inputs:
     [2023/01/19 01:49:10] [ info] tail
     [2023/01/19 01:49:10] [ info]
     [2023/01/19 01:49:10] [ info] filters:
     [2023/01/19 01:49:10] [ info]
     [2023/01/19 01:49:10] [ info] outputs:
     [2023/01/19 01:49:10] [ info] forward.0
     [2023/01/19 01:49:10] [ info]
     [2023/01/19 01:49:10] [ info] collectors:

  3. When I'm trying to collect data for both of the includes (in this situation I get 2 "forward" outputs) -> it fails instantly:

     [2023/01/19 01:38:41] [ info] Configuration:
     [2023/01/19 01:38:41] [ info] flush time | 5.000000 seconds
     [2023/01/19 01:38:41] [ info] grace | 5 seconds
     [2023/01/19 01:38:41] [ info] daemon | 0
     [2023/01/19 01:38:41] [ info]
     [2023/01/19 01:38:41] [ info] inputs:
     [2023/01/19 01:38:41] [ info] tail
     [2023/01/19 01:38:41] [ info] windows_exporter_metrics
     [2023/01/19 01:38:41] [ info]
     [2023/01/19 01:38:41] [ info] filters:
     [2023/01/19 01:38:41] [ info]
     [2023/01/19 01:38:41] [ info] outputs:
     [2023/01/19 01:38:41] [ info] forward.0
     [2023/01/19 01:38:41] [ info] forward.1
     [2023/01/19 01:38:41] [ info]
     [2023/01/19 01:38:41] [ info] collectors:

kubotat commented 1 year ago

@pavlov2000uk Thanks for the results. Are you able to include only include_dummy.conf and use stdout as the output instead of forward? Here is a sample stdout config:

[INPUT]
Name tail
Tag windows.dummy.${HOSTNAME}
#Insert the log name below
Path C:\Program Files\fluent-bit\log\access_log_20221221-111926.log
Read_from_Head false
Path_Key filename
Storage.type filesystem
DB winlog.sqlite
Mem_Buf_Limit 50MB
[OUTPUT]
Name stdout
Match windows.dummy.${HOSTNAME}
pavlov2000uk commented 1 year ago

@kubotat thank you.

It crashes rarely in the "tail-to-forward"-only configuration. It crashes immediately in the "tail-to-forward" AND "windows_exporter_metrics-to-forward" scenario (there are 2 forward outputs). The latter works fine when tested on Linux: Fluent Bit on Linux can handle 2 forward outputs, and I have not seen it crashing there.

I'm testing "tail-to-stdout" right now. It will take some time; tomorrow I'll have an answer on whether it crashes or not.

pavlov2000uk commented 1 year ago

@kubotat Server 2019 / Fluentbit 2.0.8

CONFIG:

myconf.conf

[SERVICE]
Flush 5
Log_Level Trace
Daemon false
Parsers_File parsers.conf
Log_File C:\Program Files\fluent-bit\log\fluent-bit.log
Storage.path C:\Program Files\fluent-bit\buffer
Storage.sync normal
Storage.checksum false.
Storage.backlog.mem_limit 50M

@INCLUDE C:\Program Files\fluent-bit\conf\include_dummy.conf

include_dummy.conf

[INPUT]
Name tail
Tag windows.dummy.${HOSTNAME}
# Insert the log name below
Path C:\Program Files\fluent-bit\log\dummy.log
Read_from_Head false
Path_Key filename
Storage.type filesystem
DB winlog.sqlite
Mem_Buf_Limit 50MB

[OUTPUT]
Name stdout
Match windows.dummy.${HOSTNAME}

RESULT:

Crashed 2 times for me.

1st time, almost immediately:

Faulting application name: fluent-bit.exe, version: 2.0.8.0, time stamp: 0x63a54cff
Faulting module name: fluent-bit.exe, version: 2.0.8.0, time stamp: 0x63a54cff
Exception code: 0xc0000005
Fault offset: 0x00000000004ec20a
Faulting process id: 0x175c
Faulting application start time: 0x01d92fd82a07d586
Faulting application path: C:\Program Files\fluent-bit\bin\fluent-bit.exe
Faulting module path: C:\Program Files\fluent-bit\bin\fluent-bit.exe
Report Id: be6e3379-35b5-43cd-bae0-98923831e1d9
Faulting package full name:
Faulting package-relative application ID:

2nd time, after several minutes:

Faulting application name: fluent-bit.exe, version: 2.0.8.0, time stamp: 0x63a54cff
Faulting module name: fluent-bit.exe, version: 2.0.8.0, time stamp: 0x63a54cff
Exception code: 0xc0000005
Fault offset: 0x0000000000529f2d
Faulting process id: 0x14bc
Faulting application start time: 0x01d92fd8d4b54f4f
Faulting application path: C:\Program Files\fluent-bit\bin\fluent-bit.exe
Faulting module path: C:\Program Files\fluent-bit\bin\fluent-bit.exe
Report Id: 8bff8d76-aec3-47ec-a839-24fd07f88ac9
Faulting package full name:
Faulting package-relative application ID:

Running with command -> "C:\Program Files\fluent-bit\bin\fluent-bit.exe" -c "C:\Program Files\fluent-bit\conf\myconf.conf"


Server 2019 is a Microsoft evaluation image install with the firewall disabled, i.e. a very basic setup.

pavlov2000uk commented 1 year ago

Any idea which configuration can be changed to stop it from crashing? Or, if the configuration looks legit, maybe there is a chance to have it fixed within some reasonable timeframe? Thanks in advance.

pavlov2000uk commented 1 year ago

More Fluent Bit tests. Simple configuration on Windows Server 2019: Tail (from a dummy Nginx log) -> Stdout.

1.8.15 -> running stable for several hours.
1.9.10 -> crashes from time to time.
2.0.8 -> crashes from time to time.

Hence it is not a configuration issue: the same configuration is used for 1.8.15, 1.9.10 and 2.0.8, yet 1.8.15 is stable and the others crash.

Any help is appreciated.

douglasawh commented 1 year ago

No proof at the moment, but I have a hunch the issue was introduced here: https://github.com/fluent/fluent-bit/commit/c9be01fe631a8563cacdf4481706dc0da9e288f3

MrTomasz commented 1 year ago

For me it does crash when using filesystem storage for the tail input logs; when using memory-based storage it doesn't crash and works well, but other things happen (see #6748).

From what I've seen, after some time the processing logic hits some sort of msgpack buffer misalignment when unpacking msgpack'ed log buffers into JSON (I use the Loki output). The read pointer eventually became misaligned (wrong parsing?) and the msgpack unpacker started parsing from the wrong offset, which ended in an ACCESS_VIOLATION crash once it hit the allocated buffer boundary. It basically stopped hitting MSGPACK_ARRAY objects and started parsing the internals of an MSGPACK_ARRAY object while trying to find the MSGPACK_ARRAY itself.
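For anyone unfamiliar with that framing: log records are packed as small msgpack arrays and the unpacker walks the buffer record by record. The sketch below is not fluent-bit code; it is a minimal, self-contained msgpack-c program with made-up record contents that shows how resuming from a misaligned offset makes the unpacker see record internals instead of the top-level arrays, which is the symptom described above. In the crash dumps earlier in this issue the bad address presumably ended up outside the mapped chunk entirely, so the read faulted instead of just producing garbage objects.

/* Minimal msgpack-c sketch (NOT fluent-bit code): pack two [timestamp, map]
 * records, then unpack once from the correct offset and once from a
 * deliberately misaligned offset to show how the framing breaks. */
#include <stdio.h>
#include <string.h>
#include <msgpack.h>

static void walk(const char *buf, size_t len, size_t off)
{
    msgpack_unpacked result;
    msgpack_unpacked_init(&result);

    while (msgpack_unpack_next(&result, buf, len, &off) == MSGPACK_UNPACK_SUCCESS) {
        msgpack_object obj = result.data;
        if (obj.type == MSGPACK_OBJECT_ARRAY) {
            printf("record: array with %u elements\n", obj.via.array.size);
        }
        else {
            /* With a misaligned offset we land inside a record and see its
             * internals (integers, maps, strings) as top-level objects. */
            printf("unexpected top-level object, type=%d\n", (int) obj.type);
        }
    }
    msgpack_unpacked_destroy(&result);
}

int main(void)
{
    msgpack_sbuffer sbuf;
    msgpack_packer pk;
    const char *msg = "GET /index.html 200";

    msgpack_sbuffer_init(&sbuf);
    msgpack_packer_init(&pk, &sbuf, msgpack_sbuffer_write);

    for (int i = 0; i < 2; i++) {
        msgpack_pack_array(&pk, 2);            /* [timestamp, record] */
        msgpack_pack_uint64(&pk, 1671710225 + i);
        msgpack_pack_map(&pk, 1);
        msgpack_pack_str(&pk, 3);
        msgpack_pack_str_body(&pk, "log", 3);
        msgpack_pack_str(&pk, strlen(msg));
        msgpack_pack_str_body(&pk, msg, strlen(msg));
    }

    puts("-- aligned walk --");
    walk(sbuf.data, sbuf.size, 0);

    puts("-- misaligned walk (offset 1) --");
    walk(sbuf.data, sbuf.size, 1);             /* start inside the first record */

    msgpack_sbuffer_destroy(&sbuf);
    return 0;
}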

I have updated msgpack to latest v5.0.0, but it didn't help.

@douglasawh I don't know yet how the commit you've mentioned (onigmo lib update) is affecting this, but I'll try reverting it, just because I ran out of options.

pavlov2000uk commented 1 year ago

@MrTomasz thank you. Tested, and I agree with your findings. The crash is not reproducible if the filesystem buffer is not configured. If I run a 1.9.8 test, TAIL -> STDOUT with the filesystem buffer, it does not crash; the same test with 1.9.9 does crash from time to time. It looks like the problem was introduced starting with 1.9.9.

douglasawh commented 1 year ago

@MrTomasz we picked out that commit for investigation after looking at the dump. Unfortunately, the dump we have is from Calyptia and the pdb in the fluent-bit zip doesn't work with it. Depending on how things go tomorrow, I might be able to generate my own pdb, at which point I should be able to see whether what we are seeing in that commit is a red herring or not.

MrTomasz commented 1 year ago

@douglasawh any news?

douglasawh commented 1 year ago

reproduced on 2.0.9

MrTomasz commented 1 year ago

Well, yes, that was expected as there were no updates in the affected areas for 2.0.9 as far as I can see (I was on 2.0.9-rc).

Any luck with the onigmo commit you've mentioned earlier?

douglasawh commented 1 year ago

I spoke to @agup006 and some others in slack and was told "I think we have someone to take a look in 2.1 timeframe".

I mention this here partially because I know some folks are using the github for status. Also though, I have a question. This is still listed as "waiting-for-triage". To me, the slack conversation seems a bit like a triage discussion. Is the difference here that one conversation is a Project Management discussion and the other is a development discussion? This is the assumption I have passed on, but I said I would double-check, so here I am. :)

braydonk commented 1 year ago

I have been investigating this issue as we have been observing it since updating from v1.9.8 to major version 2.

I have reason to believe the issue is resolved in the 2.0 branch (eventually to be v2.0.12). While we were investigating off v2.0.10, we happened to notice that the culprit of the access violation is the buffer address in the event chunk. This buffer address is read from the input chunk during the output flush process.

I think this commit fixes the issue: https://github.com/fluent/fluent-bit/commit/a792c10740cb79115547465e5ce0eea50037be63

@leonardo-albertovich helped me understand some of the history: in 1.9.9, chunkio was updated to use mapped files on Windows. With mapped files on Windows, locking the chunk can sometimes cause the file to be remapped. Because the chunk lock used to happen after reading the buffer data, this led to a sporadic TOCTOU bug: the buffer address no longer pointed at input chunk data (the mapping had changed). Locking the chunk before getting the buffer address resolves that issue. (Leonardo, feel free to correct me if I got any info wrong.)
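A minimal sketch of that ordering issue, assuming nothing about the real chunkio/fluent-bit API: the chunk struct and the chunk_lock/chunk_buffer helpers below are made-up stand-ins that simulate a remap by moving the backing buffer, purely to show why taking the buffer address before locking can leave the flush path holding a stale pointer. The fix described above corresponds to the flush_fixed ordering (lock first, then read the buffer address and size).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a chunk backed by a memory-mapped file. These names are
 * NOT the real fluent-bit/chunkio API; they only model the ordering bug. */
struct chunk {
    char  *data;
    size_t size;
};

static char  *chunk_buffer(struct chunk *c) { return c->data; }
static size_t chunk_size(struct chunk *c)   { return c->size; }

/* On Windows, locking a mapped chunk can cause the file to be remapped.
 * Simulate that by moving the backing buffer to a new address. */
static void chunk_lock(struct chunk *c)
{
    char *moved = malloc(c->size);
    memcpy(moved, c->data, c->size);
    free(c->data);                 /* the old mapping is gone after this */
    c->data = moved;
}

static void send_to_output(const char *buf, size_t size)
{
    printf("flushing %zu bytes: %.*s\n", size, (int) size, buf);
}

/* Buggy order (pre-fix): take the buffer address, then lock. If the lock
 * remaps the chunk, 'buf' is stale and the later read is the kind of
 * invalid read shown in the dumps above. */
static void flush_buggy(struct chunk *c)
{
    char  *buf  = chunk_buffer(c);   /* time of check */
    size_t size = chunk_size(c);
    chunk_lock(c);                   /* remap may happen here */
    send_to_output(buf, size);       /* time of use: stale pointer */
}

/* Fixed order: lock first, then take the (now stable) buffer address. */
static void flush_fixed(struct chunk *c)
{
    chunk_lock(c);
    send_to_output(chunk_buffer(c), chunk_size(c));
}

int main(void)
{
    const char *msg = "dummy nginx access log record";
    struct chunk c;

    c.size = strlen(msg);
    c.data = malloc(c.size);
    memcpy(c.data, msg, c.size);

    flush_fixed(&c);      /* safe ordering */
    /* flush_buggy(&c);      would dereference freed memory */

    free(c.data);
    return 0;
}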

If any folks are still affected by this, I would recommend trying to reproduce off the 2.0 branch if possible, or wait for 2.0.12 and try to reproduce then.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] commented 1 year ago

This issue was closed because it has been stalled for 5 days with no activity.