fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Filesystem based chunk storage results in "chunk_io_locked" exception followed by fluentbit process termination #4598

Closed Sabari-Arunkumar-ML closed 2 years ago

Sabari-Arunkumar-ML commented 2 years ago

Version: 1.7.5 Environment: Ubuntu (containerized, k8s)

We have a high load in production and multiple tail inputs plus rewrite_tag filters in our pipeline. When new log files appear over time in the k8s cluster, we restart Fluent Bit (a SIGTERM followed by creation of a new process).

I can see Fluent Bit crashing sporadically.

Following is the GDB backtrace observed in one of the crash dump files:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007fe965090921 in __GI_abort () at abort.c:79
#2  0x0000000000436a72 in flb_signal_handler (signal=11) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/src/fluent-bit.c:514
#3  <signal handler called>
#4  0x000000000072e3ee in cio_chunk_is_locked (ch=0x36) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/lib/chunkio/src/cio_chunk.c:343
#5  0x0000000000478b7c in input_chunk_get (tag=0x7fe960446f30 "klog", tag_len=4, in=0x7fe9603f2a80, chunk_size=368, set_down=0x7fe96504df08) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/src/flb_input_chunk.c:630
#6  0x0000000000479121 in flb_input_chunk_append_raw (in=0x7fe9603f2a80, tag=0x7fe960446f30 "klog", tag_len=4, buf=0x7fe96001b9d0, buf_size=368) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/src/flb_input_chunk.c:865
#7  0x000000000048cf47 in in_emitter_add_record (tag=0x7fe960446680 "klog", tag_len=4, buf_data=0x7fe96624901f <error: Cannot access memory at address 0x7fe96624901f>, buf_size=368, in=0x7fe9603f2a80) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/plugins/in_emitter/emitter.c:117
#8  0x0000000000523635 in process_record (tag=0x7fe960452b90 "kubelet", tag_len=7, map=..., buf=0x7fe96624901f, buf_size=368, keep=0x7fe96504e160, ctx=0x7fe9603f2270) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/plugins/filter_rewrite_tag/rewrite_tag.c:324
#9  0x000000000052378b in cb_rewrite_tag_filter (data=0x7fe96624901f, bytes=368, tag=0x7fe960452b90 "kubelet", tag_len=7, out_buf=0x7fe96504e1f8, out_bytes=0x7fe96504e1e8, f_ins=0x1320a40, filter_context=0x7fe9603f2270, config=0x128e290) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/plugins/filter_rewrite_tag/rewrite_tag.c:375
#10 0x000000000044cc0c in flb_filter_do (ic=0x7fe96043ee70, data=0x7fe960018ce0, bytes=371, tag=0x7fe96043ef00 "kubelet", tag_len=7, config=0x128e290) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/src/flb_filter.c:118
#11 0x00000000004792ee in flb_input_chunk_append_raw (in=0x7fe960404f50, tag=0x7fe96043ef00 "kubelet", tag_len=7, buf=0x7fe960018ce0, buf_size=371) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/src/flb_input_chunk.c:911
#12 0x000000000048cf47 in in_emitter_add_record (tag=0x7fe9604520c0 "kubelet", tag_len=7, buf_data=0x7fe96045c5d0 "\222\327", buf_size=371, in=0x7fe960404f50) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/plugins/in_emitter/emitter.c:117
#13 0x0000000000523635 in process_record (tag=0x7fe96042ede0 "syslog", tag_len=6, map=..., buf=0x7fe96045c5d0, buf_size=371, keep=0x7fe96504e510, ctx=0x7fe960404740) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/plugins/filter_rewrite_tag/rewrite_tag.c:324
#14 0x000000000052378b in cb_rewrite_tag_filter (data=0x7fe96045c5d0, bytes=371, tag=0x7fe96042ede0 "syslog", tag_len=6, out_buf=0x7fe96504e5a8, out_bytes=0x7fe96504e598, f_ins=0x1322050, filter_context=0x7fe960404740, config=0x128e290) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/plugins/filter_rewrite_tag/rewrite_tag.c:375
#15 0x000000000044cc0c in flb_filter_do (ic=0x7fe9604519d0, data=0x7fe960013630, bytes=177, tag=0x7fe96029a320 "syslog", tag_len=6, config=0x128e290) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/src/flb_filter.c:118
#16 0x00000000004792ee in flb_input_chunk_append_raw (in=0x12c30e0, tag=0x7fe96029a320 "syslog", tag_len=6, buf=0x7fe960013630, buf_size=177) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/src/flb_input_chunk.c:911
#17 0x0000000000491487 in process_content (file=0x7fe9602a11b0, bytes=0x7fe96504e858) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/plugins/in_tail/tail_file.c:367
#18 0x000000000049316b in flb_tail_file_chunk (file=0x7fe9602a11b0) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/plugins/in_tail/tail_file.c:994
#19 0x000000000048d9ba in in_tail_collect_event (file=0x7fe9602a11b0, config=0x128e290) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/plugins/in_tail/tail.c:261
#20 0x0000000000498277 in tail_fs_event (ins=0x12c30e0, config=0x128e290, in_context=0x7fe96029dec0) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/plugins/in_tail/tail_fs_inotify.c:268
#21 0x000000000044c6cd in flb_input_collector_fd (fd=215, config=0x128e290) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/src/flb_input.c:1004
#22 0x000000000045c5d8 in flb_engine_handle_event (config=0x128e290, mask=1, fd=215) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/src/flb_engine.c:363
#23 flb_engine_start (config=0x128e290) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/src/flb_engine.c:624
#24 0x00000000004422db in flb_lib_worker (data=0x128e260) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/src/flb_lib.c:493
#25 0x00007fe965e0a6db in start_thread (arg=0x7fe96504f700) at pthread_create.c:463

Note: We never hit this scenario when we were not using filesystem-based storage.

patrick-stephens commented 2 years ago

The stack trace looks like it has your username in it, so presumably you compiled Fluent Bit from source. What options did you use, on what OS and compiler, etc.? Any other patches? The latest code is fairly different, so could you try a more recent version to confirm whether this is a bug that is already fixed?

Sabari-Arunkumar-ML commented 2 years ago

Hi Patrick,

Tried with the latest Fluent Bit v1.9 (active master branch), compiled manually by me. It looks like the bug is still not fixed.

Build machine: Amazon Linux 2 (a CentOS variant). Build tool: cmake3.

Relevant settings used in our pipeline configuration:

At the service level:
storage.max_chunks_up 128
storage.backlog.mem_limit 128M (tried 50MB as well)

At the filter level (rewrite_tag; 48 instances in our pipeline):
Emitter_Mem_Buf_Limit 64M
Emitter_Storage.type filesystem

At the input level (tail; 14 instances in our pipeline):
Buffer_Chunk_Size 100k
Buffer_Max_Size 1MB
Mem_Buf_Limit 50MB
Storage.Type filesystem
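Assembled into classic-config form (keys exactly as listed above; all other keys omitted and section order illustrative, so this is a sketch rather than the full production file):

```ini
[SERVICE]
    storage.max_chunks_up      128
    storage.backlog.mem_limit  128M

[INPUT]
    Name               tail
    Buffer_Chunk_Size  100k
    Buffer_Max_Size    1MB
    Mem_Buf_Limit      50MB
    Storage.Type       filesystem

[FILTER]
    Name                   rewrite_tag
    Emitter_Mem_Buf_Limit  64M
    Emitter_Storage.type   filesystem
```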

Following are the build flags and active plugins in the built binary:

Inputs: cpu, mem, thermal, kmsg, proc, disk, netif, docker, docker_events, node_exporter_metrics, fluentbit_metrics, tail, dummy, head, health, http, collectd, statsd, nginx_metrics, serial, stdin, syslog, exec, tcp, mqtt, forward, random

Filters: alter_size, aws, checklist, record_modifier, throttle, type_converter, apm_geoip (custom plugin: adds geo information), apm_uaparser (custom plugin: adds user-agent information), apm_url_normalizer (custom plugin: adds normalized URL information), kubernetes, modify, multiline, nest, parser, expect, grep, rewrite_tag, lua, stdout, geoip2

Outputs: azure, azure_blob, bigquery, counter, datadog, es, exit, file, forward, http, influxdb, logdna, loki, kafka-rest, nats, nrlogs, null, plot, skywalking, slack, splunk, stackdriver, stdout, syslog, tcp, td, flowcounter, gelf, websocket, cloudwatch_logs, kinesis_firehose, kinesis_streams, prometheus_exporter, prometheus_remote_write, s3

Internal event loop: epoll

Build flags: FLB_HAVE_IN_STORAGE_BACKLOG FLB_HAVE_PARSER FLB_HAVE_RECORD_ACCESSOR FLB_HAVE_STREAM_PROCESSOR JSMN_PARENT_LINKS JSMN_STRICT FLB_HAVE_TLS FLB_HAVE_METRICS FLB_HAVE_AWS FLB_HAVE_AWS_CREDENTIAL_PROCESS FLB_HAVE_SIGNV4 FLB_HAVE_SQLDB FLB_HAVE_HTTP_SERVER FLB_HAVE_FORK FLB_HAVE_GMTOFF FLB_HAVE_UNIX_SOCKET FLB_HAVE_ATTRIBUTE_ALLOC_SIZE FLB_HAVE_PROXY_GO FLB_HAVE_LIBBACKTRACE FLB_HAVE_REGEX FLB_HAVE_UTF8_ENCODER FLB_HAVE_LUAJIT FLB_HAVE_C_TLS FLB_HAVE_ACCEPT4 FLB_HAVE_INOTIFY

Error message:

[2022/01/17 08:37:08] [error] [filter:lua:lua.85] failed to allocate outbuf
[2022/01/17 08:37:08] [error] [filter:lua:lua.171] failed to allocate outbuf
[2022/01/17 08:37:09] [engine] caught signal (SIGSEGV)
#0  0x79dda1 in cio_chunk_is_locked() at lib/chunkio/src/cio_chunk.c:375
#1  0x48aa2c in input_chunk_get() at src/flb_input_chunk.c:1124
#2  0x48b25c in input_chunk_append_raw() at src/flb_input_chunk.c:1409
#3  0x48b84b in flb_input_chunk_append_raw() at src/flb_input_chunk.c:1593
#4  0x4ab4de in in_emitter_add_record() at plugins/in_emitter/emitter.c:117
#5  0x55d281 in process_record() at plugins/filter_rewrite_tag/rewrite_tag.c:352
#6  0x55d415 in cb_rewrite_tag_filter() at plugins/filter_rewrite_tag/rewrite_tag.c:411
#7  0x457097 in flb_filter_do() at src/flb_filter.c:124
#8  0x48b544 in input_chunk_append_raw() at src/flb_input_chunk.c:1479
#9  0x48b84b in flb_input_chunk_append_raw() at src/flb_input_chunk.c:1593
#10 0x4aed2f in flb_tail_mult_process_first() at plugins/in_tail/tail_multiline.c:166
#11 0x4af197 in flb_tail_mult_process_content() at plugins/in_tail/tail_multiline.c:310
#12 0x4b1780 in process_content() at plugins/in_tail/tail_file.c:437
#13 0x4b39d2 in flb_tail_file_chunk() at plugins/in_tail/tail_file.c:1213
#14 0x4ac028 in in_tail_collect_event() at plugins/in_tail/tail.c:261
#15 0x4ad1e5 in tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:268
#16 0x456b83 in flb_input_collector_fd() at src/flb_input.c:1102
#17 0x4694eb in flb_engine_handle_event() at src/flb_engine.c:413
#18 0x4694eb in flb_engine_start() at src/flb_engine.c:705
#19 0x44b770 in flb_lib_worker() at src/flb_lib.c:627
#20 0x7f289459b6da in ???() at ???:0
#21 0x7f289390271e in ???() at ???:0
#22 0xffffffffffffffff in ???() at ???:0

EKwongChum commented 2 years ago

Would you mind showing your fluent-bit.conf here?

hrivera-ntap commented 2 years ago

I was able to reproduce the rewrite_tag chunk_io_locked segfault with moderate log-tailing load on a k8s 1.21.5 cluster (running Fluent Bit 1.8.15 as a daemonset).

My data pipeline is primarily:

input tail (all container logs) -> kubernetes filter -> rewrite_tag with re-emit

The original (non-rewrite_tag) records are sent to Logstash, while the records processed by rewrite_tag are ultimately sent to Elasticsearch.

NOTE: /mnt/flb-storage is a HostPath mount scoped to each node's /tmp/ folder

  fluent-bit.conf: |
    [SERVICE]
        Log_Level     warning
        Parsers_File  parsers.conf
        storage.path  /mnt/flb-storage/
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_PORT     2020

    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        Parser            docker
        DB                /mnt/flb-storage/flb_kube.db
        Mem_Buf_Limit     100MB
        Read_from_Head    On
        Skip_Long_Lines   On
        Refresh_Interval  10
        storage.type      filesystem

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Merge_Log           On
        K8S-Logging.Parser  On
        Buffer_Size         0

    [OUTPUT]
        Name            http
        Match           kube.*
        Host            ${FLUENT_LOGSTASH_HOST}
        Port            ${FLUENT_LOGSTASH_PORT}
        Format          json

    [FILTER]
        Name                  rewrite_tag
        Match                 kube.*
        Rule                  $kubernetes['labels']['foo'] ^(bar|baz)$ foo.$TAG true
        Emitter_Name          foo_entities
        Emitter_Storage.type  filesystem
        Emitter_Mem_Buf_Limit 100MB

    [OUTPUT]
        Name               es
        Match              foo.*
        Host               ${FLUENT_ELASTICSEARCH_HOST}
        Port               ${FLUENT_ELASTICSEARCH_PORT}
        Index              myindex
        Logstash_Format    False
        Retry_Limit        False
        Replace_Dots       On

  parsers.conf: |
    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep   On
        # Command      |  Decoder    | Field | Optional Action
        # =============|=============|=======|=================
        Decode_Field_As escaped_utf8  log

Stack trace and logs leading up to the error:

[2022/04/01 15:21:13] [error] [storage backlog] could not retrieve chunk tag from tail.0/1-1648435921.275351581.flb, removing it from the queue
[2022/04/01 15:21:13] [error] [storage backlog] could not retrieve chunk tag from tail.0/1-1648435966.459424147.flb, removing it from the queue
[2022/04/01 15:21:13] [error] [storage backlog] could not retrieve chunk tag from tail.0/1-1648436017.386825391.flb, removing it from the queue
[2022/04/01 15:21:13] [error] [storage backlog] could not retrieve chunk tag from tail.0/1-1648436671.633580656.flb, removing it from the queue
[2022/04/01 15:21:13] [error] [storage backlog] could not retrieve chunk tag from tail.0/1-1648436807.915247861.flb, removing it from the queue
[2022/04/01 15:21:13] [error] [storage backlog] could not retrieve chunk tag from tail.0/1-1648436848.848801159.flb, removing it from the queue
[2022/04/01 15:21:13] [error] [storage backlog] could not retrieve chunk tag from tail.0/1-1648436900.178365313.flb, removing it from the queue
[2022/04/01 15:21:13] [error] [storage backlog] could not retrieve chunk tag from tail.0/1-1648436941.118767924.flb, removing it from the queue
[2022/04/01 15:21:13] [error] [storage backlog] could not retrieve chunk tag from tail.0/1-1648437673.813974810.flb, removing it from the queue
[2022/04/01 15:21:13] [error] [storage backlog] could not retrieve chunk tag from tail.0/1-1648437732.835760926.flb, removing it from the queue
[2022/04/01 15:21:13] [error] [storage backlog] could not retrieve chunk tag from tail.0/1-1648437765.293384250.flb, removing it from the queue
[2022/04/01 15:21:13] [error] [storage backlog] could not retrieve chunk tag from tail.0/1-1648437871.782773009.flb, removing it from the queue
[2022/04/01 15:21:27] [engine] caught signal (SIGSEGV)
#0  0x557806733acf      in  cio_chunk_is_locked() at lib/chunkio/src/cio_chunk.c:375
#1  0x55780645649d      in  input_chunk_get() at src/flb_input_chunk.c:1017
#2  0x557806456cd5      in  flb_input_chunk_append_raw() at src/flb_input_chunk.c:1301
#3  0x55780647a4e3      in  in_emitter_add_record() at plugins/in_emitter/emitter.c:117
#4  0x5578065260be      in  process_record() at plugins/filter_rewrite_tag/rewrite_tag.c:355
#5  0x557806526226      in  cb_rewrite_tag_filter() at plugins/filter_rewrite_tag/rewrite_tag.c:416
#6  0x557806422bff      in  flb_filter_do() at src/flb_filter.c:125
#7  0x557806456fac      in  flb_input_chunk_append_raw() at src/flb_input_chunk.c:1366
#8  0x5578064817ee      in  ml_flush_callback() at plugins/in_tail/tail_file.c:834
#9  0x55780644367d      in  flb_ml_flush_stream_group() at src/multiline/flb_ml.c:1016
#10 0x557806441e34      in  package_content() at src/multiline/flb_ml.c:300
#11 0x5578064423bb      in  process_append() at src/multiline/flb_ml.c:495
#12 0x557806442606      in  ml_append_try_parser() at src/multiline/flb_ml.c:564
#13 0x55780644280e      in  flb_ml_append() at src/multiline/flb_ml.c:630
#14 0x55780648099c      in  process_content() at plugins/in_tail/tail_file.c:431
#15 0x557806483097      in  flb_tail_file_chunk() at plugins/in_tail/tail_file.c:1310
#16 0x55780647b18c      in  in_tail_collect_event() at plugins/in_tail/tail.c:311
#17 0x55780647c36b      in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:268
#18 0x5578064226a0      in  flb_input_collector_fd() at src/flb_input.c:1110
#19 0x5578064354b8      in  flb_engine_handle_event() at src/flb_engine.c:411
#20 0x5578064354b8      in  flb_engine_start() at src/flb_engine.c:696
#21 0x557806416f58      in  flb_lib_worker() at src/flb_lib.c:627
#22 0x7f0faa46fea6      in  ???() at ???:0
#23 0x7f0fa9d71dee      in  ???() at ???:0
#24 0xffffffffffffffff  in  ???() at ???:0
hrivera-ntap commented 2 years ago

FWIW, disabling all of the storage.* & Emitter_Storage.* options in my configuration file made the segfault go away.
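In config terms (my reading of that workaround, not a verified fix), this means removing the filesystem buffering keys so both the tail input and the rewrite_tag emitter buffer purely in memory:

```ini
[SERVICE]
    # storage.path  /mnt/flb-storage/    <- removed

[INPUT]
    Name            tail
    # storage.type  filesystem           <- removed; defaults to memory

[FILTER]
    Name                    rewrite_tag
    # Emitter_Storage.type  filesystem   <- removed; emitter stays in memory
```

The obvious trade-off is durability: without filesystem buffering, chunks queued in memory are lost if the pod restarts.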

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

Sabari-Arunkumar-ML commented 2 years ago

Any update, @patrick-stephens / @EKwongChum?

patrick-stephens commented 2 years ago

Not really: both versions listed in this issue are quite old now, so I would encourage you to try the latest release to see if the issue is already resolved. I don't believe there will be any further releases of 1.7 or 1.8.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] commented 2 years ago

This issue was closed because it has been stalled for 5 days with no activity.