Closed: Sabari-Arunkumar-ML closed this issue 2 years ago.
The stack trace looks like it has your username in it, so presumably you've compiled Fluent Bit from source to get it. What options did you use, on what OS and with what compiler, etc.? Any other patches? It looks like the latest code is fairly different, so is it possible to try a more recent version to confirm whether this is a bug that is already fixed?
Hi Patrick,
Tried with the latest Fluent Bit v1.9 (active master branch), manually compiled by me. It looks like the bug is still not fixed.
Build machine: Amazon Linux 2 (a CentOS variant). Build tool: cmake3.
Relevant settings used in our configuration pipeline (sketched below):

At service level:
storage.max_chunks_up 128
storage.backlog.mem_limit 128M (tried with 50MB as well)

At filter level (rewrite_tag; 48 such filters in our pipeline):
Emitter_Mem_Buf_Limit 64M
Emitter_Storage.type filesystem

At input level (tail; 14 such inputs in our pipeline):
Buffer_Chunk_Size 100k
Buffer_Max_Size 1MB
Mem_Buf_Limit 50MB
Storage.Type filesystem
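For reference, here is a minimal sketch of how those settings map onto a config file. The Path, Match, and Rule values are placeholders, and the service-level storage.path (not quoted above, but required for filesystem buffering) is assumed:

[SERVICE]
    # storage.path is assumed here; filesystem buffering requires it
    storage.path              /var/lib/fluent-bit/storage
    storage.max_chunks_up     128
    # 50MB was also tried for the backlog limit
    storage.backlog.mem_limit 128M

# one of the 14 tail inputs; Path is a placeholder
[INPUT]
    Name              tail
    Path              /var/log/app/*.log
    Buffer_Chunk_Size 100k
    Buffer_Max_Size   1MB
    Mem_Buf_Limit     50MB
    Storage.Type      filesystem

# one of the 48 rewrite_tag filters; Match and Rule are placeholders
[FILTER]
    Name                  rewrite_tag
    Match                 app.*
    Rule                  $log ^.*$ rewritten.$TAG false
    Emitter_Mem_Buf_Limit 64M
    Emitter_Storage.type  filesystem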
Following are the build flags and active plugins in the built binary:
Inputs:
  cpu                       CPU Usage
  mem                       Memory Usage
  thermal                   Thermal
  kmsg                      Kernel Log Buffer
  proc                      Check Process health
  disk                      Diskstats
  netif                     Network Interface Usage
  docker                    Docker containers metrics
  docker_events             Docker events
  node_exporter_metrics     Node Exporter Metrics (Prometheus Compatible)
  fluentbit_metrics         Fluent Bit internal metrics
  tail                      Tail files
  dummy                     Generate dummy data
  head                      Head Input
  health                    Check TCP server health
  http                      HTTP
  collectd                  collectd input plugin
  statsd                    StatsD input plugin
  nginx_metrics             Nginx status metrics
  serial                    Serial input
  stdin                     Standard Input
  syslog                    Syslog
  exec                      Exec Input
  tcp                       TCP
  mqtt                      MQTT, listen for Publish messages
  forward                   Fluentd in-forward
  random                    Random

Filters:
  alter_size                Alter incoming chunk size
  aws                       Add AWS Metadata
  checklist                 Check records and flag them
  record_modifier           modify record
  throttle                  Throttle messages using sliding window algorithm
  type_converter            Data type converter
  apm_geoip                 Adds Geo Information (custom plugin created for our usage)
  apm_uaparser              Adds User Agent Information (custom plugin created for our usage)
  apm_url_normalizer        Adds Normalized URL Information (custom plugin created for our usage)
  kubernetes                Filter to append Kubernetes metadata
  modify                    modify records by applying rules
  multiline                 Concatenate multiline messages
  nest                      nest events by specified field values
  parser                    Parse events
  expect                    Validate expected keys and values
  grep                      grep events by specified field values
  rewrite_tag               Rewrite records tags
  lua                       Lua Scripting Filter
  stdout                    Filter events to STDOUT
  geoip2                    add geoip information to records

Outputs:
  azure                     Send events to Azure HTTP Event Collector
  azure_blob                Azure Blob Storage
  bigquery                  Send events to BigQuery via streaming insert
  counter                   Records counter
  datadog                   Send events to DataDog HTTP Event Collector
  es                        Elasticsearch
  exit                      Exit after a number of flushes (test purposes)
  file                      Generate log file
  forward                   Forward (Fluentd protocol)
  http                      HTTP Output
  influxdb                  InfluxDB Time Series
  logdna                    LogDNA
  loki                      Loki
  kafka-rest                Kafka REST Proxy
  nats                      NATS Server
  nrlogs                    New Relic
  null                      Throws away events
  plot                      Generate data file for GNU Plot
  skywalking                Send logs into log collector on SkyWalking OAP
  slack                     Send events to a Slack channel
  splunk                    Send events to Splunk HTTP Event Collector
  stackdriver               Send events to Google Stackdriver Logging
  stdout                    Prints events to STDOUT
  syslog                    Syslog
  tcp                       TCP Output
  td                        Treasure Data
  flowcounter               FlowCounter
  gelf                      GELF Output
  websocket                 Websocket
  cloudwatch_logs           Send logs to Amazon CloudWatch
  kinesis_firehose          Send logs to Amazon Kinesis Firehose
  kinesis_streams           Send logs to Amazon Kinesis Streams
  prometheus_exporter       Prometheus Exporter
  prometheus_remote_write   Prometheus remote write
  s3                        Send to S3

Internal event loop: epoll

Build flags: FLB_HAVE_IN_STORAGE_BACKLOG FLB_HAVE_PARSER FLB_HAVE_RECORD_ACCESSOR FLB_HAVE_STREAM_PROCESSOR JSMN_PARENT_LINKS JSMN_STRICT FLB_HAVE_TLS FLB_HAVE_METRICS FLB_HAVE_AWS FLB_HAVE_AWS_CREDENTIAL_PROCESS FLB_HAVE_SIGNV4 FLB_HAVE_SQLDB FLB_HAVE_HTTP_SERVER FLB_HAVE_FORK FLB_HAVE_GMTOFF FLB_HAVE_UNIX_SOCKET FLB_HAVE_ATTRIBUTE_ALLOC_SIZE FLB_HAVE_PROXY_GO FLB_HAVE_LIBBACKTRACE FLB_HAVE_REGEX FLB_HAVE_UTF8_ENCODER FLB_HAVE_LUAJIT FLB_HAVE_C_TLS FLB_HAVE_ACCEPT4 FLB_HAVE_INOTIFY
Error messages:
[2022/01/17 08:37:08] [error] [filter:lua:lua.85] failed to allocate outbuf
[2022/01/17 08:37:08] [error] [filter:lua:lua.171] failed to allocate outbuf
[2022/01/17 08:37:09] [engine] caught signal (SIGSEGV)
Would you mind showing your fluent-bit.conf here?
I was able to reproduce the rewrite_tag cio_chunk_is_locked segfault with moderate log-tailing load on a k8s 1.21.5 cluster (running Fluent Bit 1.8.15 as a DaemonSet).
My data pipeline is primarily:
input tail (all container logs) -> kubernetes filter -> rewrite_tag with re-emit
The original (non-rewrite_tag) records are sent to Logstash, while the records processed by rewrite_tag are ultimately sent to Elasticsearch.
NOTE: /mnt/flb-storage is a hostPath mount scoped to each node's /tmp folder.
fluent-bit.conf:
[SERVICE]
Log_Level warning
Parsers_File parsers.conf
storage.path /mnt/flb-storage/
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_PORT 2020
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/*.log
Parser docker
DB /mnt/flb-storage/flb_kube.db
Mem_Buf_Limit 100MB
Read_from_Head On
Skip_Long_Lines On
Refresh_Interval 10
storage.type filesystem
[FILTER]
Name kubernetes
Match kube.*
Merge_Log On
K8S-Logging.Parser On
Buffer_Size 0
[OUTPUT]
Name http
Match kube.*
Host ${FLUENT_LOGSTASH_HOST}
Port ${FLUENT_LOGSTASH_PORT}
Format json
[FILTER]
Name rewrite_tag
Match kube.*
Rule $kubernetes['labels']['foo'] ^(bar|baz)$ foo.$TAG true
Emitter_Name foo_entities
Emitter_Storage.type filesystem
Emitter_Mem_Buf_Limit 100MB
[OUTPUT]
Name es
Match foo.*
Host ${FLUENT_ELASTICSEARCH_HOST}
Port ${FLUENT_ELASTICSEARCH_PORT}
Index myindex
Logstash_Format False
Retry_Limit False
Replace_Dots On
parsers.conf:
[PARSER]
Name docker
Format json
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
Time_Keep On
# Command | Decoder | Field | Optional Action
# =============|=============|=======|=================
Decode_Field_As escaped_utf8 log
Stack trace and logs leading up to the error.
[2022/04/01 15:21:13] [error] [storage backlog] could not retrieve chunk tag from tail.0/1-1648435921.275351581.flb, removing it from the queue
[2022/04/01 15:21:13] [error] [storage backlog] could not retrieve chunk tag from tail.0/1-1648435966.459424147.flb, removing it from the queue
[2022/04/01 15:21:13] [error] [storage backlog] could not retrieve chunk tag from tail.0/1-1648436017.386825391.flb, removing it from the queue
[2022/04/01 15:21:13] [error] [storage backlog] could not retrieve chunk tag from tail.0/1-1648436671.633580656.flb, removing it from the queue
[2022/04/01 15:21:13] [error] [storage backlog] could not retrieve chunk tag from tail.0/1-1648436807.915247861.flb, removing it from the queue
[2022/04/01 15:21:13] [error] [storage backlog] could not retrieve chunk tag from tail.0/1-1648436848.848801159.flb, removing it from the queue
[2022/04/01 15:21:13] [error] [storage backlog] could not retrieve chunk tag from tail.0/1-1648436900.178365313.flb, removing it from the queue
[2022/04/01 15:21:13] [error] [storage backlog] could not retrieve chunk tag from tail.0/1-1648436941.118767924.flb, removing it from the queue
[2022/04/01 15:21:13] [error] [storage backlog] could not retrieve chunk tag from tail.0/1-1648437673.813974810.flb, removing it from the queue
[2022/04/01 15:21:13] [error] [storage backlog] could not retrieve chunk tag from tail.0/1-1648437732.835760926.flb, removing it from the queue
[2022/04/01 15:21:13] [error] [storage backlog] could not retrieve chunk tag from tail.0/1-1648437765.293384250.flb, removing it from the queue
[2022/04/01 15:21:13] [error] [storage backlog] could not retrieve chunk tag from tail.0/1-1648437871.782773009.flb, removing it from the queue
[2022/04/01 15:21:27] [engine] caught signal (SIGSEGV)
#0  0x557806733acf in cio_chunk_is_locked() at lib/chunkio/src/cio_chunk.c:375
#1  0x55780645649d in input_chunk_get() at src/flb_input_chunk.c:1017
#2  0x557806456cd5 in flb_input_chunk_append_raw() at src/flb_input_chunk.c:1301
#3  0x55780647a4e3 in in_emitter_add_record() at plugins/in_emitter/emitter.c:117
#4  0x5578065260be in process_record() at plugins/filter_rewrite_tag/rewrite_tag.c:355
#5  0x557806526226 in cb_rewrite_tag_filter() at plugins/filter_rewrite_tag/rewrite_tag.c:416
#6  0x557806422bff in flb_filter_do() at src/flb_filter.c:125
#7  0x557806456fac in flb_input_chunk_append_raw() at src/flb_input_chunk.c:1366
#8  0x5578064817ee in ml_flush_callback() at plugins/in_tail/tail_file.c:834
#9  0x55780644367d in flb_ml_flush_stream_group() at src/multiline/flb_ml.c:1016
#10 0x557806441e34 in package_content() at src/multiline/flb_ml.c:300
#11 0x5578064423bb in process_append() at src/multiline/flb_ml.c:495
#12 0x557806442606 in ml_append_try_parser() at src/multiline/flb_ml.c:564
#13 0x55780644280e in flb_ml_append() at src/multiline/flb_ml.c:630
#14 0x55780648099c in process_content() at plugins/in_tail/tail_file.c:431
#15 0x557806483097 in flb_tail_file_chunk() at plugins/in_tail/tail_file.c:1310
#16 0x55780647b18c in in_tail_collect_event() at plugins/in_tail/tail.c:311
#17 0x55780647c36b in tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:268
#18 0x5578064226a0 in flb_input_collector_fd() at src/flb_input.c:1110
#19 0x5578064354b8 in flb_engine_handle_event() at src/flb_engine.c:411
#20 0x5578064354b8 in flb_engine_start() at src/flb_engine.c:696
#21 0x557806416f58 in flb_lib_worker() at src/flb_lib.c:627
#22 0x7f0faa46fea6 in ???() at ???:0
#23 0x7f0fa9d71dee in ???() at ???:0
#24 0xffffffffffffffff in ???() at ???:0
FWIW, disabling all of the storage.* and Emitter_Storage.* options in my configuration file made the segfault go away.
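Concretely, the working variant was the config above minus these lines (with them gone, both the input and the emitter fall back to the default in-memory storage type):

# removed from [SERVICE]
#     storage.path /mnt/flb-storage/
# removed from the tail [INPUT]
#     storage.type filesystem
# removed from the rewrite_tag [FILTER]
#     Emitter_Storage.type filesystem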
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
Any update @patrick-stephens / @EKwongChum?
Not really: both versions listed in this issue are quite old now, so I would encourage you to try the latest to see whether the issue is already resolved. I don't believe there will be any further releases of 1.7 or 1.8.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.
Version: 1.7.5. Environment: Ubuntu (containerized, on k8s).
We have high load in production and multiple tail files + rewrite_tag filters in our pipeline. When new log files appear over time in the k8s cluster, we restart Fluent Bit (a SIGTERM call followed by new process creation).
I can see that Fluent Bit crashed sporadically.
Following is the GDB backtrace observed in one of the crash dump files:
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007fe965090921 in __GI_abort () at abort.c:79
#2  0x0000000000436a72 in flb_signal_handler (signal=11) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/src/fluent-bit.c:514
#3  <signal handler called>
#4  0x000000000072e3ee in cio_chunk_is_locked (ch=0x36) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/lib/chunkio/src/cio_chunk.c:343
#5  0x0000000000478b7c in input_chunk_get (tag=0x7fe960446f30 "klog", tag_len=4, in=0x7fe9603f2a80, chunk_size=368, set_down=0x7fe96504df08) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/src/flb_input_chunk.c:630
#6  0x0000000000479121 in flb_input_chunk_append_raw (in=0x7fe9603f2a80, tag=0x7fe960446f30 "klog", tag_len=4, buf=0x7fe96001b9d0, buf_size=368) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/src/flb_input_chunk.c:865
#7  0x000000000048cf47 in in_emitter_add_record (tag=0x7fe960446680 "klog", tag_len=4, buf_data=0x7fe96624901f <error: Cannot access memory at address 0x7fe96624901f>, buf_size=368, in=0x7fe9603f2a80) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/plugins/in_emitter/emitter.c:117
#8  0x0000000000523635 in process_record (tag=0x7fe960452b90 "kubelet", tag_len=7, map=..., buf=0x7fe96624901f, buf_size=368, keep=0x7fe96504e160, ctx=0x7fe9603f2270) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/plugins/filter_rewrite_tag/rewrite_tag.c:324
#9  0x000000000052378b in cb_rewrite_tag_filter (data=0x7fe96624901f, bytes=368, tag=0x7fe960452b90 "kubelet", tag_len=7, out_buf=0x7fe96504e1f8, out_bytes=0x7fe96504e1e8, f_ins=0x1320a40, filter_context=0x7fe9603f2270, config=0x128e290)
    at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/plugins/filter_rewrite_tag/rewrite_tag.c:375
#10 0x000000000044cc0c in flb_filter_do (ic=0x7fe96043ee70, data=0x7fe960018ce0, bytes=371, tag=0x7fe96043ef00 "kubelet", tag_len=7, config=0x128e290) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/src/flb_filter.c:118
#11 0x00000000004792ee in flb_input_chunk_append_raw (in=0x7fe960404f50, tag=0x7fe96043ef00 "kubelet", tag_len=7, buf=0x7fe960018ce0, buf_size=371) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/src/flb_input_chunk.c:911
#12 0x000000000048cf47 in in_emitter_add_record (tag=0x7fe9604520c0 "kubelet", tag_len=7, buf_data=0x7fe96045c5d0 "\222\327", buf_size=371, in=0x7fe960404f50) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/plugins/in_emitter/emitter.c:117
#13 0x0000000000523635 in process_record (tag=0x7fe96042ede0 "syslog", tag_len=6, map=..., buf=0x7fe96045c5d0, buf_size=371, keep=0x7fe96504e510, ctx=0x7fe960404740) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/plugins/filter_rewrite_tag/rewrite_tag.c:324
#14 0x000000000052378b in cb_rewrite_tag_filter (data=0x7fe96045c5d0, bytes=371, tag=0x7fe96042ede0 "syslog", tag_len=6, out_buf=0x7fe96504e5a8, out_bytes=0x7fe96504e598, f_ins=0x1322050, filter_context=0x7fe960404740, config=0x128e290)
    at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/plugins/filter_rewrite_tag/rewrite_tag.c:375
#15 0x000000000044cc0c in flb_filter_do (ic=0x7fe9604519d0, data=0x7fe960013630, bytes=177, tag=0x7fe96029a320 "syslog", tag_len=6, config=0x128e290) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/src/flb_filter.c:118
#16 0x00000000004792ee in flb_input_chunk_append_raw (in=0x12c30e0, tag=0x7fe96029a320 "syslog", tag_len=6, buf=0x7fe960013630, buf_size=177) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/src/flb_input_chunk.c:911
#17 0x0000000000491487 in process_content (file=0x7fe9602a11b0, bytes=0x7fe96504e858) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/plugins/in_tail/tail_file.c:367
#18 0x000000000049316b in flb_tail_file_chunk (file=0x7fe9602a11b0) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/plugins/in_tail/tail_file.c:994
#19 0x000000000048d9ba in in_tail_collect_event (file=0x7fe9602a11b0, config=0x128e290) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/plugins/in_tail/tail.c:261
#20 0x0000000000498277 in tail_fs_event (ins=0x12c30e0, config=0x128e290, in_context=0x7fe96029dec0) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/plugins/in_tail/tail_fs_inotify.c:268
#21 0x000000000044c6cd in flb_input_collector_fd (fd=215, config=0x128e290) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/src/flb_input.c:1004
#22 0x000000000045c5d8 in flb_engine_handle_event (config=0x128e290, mask=1, fd=215) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/src/flb_engine.c:363
#23 flb_engine_start (config=0x128e290) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/src/flb_engine.c:624
#24 0x00000000004422db in flb_lib_worker (data=0x128e260) at /home/ec2-user/sabari/fb_v_1.7.5/fluent-bit/src/flb_lib.c:493
#25 0x00007fe965e0a6db in start_thread (arg=0x7fe96504f700) at pthread_create.c:463
Note: we never ran into this scenario when we did not use filesystem-based storage.
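For what it's worth, the tags visible in the frames (a syslog record re-emitted as kubelet, then re-emitted again as klog) point to chained rewrite_tag filters with filesystem-backed emitters. A minimal sketch of that shape, with purely hypothetical paths and rules, looks like this:

# hypothetical reconstruction from the backtrace, not our actual config
[INPUT]
    Name         tail
    Tag          syslog
    Path         /var/log/syslog
    storage.type filesystem

# first rewrite_tag: re-emits matching syslog records as "kubelet"
[FILTER]
    Name                 rewrite_tag
    Match                syslog
    Rule                 $ident ^kubelet$ kubelet false
    Emitter_Storage.type filesystem

# second rewrite_tag: re-emits kubelet records as "klog"
[FILTER]
    Name                 rewrite_tag
    Match                kubelet
    Rule                 $message ^.*$ klog false
    Emitter_Storage.type filesystem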