This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.
I got a similar issue:
[2021/11/14 18:32:22] [engine] caught signal (SIGSEGV)
#0 0x55585d8f2fa1 in destroy_chunk_list() at lib/msgpack-c/src/zone.c:39
#1 0x55585d8f336f in msgpack_zone_destroy() at lib/msgpack-c/src/zone.c:175
#2 0x55585d8f3492 in msgpack_zone_free() at lib/msgpack-c/src/zone.c:220
#3 0x55585d8f0ac2 in msgpack_unpacked_destroy() at lib/msgpack-c/include/msgpack/unpack.h:260
#4 0x55585d8f2e02 in msgpack_unpack_next() at lib/msgpack-c/src/unpack.c:661
#5 0x55585d4a1392 in elasticsearch_format() at plugins/out_es/es.c:358
#6 0x55585d4a20f2 in cb_es_flush() at plugins/out_es/es.c:772
#7 0x55585d41427c in output_pre_cb_flush() at include/fluent-bit/flb_output.h:514
#8 0x55585d911046 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#9 0x7ff320a13234 in ???() at ???:0
or
[2021/11/16 23:07:04] [debug] [input:tail:tail.1] 0 new files found on path '/var/log/syslog'
[2021/11/16 23:07:04] [debug] [input:tail:tail.0] inode=40894548 events: IN_MODIFY
[2021/11/16 23:07:05] [debug] [input:tail:tail.0] inode=40894626 events: IN_MODIFY
[2021/11/16 23:07:05] [debug] [output:es:es.0] HTTP Status=400 URI=/_bulk
[2021/11/16 23:07:05] [error] [output:es:es.0] HTTP status=400 URI=/_bulk, response:
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Malformed action/metadata line [3], expected START_OBJECT or END_OBJECT but found [VALUE_STRING]"}],"type":"illegal_argument_exception","reason":"Malformed action/metadata line [3], expected START_OBJECT or END_OBJECT but found [VALUE_STRING]"},"status":400}
[2021/11/16 23:07:05] [debug] [upstream] KA connection #120 to elasticsearch-master:9200 is now available
[2021/11/16 23:07:05] [debug] [input:tail:tail.0] inode=40894548 events: IN_MODIFY
[2021/11/16 23:07:05] [debug] [http_client] not using http_proxy for header
[2021/11/16 23:07:05] [debug] [http_client] not using http_proxy for header
[2021/11/16 23:07:06] [debug] [out_es] converted_size is 0
[2021/11/16 23:07:06] [engine] caught signal (SIGSEGV)
#0 0x7f75158f9701 in ???() at ???:0
or
[2021/11/25 11:16:36] [error] [output:es:es.0] HTTP status=400 URI=/_bulk, response:
{"error":{"root_cause":[{"type":"json_parse_exception","reason":"Illegal character ((CTRL-CHAR, code 0)): only regular white space (\\r, \\n, \\t) is allowed between tokens\n at [Source: (ByteArrayInputStream); line: 1, column: 2]"}],"type":"json_parse_exception","reason":"Illegal character ((CTRL-CHAR, code 0)): only regular white space (\\r, \\n, \\t) is allowed between tokens\n at [Source: (ByteArrayInputStream); line: 1, column: 2]"},"status":400}
[2021/11/25 11:16:36] [ warn] [engine] failed to flush chunk '1-1637838990.757540396.flb', retry in 6 seconds: task_id=0, input=tail.0 > output=es.0 (out_id=0)
[2021/11/25 11:16:47] [engine] caught signal (SIGSEGV)
#0 0x55f0c1dd9861 in cio_chunk_is_up() at lib/chunkio/src/cio_chunk.c:474
#1 0x55f0c1b047af in flb_input_chunk_flush() at src/flb_input_chunk.c:1456
#2 0x55f0c1ae3d4b in flb_engine_dispatch() at src/flb_engine_dispatch.c:243
#3 0x55f0c1ae1e25 in flb_engine_flush() at src/flb_engine.c:115
#4 0x55f0c1ae31d1 in flb_engine_handle_event() at src/flb_engine.c:396
#5 0x55f0c1ae31d1 in flb_engine_start() at src/flb_engine.c:696
#6 0x55f0c1ac581e in flb_lib_worker() at src/flb_lib.c:627
#7 0x7f10f3fbefa2 in ???() at ???:0
#8 0x7f10f369f4ce in ???() at ???:0
#9 0xffffffffffffffff in ???() at ???:0
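For context: Elasticsearch's /_bulk endpoint expects newline-delimited JSON, alternating an action/metadata line and a document line, so a NUL byte (CTRL-CHAR, code 0) anywhere in the payload produces exactly the json_parse_exception above. A minimal sketch of a well-formed payload (made-up index and values):
{"create":{"_index":"logstash-2021.11.25","_id":"example-id-1"}}
{"@timestamp":"2021-11-25T11:16:36.000Z","log":"first record"}
{"create":{"_index":"logstash-2021.11.25","_id":"example-id-2"}}
{"@timestamp":"2021-11-25T11:16:37.000Z","log":"second record"}
The 400s therefore suggest the request body is being corrupted before it is sent.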
Is it possible a parsing issue is creating a segfault?
"Is it possible a parsing issue is creating a segfault?"
I think so. I am currently experimenting with adding "more and more" functionality back into my parser.
Here's the parser line that I was using when I see the segfaults:
[PARSER]
Name elasticsearch_multiline
Format regex
Regex ^*\\"type\\": \\"(?<type>server)\\", \\"timestamp\\": \\"(?<time>[^\\]*)\\", \\"level\\": \\"(?<severity>[^\\]*)\\", \\"component\\": \\"(?<component>[^\\]*)\\", \\"cluster.name\\": \\"(?<cluster_name>[^\\]*)\\", \\"node.name\\": \\"(?<node_name>[^\\]*)\\", \\"message\\": \\"(?<message>[^\\]*)\\", \\"cluster.uuid\\": \\"(?<cluster_uuid>[^\\]*)\\", \\"node.id\\": \\"(?<node_id>[^\\]*)\\"[,]?[\s]?(?<stacktrace>[^\`]*$)
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S,%L%z
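For reference, that regex targets the docker-escaped form of Elasticsearch's JSON server logs; a made-up example of the kind of line it is aimed at:
{\"type\": \"server\", \"timestamp\": \"2021-11-14T18:32:22,123Z\", \"level\": \"INFO\", \"component\": \"o.e.n.Node\", \"cluster.name\": \"es\", \"node.name\": \"es-0\", \"message\": \"started\", \"cluster.uuid\": \"abc123\", \"node.id\": \"xyz789\"}
(Whether the leading ^* behaves as intended is another question: a quantifier right after ^ has nothing to repeat, and regex engines handle that inconsistently.)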
I did a test replacing the elasticsearch output with a simple stdout output, and I no longer get a segfault. elasticsearch output:
[OUTPUT]
Name es
Match *
Host elasticsearch-master
Port 9200
tls Off
tls.verify Off
Generate_ID On
Logstash_Format On
Time_Key_Format %Y-%m-%dT%H:%M:%S
Replace_Dots On
Time_Key ts
Retry_Limit False
Buffer_Size 64KB
stdout used:
[OUTPUT]
Name stdout
Match *
@bstolz can you try on your side with stdout, to confirm it could be an issue with the elasticsearch output?
Ahh you're using the elasticsearch exporter - I am exporting to stackdriver and getting the same. I can try standard out on a non-production environment.
I upgraded to 1.8.10 and tried both the stdout and es outputs. Here is what I have from logs fluent-bit-xjmdk:
Fluent Bit v1.8.10
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
Invalid configuration value at FILTER.Name
[2021/11/30 09:07:34] [ info] [engine] started (pid=1)
[2021/11/30 09:07:34] [ info] [storage] version=1.1.5, initializing...
[2021/11/30 09:07:34] [ info] [storage] in-memory
[2021/11/30 09:07:34] [ info] [storage] full synchronization mode, checksum enabled, max_chunks_up=128
[2021/11/30 09:07:34] [ info] [cmetrics] version=0.2.2
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] multiline core started
[2021/11/30 09:07:34] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2021/11/30 09:07:34] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2021/11/30 09:07:34] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2021/11/30 09:07:34] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
[2021/11/30 09:07:34] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2021/11/30 09:07:34] [ info] [sp] stream processor started
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18224257 watch_fd=3 name=/var/log/containers/cilium-mlvm6_kube-system_cilium-agent-bdfb18631cb184400d7045bcf03e01eb592c8737af9963dead8eee41c6cb3aea.log
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18224846 watch_fd=4 name=/var/log/containers/cilium-mlvm6_kube-system_cilium-agent-ee021432c3c584782aba98c77d8d6d90fca3be851e42403b68fb9911926829c1.log
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18224845 watch_fd=5 name=/var/log/containers/cilium-mlvm6_kube-system_clean-cilium-state-2ad64fcca13c2b12be4b85b06118e09f82dfe372ec39dd02cd214be83951ccdc.log
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18224823 watch_fd=6 name=/var/log/containers/cilium-mlvm6_kube-system_mount-cgroup-7e466ef8836bece0943fa6a65d9b2c9426f23c386b779cf3f4325ab7bc81ca02.log
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18224838 watch_fd=7 name=/var/log/containers/cilium-mlvm6_kube-system_wait-for-node-init-2c1a981f06c37c999741ac23a609afdde95f05d80880d296a2935a34f4044887.log
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18220129 watch_fd=8 name=/var/log/containers/cilium-node-init-5cm7k_kube-system_node-init-26156fdfe5ef2b8a28e4f497bf5ae796c9fd54847979f4ceec5e7842158e13f3.log
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18224480 watch_fd=9 name=/var/log/containers/cilium-node-init-5cm7k_kube-system_node-init-d760121abf51c6dde80769dddf2c54dee2342588fbc83974ea7a0ce2c1a0812c.log
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18354769 watch_fd=11 name=/var/log/containers/fluent-bit-xjmdk_monitoring_fluent-bit-b731b932d1f4455c1bc1267337a799e3d09a6388774c27362ce1d0281ffc2338.log
...
[2021/11/30 09:07:41] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18354768 watch_fd=47 name=/var/log/containers/fluent-bit-xjmdk_monitoring_fluent-bit-f004961b80d7ac0bc5a58506270837eb5e34350abd0250478afee6925727283d.log
[2021/11/30 09:07:34] [ info] [input:tail:tail.1] inotify_fs_add(): inode=18089658 watch_fd=1 name=/var/log/syslog
[2021/11/30 09:07:41] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:41] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:42] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:42] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:42] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:42] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:42] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:42] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:42] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:42] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:43] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:43] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
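Assuming the es output is still configured with Buffer_Size 64KB as shown earlier, these warnings line up with that cap: the HTTP client refuses to grow its response buffer past 64000 bytes, so a ~96 KB Elasticsearch reply gets truncated. Truncation alone should not crash, but it does show the bulk responses are unusually large, likely full of per-item errors.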
[2021/11/30 09:08:38] [engine] caught signal (SIGSEGV)
#0 0x5566019fcb51 in destroy_chunk_list() at lib/msgpack-c/src/zone.c:39
#1 0x5566019fcf1f in msgpack_zone_destroy() at lib/msgpack-c/src/zone.c:175
#2 0x5566019fd042 in msgpack_zone_free() at lib/msgpack-c/src/zone.c:220
#3 0x5566019fa672 in msgpack_unpacked_destroy() at lib/msgpack-c/include/msgpack/unpack.h:260
#4 0x5566019fc9b2 in msgpack_unpack_next() at lib/msgpack-c/src/unpack.c:661
#5 0x5566015aaa88 in elasticsearch_format() at plugins/out_es/es.c:358
#6 0x5566015ab7e8 in cb_es_flush() at plugins/out_es/es.c:772
#7 0x55660151d28e in output_pre_cb_flush() at include/fluent-bit/flb_output.h:514
#8 0x556601a1ac06 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#9 0x7ffbf3c13234 in ???() at ???:0
First time I saw this line: Invalid configuration value at FILTER.Name
lol, but I think it is not related to the issue.
UPDATE: fixed the Invalid configuration value at FILTER.Name error,
and I still have the segfault:
[2021/11/30 09:54:38] [engine] caught signal (SIGSEGV)
#0 0x7f248b8e0820 in ???() at ???:0
#1 0x5571447fc2cd in msgpack_sbuffer_write() at lib/msgpack-c/include/msgpack/sbuffer.h:88
#2 0x5571447fc09f in msgpack_pack_str_body() at lib/msgpack-c/include/msgpack/pack_template.h:784
#3 0x5571447fe0cb in flb_ml_flush_stream_group() at src/multiline/flb_ml.c:981
#4 0x5571447fc5d1 in flb_ml_flush_parser_instance() at src/multiline/flb_ml.c:117
#5 0x5571447fd3bd in flb_ml_append() at src/multiline/flb_ml.c:617
#6 0x55714483a4c3 in process_content() at plugins/in_tail/tail_file.c:375
#7 0x55714483c9f1 in flb_tail_file_chunk() at plugins/in_tail/tail_file.c:1210
#8 0x557144834ea0 in in_tail_collect_event() at plugins/in_tail/tail.c:261
#9 0x557144836079 in tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:268
#10 0x5571447ddc16 in flb_input_collector_fd() at src/flb_input.c:1108
#11 0x5571447f03fa in flb_engine_handle_event() at src/flb_engine.c:411
#12 0x5571447f03fa in flb_engine_start() at src/flb_engine.c:696
#13 0x5571447d281e in flb_lib_worker() at src/flb_lib.c:627
#14 0x7f248c19cfa2 in ???() at ???:0
#15 0x7f248b87d4ce in ???() at ???:0
#16 0xffffffffffffffff in ???() at ???:0
As mentioned previously, I am logging to stackdriver and upon changing it to stdout I still see the same issue. I'm slowly removing some of the "extra" lines I have in the input block - multiline, skip long lines, and reduction of the buffer_max_size to see how it goes. Then I will set all of the buffers to default. After that, I will start investigating the regex.
thanks for the info, would you please provide a simple repro case so we can troubleshoot?
I am observing this running on all of our elasticsearch 7 clusters (ECK) in GKE - it happens a few times in the period of 24hrs. I am using the same configmap for other clusters not running elasticsearch and I do not see the issue.
I am not sure how to provide a "simple" repro case given an elasticsearch cluster is involved. I've included the relevant bits of the configmap. Let me know how else I can provide what you need. I can also extend an offer to help, as I do have a demo environment.
[INPUT]
Name tail
Path /var/log/containers/*_elasticsearch*.log
Exclude_Path /var/log/containers/*_kube-system_*.log,/var/log/containers/*_istio-system_*.log,/var/log/containers/*_knative-serving_*.log,/var/log/containers/*_gke-system_*.log,/var/log/containers/*_config-management-system_*.log,*_istio-proxy*.log,/var/log/containers/*_dd-agent*.log, /var/log/containers/*_fluent-bit-custom*.log
Tag kube.elasticsearch.*
DB /var/log/flb_elasticsearch.db
Read_from_head true
Multiline off
Parser_Firstline elasticsearch_multiline
Mem_Buf_Limit 20MB
Refresh_Interval 10
Buffer_Chunk_Size 256k
Buffer_Max_Size 1024k
[FILTER]
Name parser
Match kube.*
Key_Name log
Reserve_Data True
Parser docker
[FILTER]
Name modify
Match *
Hard_rename log message
[FILTER]
Name parser
Match kube.*
Key_Name message
Reserve_Data True
Parser glog
Parser json
[OUTPUT]
Name stackdriver
Match kube.*
resource k8s_container
k8s_cluster_name <redacted>
k8s_cluster_location <redacted>
severity_key severity
[PARSER]
Name elasticsearch_multiline
Format regex
Regex ^*\\"type\\": \\"(?<type>server)\\", \\"timestamp\\": \\"(?<time>[^\\]*)\\", \\"level\\": \\"(?<severity>[^\\]*)\\", \\"component\\": \\"(?<component>[^\\]*)\\", \\"cluster.name\\": \\"(?<cluster_name>[^\\]*)\\", \\"node.name\\": \\"(?<node_name>[^\\]*)\\", \\"message\\": \\"(?<message>[^\\]*)\\", \\"cluster.uuid\\": \\"(?<cluster_uuid>[^\\]*)\\", \\"node.id\\": \\"(?<node_id>[^\\]*)\\"[,]?[\s]?(?<stacktrace>[^\`]*$)
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S,%L%z
[PARSER]
Name json
Format json
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
[PARSER]
Name glog
Format regex
Regex ^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source_file>[^ \]]+)\:(?<source_line>\d+)\]\s(?<message>.*)$
Time_Key time
Time_Format %m%d %H:%M:%S.%L%z
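For reference, the glog parser above targets klog-style lines; a made-up example it would match:
I1130 09:07:34.123456    1234 controller.go:42] Reconciling deployment foo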
I did a new install of the latest chart for elasticsearch (docker image: docker.elastic.co/elasticsearch/elasticsearch:7.15.0).
I cleaned my configuration to keep fewer parsers/filters, and it still crashes with this configuration:
[SERVICE]
Config_Watch On
# Flush
# =====
# set an interval of seconds before to flush records to a destination
flush 1
# Daemon
# ======
# instruct Fluent Bit to run in foreground or background mode.
daemon Off
# Log_Level
# =========
# Set the verbosity level of the service, values can be:
#
# - error
# - warning
# - info
# - debug
# - trace
#
# by default 'info' is set, that means it includes 'error' and 'warning'.
log_level info
# Parsers File
# ============
# specify an optional 'Parsers' configuration file
parsers_file parsers.conf
parsers_file custom_parsers.conf
# HTTP Server
# ===========
# Enable/Disable the built-in HTTP Server for metrics
http_server On
http_listen 0.0.0.0
http_port 2020
# Storage
# =======
storage.sync full
storage.checksum on
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/*.log
# Except fluent-bit to avoid feedback loop
Exclude_Path ["*fluent-bit*.log"]
DB /var/log/fluentbit/flb_kube.db
Mem_Buf_Limit 64MB
Skip_Long_Lines On
Refresh_Interval 10
# Parser for containerd (Container Runtime Interface)
multiline.parser cri
[INPUT]
Name tail
Tag system.*
Path /var/log/syslog
Parser rsyslog
DB /var/log/fluentbit/flb_syslog.db
Mem_Buf_Limit 64MB
Skip_Long_Lines On
Refresh_Interval 10
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
Merge_Log On
Keep_Log Off
K8S-Logging.Parser On
K8S-Logging.Exclude Off
[OUTPUT]
Name es
Match *
Host elasticsearch-master
Port 9200
tls Off
tls.verify Off
Generate_ID On
Logstash_Format On
Time_Key_Format %Y-%m-%dT%H:%M:%S
Replace_Dots On
Time_Key ts
Retry_Limit False
Buffer_Size 64KB
[PARSER]
Name rsyslog
Format regex
Regex (?<syslog_time>^[A-Za-z]+ [ 1-3][0-9] [0-9]{2}:[0-9]{2}:[0-9]{2}) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<log>.*)$
Time_Key syslog_time
Time_Format %b %d %H:%M:%S
Time_Keep On
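(For reference, the rsyslog parser above matches classic syslog lines such as this made-up example: Nov 30 16:56:31 myhost systemd[1]: Started Daily apt upgrade and clean activities.)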
log:
Fluent Bit v1.8.10
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2021/11/30 16:56:31] [ info] [engine] started (pid=1)
[2021/11/30 16:56:31] [ info] [storage] version=1.1.5, initializing...
[2021/11/30 16:56:31] [ info] [storage] in-memory
[2021/11/30 16:56:31] [ info] [storage] full synchronization mode, checksum enabled, max_chunks_up=128
[2021/11/30 16:56:31] [ info] [cmetrics] version=0.2.2
[2021/11/30 16:56:31] [ info] [input:tail:tail.0] multiline core started
[2021/11/30 16:56:31] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2021/11/30 16:56:31] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2021/11/30 16:56:31] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2021/11/30 16:56:31] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
[2021/11/30 16:56:31] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2021/11/30 16:56:31] [ info] [sp] stream processor started
[2021/11/30 16:56:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=56885569 watch_fd=2 name=/var/log/containers/cilium-kvwhq_kube-system_cilium-agent-9f584588e4be9ae6ee66cfa45ad7458d2d891b6fb680d2f18bd2910f62bd142d.log
[2021/11/30 16:56:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=56885566 watch_fd=3 name=/var/log/containers/cilium-kvwhq_kube-system_clean-cilium-state-4949ccf208b74e3c36949068cec5bf4a8f22308e6ddf32a146294ee0940c2fe1.log
[2021/11/30 16:56:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=56885535 watch_fd=4 name=/var/log/containers/cilium-kvwhq_kube-system_mount-cgroup-b2939032ba97cc068dce2ac616d6a4929faff24f7304231ce62a4a1a5c24c1b3.log
[2021/11/30 16:56:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=56885563 watch_fd=5 name=/var/log/containers/cilium-kvwhq_kube-system_wait-for-node-init-c5d20e3c171339e307120e69fa64baa706b4d83a1514470666ad184ae1542ad8.log
[2021/11/30 16:56:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=56885560 watch_fd=6 name=/var/log/containers/cilium-node-init-2v4n6_kube-system_node-init-d5fab00554f962d3cea541308f6358e850638d0643f923ef5347c8c28a92c757.log
[2021/11/30 16:56:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=56885364 watch_fd=7 name=/var/log/containers/fluent-bit-9h2rs_monitoring_fluent-bit-ed4d4dc0f608994a6bcb613bc49a121b2fd74d4aebb5cace4fc6f3f316b6a660.log
...
[2021/11/30 16:56:31] [ info] [input:tail:tail.1] inotify_fs_add(): inode=56885366 watch_fd=1 name=/var/log/syslog
[2021/11/30 16:58:24] [engine] caught signal (SIGSEGV)
#0 0x55ff5827cb51 in destroy_chunk_list() at lib/msgpack-c/src/zone.c:39
#1 0x55ff5827cf1f in msgpack_zone_destroy() at lib/msgpack-c/src/zone.c:175
#2 0x55ff5827d042 in msgpack_zone_free() at lib/msgpack-c/src/zone.c:220
#3 0x55ff5827a672 in msgpack_unpacked_destroy() at lib/msgpack-c/include/msgpack/unpack.h:260
#4 0x55ff5827c9b2 in msgpack_unpack_next() at lib/msgpack-c/src/unpack.c:661
#5 0x55ff57e2aa88 in elasticsearch_format() at plugins/out_es/es.c:358
#6 0x55ff57e2b7e8 in cb_es_flush() at plugins/out_es/es.c:772
#7 0x55ff57d9d28e in output_pre_cb_flush() at include/fluent-bit/flb_output.h:514
#8 0x55ff5829ac06 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#9 0x7f19e864c04d in ???() at ???:0
Another log after the pod crashed:
[2021/11/30 17:02:06] [error] [output:es:es.0] HTTP status=400 URI=/_bulk, response:
{"error":{"root_cause":[{"type":"json_parse_exception","reason":"Illegal character ((CTRL-CHAR, code 0)): only regular white space (\\r, \\n, \\t) is allowed between tokens\n at [Source: (ByteArrayInputStream); line: 1, column: 2]"}],"type":"json_parse_exception","reason":"Illegal character ((CTRL-CHAR, code 0)): only regular white space (\\r, \\n, \\t) is allowed between tokens\n at [Source: (ByteArrayInputStream); line: 1, column: 2]"},"status":400}
[2021/11/30 17:02:06] [ warn] [engine] failed to flush chunk '\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u00�*ec', retry in 9 seconds: task_id=0, input=tail.0 > output=es.0 (out_id=0)
[2021/11/30 17:02:07] [ warn] [engine] failed to flush chunk '1-1638291726.807180974.flb', retry in 11 seconds: task_id=3, input=tail.0 > output=es.0 (out_id=0)
[2021/11/30 17:02:17] [error] [output:es:es.0] HTTP status=400 URI=/_bulk, response:
{"error":{"root_cause":[{"type":"json_parse_exception","reason":"Illegal character ((CTRL-CHAR, code 0)): only regular white space (\\r, \\n, \\t) is allowed between tokens\n at [Source: (ByteArrayInputStream); line: 1, column: 2]"}],"type":"json_parse_exception","reason":"Illegal character ((CTRL-CHAR, code 0)): only regular white space (\\r, \\n, \\t) is allowed between tokens\n at [Source: (ByteArrayInputStream); line: 1, column: 2]"},"status":400}
[2021/11/30 17:02:17] [ warn] [engine] failed to flush chunk '00\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u
', retry in 16 seconds: task_id=0, input=tail.0 > output=es.0 (out_id=0)
[2021/11/30 17:02:18] [ warn] [engine] failed to flush chunk '1-1638291726.807180974.flb', retry in 17 seconds: task_id=3, input=tail.0 > output=es.0 (out_id=0)
[2021/11/30 17:02:36] [ warn] [engine] failed to flush chunk '1-1638291726.807180974.flb', retry in 8 seconds: task_id=3, input=tail.0 > output=es.0 (out_id=0)
[2021/11/30 17:02:43] [ warn] [engine] failed to flush chunk '1-1638291726.807180974.flb', retry in 58 seconds: task_id=3, input=tail.0 > output=es.0 (out_id=0)
[2021/11/30 17:03:22] [engine] caught signal (SIGSEGV)
#0 0x7f6316a0c7b0 in ???() at ???:0
#1 0x55600e56093a in flb_sds_cat() at src/flb_sds.c:132
#2 0x55600e5609a4 in flb_sds_cat_safe() at src/flb_sds.c:145
#3 0x55600e583b82 in package_content() at src/multiline/flb_ml.c:324
#4 0x55600e584030 in process_append() at src/multiline/flb_ml.c:495
#5 0x55600e58424e in ml_append_try_parser() at src/multiline/flb_ml.c:564
#6 0x55600e584456 in flb_ml_append() at src/multiline/flb_ml.c:630
#7 0x55600e5c14c3 in process_content() at plugins/in_tail/tail_file.c:375
#8 0x55600e5c39f1 in flb_tail_file_chunk() at plugins/in_tail/tail_file.c:1210
#9 0x55600e5bbea0 in in_tail_collect_event() at plugins/in_tail/tail.c:261
#10 0x55600e5bd079 in tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:268
#11 0x55600e564c16 in flb_input_collector_fd() at src/flb_input.c:1108
#12 0x55600e5773fa in flb_engine_handle_event() at src/flb_engine.c:411
#13 0x55600e5773fa in flb_engine_start() at src/flb_engine.c:696
#14 0x55600e55981e in flb_lib_worker() at src/flb_lib.c:627
#15 0x7f63172c8fa2 in ???() at ???:0
#16 0x7f63169a94ce in ???() at ???:0
#17 0xffffffffffffffff in ???() at ???:0
Full log with trace enabled: fluent-bit-9h2rs.log.zip
I saw the CPU usage is high; I could allow more CPU, or remove checksum and full sync.
I added
storage.sync full
storage.checksum on
because I read an issue that recommended enabling them when there are too many warnings like: [ warn] [engine] failed to flush chunk '1-1638291872.61468128.flb', retry in 11 seconds: task_id=1, input=tail.0 > output=es.0 (out_id=0)
UPDATE: removed storage.sync full & storage.checksum on, and increased CPU to 2 (usage reaches 1 CPU): no change for the segfault.
I ran valgrind on a debug build of the master branch (9ccc6f89171d6792f9605b77b4efd27bc28f472a): valgrind-fluent-bit-w67h2-2.log
And another run with option --leak-check=full: valgrind-fluent-bit-w67h2-3.log
Both runs crash with the same stack:
#0 0x483c933 in ???() at ???:0
#1 0x2b0116 in es_bulk_append() at plugins/out_es/es_bulk.c:105
#2 0x2ada44 in elasticsearch_format() at plugins/out_es/es.c:537
#3 0x2ae828 in cb_es_flush() at plugins/out_es/es.c:772
#4 0x21f072 in output_pre_cb_flush() at include/fluent-bit/flb_output.h:511
#5 0x71c0c6 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
It seems output_pre_cb_flush() is present in all the SEGV stacks @bstolz and I got (https://github.com/fluent/fluent-bit/blob/master/include/fluent-bit/flb_output.h#L511)
@bstolz I notice you use
Parser_Firstline elasticsearch_multiline
and I am using:
multiline.parser cri
I think there is an issue in the flushing part of the multiline parser. With gdb, I got this backtrace:
#0 __memmove_avx_unaligned_erms ()
at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:384
#1 0x0000555555688943 in msgpack_sbuffer_write (data=0x7ffff61f4280,
buf=0x7ffff4aee010 "0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u00"..., len=1968975920)
at /tmpdir/fluent-bit/lib/msgpack-c/include/msgpack/sbuffer.h:88
#2 0x0000555555688715 in msgpack_pack_str_body (x=0x7ffff61f4270,
b=0x7ffff4aee010, l=1968975920)
at /tmpdir/fluent-bit/lib/msgpack-c/include/msgpack/pack_template.h:784
#3 0x000055555568a8e4 in flb_ml_flush_stream_group (ml_parser=0x7ffff663e070,
mst=0x7ffff543cf50, group=0x7ffff49e4180)
at /tmpdir/fluent-bit/src/multiline/flb_ml.c:1012
#4 0x0000555555688c47 in flb_ml_flush_parser_instance (ml=0x7ffff5452150,
parser_i=0x7ffff5441f00, stream_id=17521209660683586930)
at /tmpdir/fluent-bit/src/multiline/flb_ml.c:117
#5 0x0000555555689a33 in flb_ml_append (ml=0x7ffff5452150,
stream_id=951830318436279800, type=0, tm=0x7ffff61f4490,
buf=0x7ffff5527d40, size=182)
at /tmpdir/fluent-bit/src/multiline/flb_ml.c:617
#6 0x00005555556c7532 in process_content (file=0x7ffff542f080,
bytes=0x7ffff61f45e8) at /tmpdir/fluent-bit/plugins/in_tail/tail_file.c:375
#7 0x00005555556c9a60 in flb_tail_file_chunk (file=0x7ffff542f080)
at /tmpdir/fluent-bit/plugins/in_tail/tail_file.c:1210
#8 0x00005555556c1f0f in in_tail_collect_event (file=0x7ffff542f080,
config=0x7ffff6632f80) at /tmpdir/fluent-bit/plugins/in_tail/tail.c:261
#9 0x00005555556c30e8 in tail_fs_event (ins=0x7ffff6648f00,
config=0x7ffff6632f80, in_context=0x7ffff5456000)
at /tmpdir/fluent-bit/plugins/in_tail/tail_fs_inotify.c:268
#10 0x0000555555669c67 in flb_input_collector_fd (fd=27, config=0x7ffff6632f80)
at /tmpdir/fluent-bit/src/flb_input.c:1108
#11 0x000055555567c8bc in flb_engine_handle_event (config=0x7ffff6632f80,
mask=1, fd=27) at /tmpdir/fluent-bit/src/flb_engine.c:411
#12 flb_engine_start (config=0x7ffff6632f80)
at /tmpdir/fluent-bit/src/flb_engine.c:696
#13 0x000055555565e83f in flb_lib_worker (data=0x7ffff660b340)
at /tmpdir/fluent-bit/src/flb_lib.c:627
#14 0x00007ffff7fb1fa3 in start_thread (arg=<optimized out>)
at pthread_create.c:486
#15 0x00007ffff76924cf in clone ()
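One extra data point from that backtrace: the bogus length 1968975920 is 0x755C3030, whose little-endian bytes are '0', '0', '\', 'u', i.e. a fragment of the \u0000 escape text visible in buf. If that reading is right, the length field itself was overwritten by log content, which points to the same buffer corruption.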
I added one debug log in the es plugin: https://github.com/GuillaumeSmaha/fluent-bit/commit/5b908dd8edb0a6332718ba2e5460892e7929a9a1. That log should never fire, because from what I understand it means that even after resizing, the buffer is still smaller than the required size used in the memcpy at the end of the function: https://github.com/fluent/fluent-bit/blob/3197e975143b3982d5ce214996e3595f6f12bab3/plugins/out_es/es_bulk.c#L105 (a sketch of the pattern follows the gdb output below).
And I got a segv just after:
[2021/12/08 11:55:05] [debug] [out_es] converted_size is 0
[2021/12/08 11:55:05] [debug] [out_es] required size (197133, append: 4096) is still superior to available buffer (37999, previous available: 33903)
Thread 2 "flb-pipeline" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff61ff700 (LWP 45)]
__memmove_avx_unaligned_erms ()
at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:262
262 ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
(gdb) bt
#0 __memmove_avx_unaligned_erms ()
at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:262
#1 0x00005555556fc274 in es_bulk_append (bulk=0x7ffff5400d60,
index=0x7ffff4ac0a40 "{\"create\":{\"_index\":\"logstash-2021.12.08\",\"_type\":\"_doc\",\"_id\":\"a76df76b-d2c7-9d2b-e172-93e0ec3c3d4f\"}}\n", i_len=104,
json=0x7ffff253fad0 "{\"@timestamp\":\"2021-12-08T11:54:07.027Z\",\"time\":\"2021-12-08T11:54:07.027621077Z\",\"stream\":\"stdout\",\"_p\":\"P\",\"log\":\"\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\"..., j_len=196967, whole_size=33903, converted_size=0)
at /tmpdir/fluent-bit/plugins/out_es/es_bulk.c:111
#2 0x00005555556f9af9 in elasticsearch_format (config=0x7ffff6632f80,
ins=0x7ffff6648f00, plugin_context=0x7ffff54a9600, flush_ctx=0x0,
tag=0x7ffff54a2700 "kube.var.log.containers.toolkit-jobs-job-48bfef83-e663-4c37-b55f-50ea1db9a588_toolkit-jobs_toolkit-jobs-job-48bfef83-e663-4c37-b55f-50ea1db9a588-1-db65750898be03ec7ef665854d8036800d9609adeb19a0cc5b021"...,
tag_len=215, data=0x7ffff2341f00, bytes=33903, out_data=0x7ffff4ac0f60,
out_size=0x7ffff4ac0f58) at /tmpdir/fluent-bit/plugins/out_es/es.c:537
#3 0x00005555556fa8c5 in cb_es_flush (data=0x7ffff2341f00, bytes=33903,
tag=0x7ffff54a2700 "kube.var.log.containers.toolkit-jobs-job-48bfef83-e663-4c37-b55f-50ea1db9a588_toolkit-jobs_toolkit-jobs-job-48bfef83-e663-4c37-b55f-50ea1db9a588-1-db65750898be03ec7ef665854d8036800d9609adeb19a0cc5b021"...,
tag_len=215, ins=0x7ffff6648f00, out_context=0x7ffff54a9600,
config=0x7ffff6632f80) at /tmpdir/fluent-bit/plugins/out_es/es.c:772
--Type <RET> for more, q to quit, c to continue without paging--
#4 0x000055555566b2df in output_pre_cb_flush ()
at /tmpdir/fluent-bit/include/fluent-bit/flb_output.h:514
#5 0x0000555555b69d47 in co_init ()
at /tmpdir/fluent-bit/lib/monkey/deps/flb_libco/amd64.c:117
#6 0x0000000000000000 in ?? ()
(gdb)
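To make that failure mode concrete, here is a minimal sketch of the append/grow pattern involved; names and sizes are illustrative, not the actual es_bulk.c code (see the real function linked above):

#include <stdlib.h>
#include <string.h>

struct bulk {
    char   *ptr;   /* buffer */
    size_t  len;   /* bytes used */
    size_t  size;  /* bytes allocated */
};

/* Append one entry to the bulk buffer, growing it on demand. */
static int bulk_append(struct bulk *b, const char *json, size_t j_len)
{
    size_t required = b->len + j_len + 1;

    if (required > b->size) {
        /* Bug pattern: growing by a fixed chunk, or by an estimate based
         * on the wrong size variable, leaves size < required when j_len
         * is large. */
        size_t new_size = b->size + 4096;   /* broken */
        /* size_t new_size = required; */   /* fixed  */
        char *tmp = realloc(b->ptr, new_size);
        if (!tmp) {
            return -1;
        }
        b->ptr  = tmp;
        b->size = new_size;
    }

    /* If size is still < required here, this memcpy writes past the end of
     * the heap allocation, matching the __memmove_avx_unaligned_erms frame
     * above (j_len=196967 vs roughly 38K available). */
    memcpy(b->ptr + b->len, json, j_len);
    b->len += j_len;
    return 0;
}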
@GuillaumeSmaha great digging, thank you.
@bstolz I will create another issue, because the bug found is inside the elasticsearch output plugin. Reverting https://github.com/fluent/fluent-bit/pull/3788 seems to fix my issue (fluent-bit doesn't crash anymore). Maybe there is another issue with the multiline parser (see the backtrace I got here: https://github.com/fluent/fluent-bit/issues/4195#issuecomment-988730077).
I just created https://github.com/fluent/fluent-bit/issues/4412
@bstolz My fix on the es output was merged on master and branch 1.8. I am not sure it will fix your issue, but you can try.
@GuillaumeSmaha wonderful! Thank you!
That was quick ;-) I appreciate you diving in and posting that example.
I have tried fluent-bit versions 1.7.x and 1.8.x and they all crash using the tail input and elastic output with heavy JSON-formatted logs. I tried the change in commit 0a061fb. Fluent-bit was a bit more stable but still crashed. I have gone back to version 1.6.10, which "just runs".
I have noticed another issue with the 1.8.x series: it continually logs errors on writing to elastic, but the "status" element in the elastic response only ever shows status 201, which should mean "document created" and not an error. When I see errors reported with 1.6, I always see a real error status in the response, like a 429 for example. I think the elastic response handling has also been broken.
@fhitchen can you try on the latest 1.8.11 with this change: https://github.com/fluent/fluent-bit/pull/4361 (1.8.11 already includes a change like https://github.com/fluent/fluent-bit/commit/0a061fbb73922dee737bc50837ca7a3fc7b3001c)
So I took the nokute78:fix_4311_es_fix branch and it ran for a few minutes before SEGV-ing. I'm back to 1.6.20.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
Hi, I just built and tested the 1.8.13 fluent-bit release and it crashed after processing for about 5 minutes. Again, 1.6.20 works flawlessly, so something has been badly broken in the elastic plugin's memory management since 1.7. It does not appear to be related to the input data: I have tried reprocessing the same file and it does not crash in the same place. If I replace the output with stdout, it does not crash.
Fluent Bit v1.8.13
* Copyright (C) 2015-2021 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2022/03/21 14:59:12] [engine] caught signal (SIGSEGV)
#0 0x7ffff6647fbe in ???() at ???:0
#1 0x44dde2 in flb_free() at include/fluent-bit/flb_mem.h:122
#2 0x44ef11 in flb_sds_destroy() at src/flb_sds.c:408
#3 0x4e1dfb in elasticsearch_format() at plugins/out_es/es.c:540
#4 0x4e2b6b in cb_es_flush() at plugins/out_es/es.c:772
#5 0x456a08 in output_pre_cb_flush() at include/fluent-bit/flb_output.h:517
#6 0x787905 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
Segmentation fault
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.
Similar or related issue #3585:
$ kubectl get pods -n amazon-cloudwatch
NAME READY STATUS RESTARTS AGE
fluent-bit-8t7fk 0/1 CrashLoopBackOff 6 (2m42s ago) 8m29s
fluent-bit-98w2b 0/1 CrashLoopBackOff 6 (2m34s ago) 8m29s
fluent-bit-chzjh 0/1 CrashLoopBackOff 6 (2m45s ago) 8m29s
fluent-bit-gfs7x 0/1 CrashLoopBackOff 6 (2m32s ago) 8m29s
fluent-bit-szn69 0/1 CrashLoopBackOff 6 (2m38s ago) 8m29s
fluent-bit-wxbkm 0/1 CrashLoopBackOff 6 (2m46s ago) 8m29s
$ kubectl logs fluent-bit-wxbkm -n amazon-cloudwatch
Fluent Bit v1.9.10
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2023/09/06 16:58:31] [ info] [fluent bit] version=1.9.10, commit=a2eaf59628, pid=1
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693618299.930665963.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693618299.938087318.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693655545.974952011.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693681220.89964544.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693681223.693057732.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693723157.518142203.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693731509.684250129.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693744367.53739593.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693771245.718195391.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693864172.706175934.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693866941.88886208.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693914041.952195655.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693972809.563850435.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1694017319.593944428.flb
[2023/09/06 16:58:31] [ info] [storage] version=1.4.0, type=memory+filesystem, sync=normal, checksum=disabled, max_chunks_up=128
[2023/09/06 16:58:31] [ info] [storage] backlog input plugin: storage_backlog.7
[2023/09/06 16:58:31] [ info] [cmetrics] version=0.3.7
[2023/09/06 16:58:31] [ info] [input:tail:tail.0] multiline core started
[2023/09/06 16:58:31] [ info] [input:tail:tail.1] multiline core started
[2023/09/06 16:58:31] [ info] [input:systemd:systemd.2] seek_cursor=s=2866bded7659498a96e0a8ad2beccf0a;i=7bf... OK
[2023/09/06 16:58:31] [ info] [input:tail:tail.3] multiline core started
[2023/09/06 16:58:31] [ info] [input:storage_backlog:storage_backlog.7] queue memory limit: 4.8M
[2023/09/06 16:58:31] [ info] [filter:kubernetes:kubernetes.0] https=1 host=127.0.0.1 port=10250
[2023/09/06 16:58:31] [ info] [filter:kubernetes:kubernetes.0] token updated
[2023/09/06 16:58:31] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2023/09/06 16:58:31] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with Kubelet...
[2023/09/06 16:58:31] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
[2023/09/06 16:58:31] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] worker #0 started
[2023/09/06 16:58:31] [ info] [output:cloudwatch_logs:cloudwatch_logs.1] worker #0 started
[2023/09/06 16:58:31] [ info] [output:cloudwatch_logs:cloudwatch_logs.2] worker #0 started
[2023/09/06 16:58:31] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2023/09/06 16:58:31] [ info] [sp] stream processor started
[2023/09/06 16:58:31] [ info] [input:storage_backlog:storage_backlog.7] register tail.0/1-1694019169.378350213.flb
[2023/09/06 16:58:31] [ info] [input:storage_backlog:storage_backlog.7] register tail.0/1-1694019170.248838268.flb
[2023/09/06 16:58:31] [ info] [input:storage_backlog:storage_backlog.7] register tail.0/1-1694019187.497863650.flb
[2023/09/06 16:58:31] [ info] [input:storage_backlog:storage_backlog.7] register tail.0/1-1694019215.460809884.flb
[2023/09/06 16:58:31] [ info] [input:storage_backlog:storage_backlog.7] register tail.0/1-1694019256.480162041.flb
[2023/09/06 16:58:31] [ info] [input:storage_backlog:storage_backlog.7] register tail.0/1-1694019339.487434952.flb
[2023/09/06 16:58:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=14152341 watch_fd=3 name=/var/log/containers/coredns-6ff9c46cd8-c2h4p_kube-system_coredns-3c90a2ab2fae2141af415dc72026004233e8de1f17edd6bbc2df9e68f607ed92.log
[2023/09/06 16:58:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=14152350 watch_fd=4 name=/var/log/containers/coredns-6ff9c46cd8-c2h4p_kube-system_coredns-4dec2ea945a6ee553ffd5ff906c33358f5bef079cf13809111db2f95cdc082cd.log
[2023/09/06 16:58:31] [engine] caught signal (SIGSEGV)
AWS for Fluent Bit Container Image Version 2.31.12.20230821
$ kubectl describe pod fluent-bit-8t7fk -n amazon-cloudwatch
Name: fluent-bit-8t7fk
Namespace: amazon-cloudwatch
Priority: 0
Service Account: fluent-bit
Node: ip-11-0-156-115.us-west-2.compute.internal/11.0.156.115
Start Time: Wed, 06 Sep 2023 09:52:48 -0700
Labels: controller-revision-hash=6cb6db756b
k8s-app=fluent-bit
kubernetes.io/cluster-service=true
pod-template-generation=1
version=v1
Annotations: <none>
Status: Running
IP: 11.0.156.115
IPs:
IP: 11.0.156.115
Controlled By: DaemonSet/fluent-bit
Containers:
fluent-bit:
Container ID: containerd://69bacf029e31a28f243e051c72d49edf24416173b9e9c88e67d2d393cd775af4
Image: public.ecr.aws/aws-observability/aws-for-fluent-bit:latest
Image ID: public.ecr.aws/aws-observability/aws-for-fluent-bit@sha256:2e52d17a2f34197707dcfe27626cfa45c30ced37999004638768bf533cd7f444
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 139
Started: Wed, 06 Sep 2023 10:14:01 -0700
Finished: Wed, 06 Sep 2023 10:14:01 -0700
Ready: False
Restart Count: 9
Limits:
memory: 200Mi
Requests:
cpu: 500m
memory: 100Mi
Environment:
AWS_REGION: <set to the key 'logs.region' of config map 'fluent-bit-cluster-info'> Optional: false
CLUSTER_NAME: <set to the key 'cluster.name' of config map 'fluent-bit-cluster-info'> Optional: false
HTTP_SERVER: <set to the key 'http.server' of config map 'fluent-bit-cluster-info'> Optional: false
HTTP_PORT: <set to the key 'http.port' of config map 'fluent-bit-cluster-info'> Optional: false
READ_FROM_HEAD: <set to the key 'read.head' of config map 'fluent-bit-cluster-info'> Optional: false
READ_FROM_TAIL: <set to the key 'read.tail' of config map 'fluent-bit-cluster-info'> Optional: false
HOST_NAME: (v1:spec.nodeName)
HOSTNAME: fluent-bit-8t7fk (v1:metadata.name)
CI_VERSION: k8s/1.3.16
Mounts:
/fluent-bit/etc/ from fluent-bit-config (rw)
/run/log/journal from runlogjournal (ro)
/var/fluent-bit/state from fluentbitstate (rw)
/var/lib/docker/containers from varlibdockercontainers (ro)
/var/log from varlog (ro)
/var/log/dmesg from dmesg (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-h2lml (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
fluentbitstate:
Type: HostPath (bare host directory volume)
Path: /var/fluent-bit/state
HostPathType:
varlog:
Type: HostPath (bare host directory volume)
Path: /var/log
HostPathType:
varlibdockercontainers:
Type: HostPath (bare host directory volume)
Path: /var/lib/docker/containers
HostPathType:
fluent-bit-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: fluent-bit-config
Optional: false
runlogjournal:
Type: HostPath (bare host directory volume)
Path: /run/log/journal
HostPathType:
dmesg:
Type: HostPath (bare host directory volume)
Path: /var/log/dmesg
HostPathType:
kube-api-access-h2lml:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: :NoExecute op=Exists
:NoSchedule op=Exists
node-role.kubernetes.io/master:NoSchedule op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 22m default-scheduler Successfully assigned amazon-cloudwatch/fluent-bit-8t7fk to ip-11-0-156-115.us-west-2.compute.internal
Normal Pulled 22m kubelet Successfully pulled image "public.ecr.aws/aws-observability/aws-for-fluent-bit:latest" in 160.524711ms (160.534769ms including waiting)
Normal Pulled 22m kubelet Successfully pulled image "public.ecr.aws/aws-observability/aws-for-fluent-bit:latest" in 180.919554ms (180.968511ms including waiting)
Normal Pulled 22m kubelet Successfully pulled image "public.ecr.aws/aws-observability/aws-for-fluent-bit:latest" in 166.16697ms (166.176944ms including waiting)
Normal Started 21m (x4 over 22m) kubelet Started container fluent-bit
Normal Pulled 21m kubelet Successfully pulled image "public.ecr.aws/aws-observability/aws-for-fluent-bit:latest" in 145.097853ms (145.109643ms including waiting)
Normal Pulling 21m (x5 over 22m) kubelet Pulling image "public.ecr.aws/aws-observability/aws-for-fluent-bit:latest"
Normal Created 21m (x5 over 22m) kubelet Created container fluent-bit
Normal Pulled 21m kubelet Successfully pulled image "public.ecr.aws/aws-observability/aws-for-fluent-bit:latest" in 153.296922ms (153.322186ms including waiting)
Warning BackOff 2m29s (x94 over 22m) kubelet
# fluent-bit.conf
fluent_bit_conf = """
[SERVICE]
Flush 5
Grace 30
Log_Level info
Daemon off
Parsers_File parsers.conf
HTTP_Server ${HTTP_SERVER}
HTTP_Listen 0.0.0.0
HTTP_Port ${HTTP_PORT}
storage.path /var/fluent-bit/state/flb-storage/
storage.sync normal
storage.checksum off
storage.backlog.mem_limit 5M
@INCLUDE application-log.conf
@INCLUDE dataplane-log.conf
@INCLUDE host-log.conf
"""
# application-log.conf
application_log_conf = """
[INPUT]
Name tail
Tag application.<namespace_name>.<deployment_name>.log
Tag_Regex ^\/var\/log\/containers\/(?<pod_name>[^_]+)_(?<namespace_name>[^_]+)_(?<deployment_name>[^_]+)-(?<docker_id>[a-z0-9]{64})\.log$
Exclude_Path /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
Path /var/log/containers/*.log
multiline.parser docker, cri
DB /var/fluent-bit/state/flb_container.db
Mem_Buf_Limit 50MB
Skip_Long_Lines On
Refresh_Interval 10
Rotate_Wait 30
storage.type filesystem
Read_from_Head ${READ_FROM_HEAD}
[INPUT]
Name tail
Tag application.*
Path /var/log/containers/cloudwatch-agent*
multiline.parser docker, cri
DB /var/fluent-bit/state/flb_cwagent.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
Read_from_Head ${READ_FROM_HEAD}
[FILTER]
Name kubernetes
Match application.*
Kube_URL https://kubernetes.default.svc:443
Kube_Tag_Prefix application.
Regex_Parser k8s-custom-tag
Merge_Log On
Merge_Log_Key log_processed
K8S-Logging.Parser On
K8S-Logging.Exclude Off
Labels Off
Annotations Off
Use_Kubelet On
Kubelet_Port 10250
Buffer_Size 0
[OUTPUT]
Name cloudwatch_logs
Match application.*
region ${AWS_REGION}
log_group_name /${CLUSTER_NAME}/application
log_stream_prefix log-
auto_create_group true
extra_user_agent container-insights
"""
# dataplane-log.conf
dataplane_log_conf = """
[INPUT]
Name systemd
Tag dataplane.systemd.*
Systemd_Filter _SYSTEMD_UNIT=docker.service
Systemd_Filter _SYSTEMD_UNIT=containerd.service
Systemd_Filter _SYSTEMD_UNIT=kubelet.service
DB /var/fluent-bit/state/systemd.db
Path /var/log/journal
Read_From_Tail ${READ_FROM_TAIL}
[INPUT]
Name tail
Tag dataplane.tail.*
Path /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
multiline.parser docker, cri
DB /var/fluent-bit/state/flb_dataplane_tail.db
Mem_Buf_Limit 50MB
Skip_Long_Lines On
Refresh_Interval 10
Rotate_Wait 30
storage.type filesystem
Read_from_Head ${READ_FROM_HEAD}
[FILTER]
Name modify
Match dataplane.systemd.*
Rename _HOSTNAME hostname
Rename _SYSTEMD_UNIT systemd_unit
Rename MESSAGE message
Remove_regex ^((?!hostname|systemd_unit|message).)*$
[FILTER]
Name aws
Match dataplane.*
imds_version v1
[OUTPUT]
Name cloudwatch_logs
Match dataplane.*
region ${AWS_REGION}
log_group_name /${CLUSTER_NAME}/dataplane
log_stream_prefix log-
auto_create_group true
extra_user_agent container-insights
"""
# host-log.conf
host_log_conf = """
[INPUT]
Name tail
Tag host.dmesg
Path /var/log/dmesg
Key message
DB /var/fluent-bit/state/flb_dmesg.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
Read_from_Head ${READ_FROM_HEAD}
[INPUT]
Name tail
Tag host.messages
Path /var/log/messages
Parser syslog
DB /var/fluent-bit/state/flb_messages.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
Read_from_Head ${READ_FROM_HEAD}
[INPUT]
Name tail
Tag host.secure
Path /var/log/secure
Parser syslog
DB /var/fluent-bit/state/flb_secure.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
Read_from_Head ${READ_FROM_HEAD}
[FILTER]
Name aws
Match host.*
imds_version v1
[OUTPUT]
Name cloudwatch_logs
Match host.*
region ${AWS_REGION}
log_group_name /${CLUSTER_NAME}/host
log_stream_prefix log-
auto_create_group true
extra_user_agent container-insights
"""
# parsers.conf
parsers_conf = r"""
[PARSER]
Name syslog
Format regex
Regex ^(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
Time_Key time
Time_Format %b %d %H:%M:%S
[PARSER]
Name container_firstline
Format regex
Regex (?<log>(?<="log":")\S(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%LZ
[PARSER]
Name cwagent_firstline
Format regex
Regex (?<log>(?<="log":")\d{4}[\/-]\d{1,2}[\/-]\d{1,2}[ T]\d{2}:\d{2}:\d{2}(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%LZ
[PARSER]
Name k8s-custom-tag
Format regex
Regex ^(?<namespace_name>[^.]+)\.(?<deployment_name>[^.]+)\.log$
"""
https://github.com/omidraha/pulumi_example/blob/main/cw/cw.py https://github.com/omidraha/pulumi_example/blob/main/cw/setup.py
Bug Report
Describe the bug
Over time I have observed that our fluent bit pods running 1.8.4 occasionally show pod restarts due to termination (kubernetes exit code 139). Interestingly, the restarts seem to only happen on a few pods in the daemonset:
To Reproduce
There doesn't seem to be anything specific that we do other than just running for a period of time.
Expected behavior
I suppose in an ideal world this error would not happen.
Your Environment
Additional context
It doesn't seem to affect us other than tripping our restart monitors. We're not sure why it's restarting (eg: is it something we're doing?)