This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.
I got a similar issue:
[2021/11/14 18:32:22] [engine] caught signal (SIGSEGV)
#0 0x55585d8f2fa1 in destroy_chunk_list() at lib/msgpack-c/src/zone.c:39
#1 0x55585d8f336f in msgpack_zone_destroy() at lib/msgpack-c/src/zone.c:175
#2 0x55585d8f3492 in msgpack_zone_free() at lib/msgpack-c/src/zone.c:220
#3 0x55585d8f0ac2 in msgpack_unpacked_destroy() at lib/msgpack-c/include/msgpack/unpack.h:260
#4 0x55585d8f2e02 in msgpack_unpack_next() at lib/msgpack-c/src/unpack.c:661
#5 0x55585d4a1392 in elasticsearch_format() at plugins/out_es/es.c:358
#6 0x55585d4a20f2 in cb_es_flush() at plugins/out_es/es.c:772
#7 0x55585d41427c in output_pre_cb_flush() at include/fluent-bit/flb_output.h:514
#8 0x55585d911046 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#9 0x7ff320a13234 in ???() at ???:0
or
[2021/11/16 23:07:04] [debug] [input:tail:tail.1] 0 new files found on path '/var/log/syslog'
[2021/11/16 23:07:04] [debug] [input:tail:tail.0] inode=40894548 events: IN_MODIFY
[2021/11/16 23:07:05] [debug] [input:tail:tail.0] inode=40894626 events: IN_MODIFY
[2021/11/16 23:07:05] [debug] [output:es:es.0] HTTP Status=400 URI=/_bulk
[2021/11/16 23:07:05] [error] [output:es:es.0] HTTP status=400 URI=/_bulk, response:
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Malformed action/metadata line [3], expected START_OBJECT or END_OBJECT but found [VALUE_STRING]"}],"type":"illegal_argument_exception","reason":"Malformed action/metadata line [3], expected START_OBJECT or END_OBJECT but found [VALUE_STRING]"},"status":400}
[2021/11/16 23:07:05] [debug] [upstream] KA connection #120 to elasticsearch-master:9200 is now available
[2021/11/16 23:07:05] [debug] [input:tail:tail.0] inode=40894548 events: IN_MODIFY
[2021/11/16 23:07:05] [debug] [http_client] not using http_proxy for header
[2021/11/16 23:07:05] [debug] [http_client] not using http_proxy for header
[2021/11/16 23:07:06] [debug] [out_es] converted_size is 0
[2021/11/16 23:07:06] [engine] caught signal (SIGSEGV)
#0 0x7f75158f9701 in ???() at ???:0
or
[2021/11/25 11:16:36] [error] [output:es:es.0] HTTP status=400 URI=/_bulk, response:
{"error":{"root_cause":[{"type":"json_parse_exception","reason":"Illegal character ((CTRL-CHAR, code 0)): only regular white space (\\r, \\n, \\t) is allowed between tokens\n at [Source: (ByteArrayInputStream); line: 1, column: 2]"}],"type":"json_parse_exception","reason":"Illegal character ((CTRL-CHAR, code 0)): only regular white space (\\r, \\n, \\t) is allowed between tokens\n at [Source: (ByteArrayInputStream); line: 1, column: 2]"},"status":400}
[2021/11/25 11:16:36] [ warn] [engine] failed to flush chunk '1-1637838990.757540396.flb', retry in 6 seconds: task_id=0, input=tail.0 > output=es.0 (out_id=0)
[2021/11/25 11:16:47] [engine] caught signal (SIGSEGV)
#0 0x55f0c1dd9861 in cio_chunk_is_up() at lib/chunkio/src/cio_chunk.c:474
#1 0x55f0c1b047af in flb_input_chunk_flush() at src/flb_input_chunk.c:1456
#2 0x55f0c1ae3d4b in flb_engine_dispatch() at src/flb_engine_dispatch.c:243
#3 0x55f0c1ae1e25 in flb_engine_flush() at src/flb_engine.c:115
#4 0x55f0c1ae31d1 in flb_engine_handle_event() at src/flb_engine.c:396
#5 0x55f0c1ae31d1 in flb_engine_start() at src/flb_engine.c:696
#6 0x55f0c1ac581e in flb_lib_worker() at src/flb_lib.c:627
#7 0x7f10f3fbefa2 in ???() at ???:0
#8 0x7f10f369f4ce in ???() at ???:0
#9 0xffffffffffffffff in ???() at ???:0
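For context: Elasticsearch's /_bulk endpoint expects newline-delimited JSON, alternating an action/metadata line and a document line, so a NUL byte (CTRL-CHAR, code 0) anywhere in the payload produces exactly the json_parse_exception above. A minimal sketch of a well-formed payload (made-up index and values):
{"create":{"_index":"logstash-2021.11.25","_id":"example-id-1"}}
{"@timestamp":"2021-11-25T11:16:36.000Z","log":"first record"}
{"create":{"_index":"logstash-2021.11.25","_id":"example-id-2"}}
{"@timestamp":"2021-11-25T11:16:37.000Z","log":"second record"}
The 400s therefore suggest the request body is being corrupted before it is sent.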
Is it possible a parsing issue is creating a segfault?
"Is it possible a parsing issue is creating a segfault?"
I think so. I am currently experimenting with adding "more and more" functionality back into my parser.
Here's the parser line that I was using when I see the segfaults:
[PARSER]
Name elasticsearch_multiline
Format regex
Regex ^*\\"type\\": \\"(?<type>server)\\", \\"timestamp\\": \\"(?<time>[^\\]*)\\", \\"level\\": \\"(?<severity>[^\\]*)\\", \\"component\\": \\"(?<component>[^\\]*)\\", \\"cluster.name\\": \\"(?<cluster_name>[^\\]*)\\", \\"node.name\\": \\"(?<node_name>[^\\]*)\\", \\"message\\": \\"(?<message>[^\\]*)\\", \\"cluster.uuid\\": \\"(?<cluster_uuid>[^\\]*)\\", \\"node.id\\": \\"(?<node_id>[^\\]*)\\"[,]?[\s]?(?<stacktrace>[^\`]*$)
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S,%L%z
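For reference, that regex targets the docker-escaped form of Elasticsearch's JSON server logs; a made-up example of the kind of line it is aimed at:
{\"type\": \"server\", \"timestamp\": \"2021-11-14T18:32:22,123Z\", \"level\": \"INFO\", \"component\": \"o.e.n.Node\", \"cluster.name\": \"es\", \"node.name\": \"es-0\", \"message\": \"started\", \"cluster.uuid\": \"abc123\", \"node.id\": \"xyz789\"}
(Whether the leading ^* behaves as intended is another question: a quantifier right after ^ has nothing to repeat, and regex engines handle that inconsistently.)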
I did a test replacing the elasticsearch output with a simple stdout output, and I no longer get a segfault. elasticsearch output:
[OUTPUT]
Name es
Match *
Host elasticsearch-master
Port 9200
tls Off
tls.verify Off
Generate_ID On
Logstash_Format On
Time_Key_Format %Y-%m-%dT%H:%M:%S
Replace_Dots On
Time_Key ts
Retry_Limit False
Buffer_Size 64KB
stdout used:
[OUTPUT]
Name stdout
Match *
@bstolz can you try on your side with stdout, to confirm it could be an issue with the elasticsearch output?
Ahh you're using the elasticsearch exporter - I am exporting to stackdriver and getting the same. I can try standard out on a non-production environment.
I upgraded to 1.8.10 and tried both the stdout and es outputs. Here is what I have from logs fluent-bit-xjmdk:
Fluent Bit v1.8.10
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
Invalid configuration value at FILTER.Name
[2021/11/30 09:07:34] [ info] [engine] started (pid=1)
[2021/11/30 09:07:34] [ info] [storage] version=1.1.5, initializing...
[2021/11/30 09:07:34] [ info] [storage] in-memory
[2021/11/30 09:07:34] [ info] [storage] full synchronization mode, checksum enabled, max_chunks_up=128
[2021/11/30 09:07:34] [ info] [cmetrics] version=0.2.2
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] multiline core started
[2021/11/30 09:07:34] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2021/11/30 09:07:34] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2021/11/30 09:07:34] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2021/11/30 09:07:34] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
[2021/11/30 09:07:34] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2021/11/30 09:07:34] [ info] [sp] stream processor started
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18224257 watch_fd=3 name=/var/log/containers/cilium-mlvm6_kube-system_cilium-agent-bdfb18631cb184400d7045bcf03e01eb592c8737af9963dead8eee41c6cb3aea.log
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18224846 watch_fd=4 name=/var/log/containers/cilium-mlvm6_kube-system_cilium-agent-ee021432c3c584782aba98c77d8d6d90fca3be851e42403b68fb9911926829c1.log
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18224845 watch_fd=5 name=/var/log/containers/cilium-mlvm6_kube-system_clean-cilium-state-2ad64fcca13c2b12be4b85b06118e09f82dfe372ec39dd02cd214be83951ccdc.log
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18224823 watch_fd=6 name=/var/log/containers/cilium-mlvm6_kube-system_mount-cgroup-7e466ef8836bece0943fa6a65d9b2c9426f23c386b779cf3f4325ab7bc81ca02.log
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18224838 watch_fd=7 name=/var/log/containers/cilium-mlvm6_kube-system_wait-for-node-init-2c1a981f06c37c999741ac23a609afdde95f05d80880d296a2935a34f4044887.log
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18220129 watch_fd=8 name=/var/log/containers/cilium-node-init-5cm7k_kube-system_node-init-26156fdfe5ef2b8a28e4f497bf5ae796c9fd54847979f4ceec5e7842158e13f3.log
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18224480 watch_fd=9 name=/var/log/containers/cilium-node-init-5cm7k_kube-system_node-init-d760121abf51c6dde80769dddf2c54dee2342588fbc83974ea7a0ce2c1a0812c.log
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18354769 watch_fd=11 name=/var/log/containers/fluent-bit-xjmdk_monitoring_fluent-bit-b731b932d1f4455c1bc1267337a799e3d09a6388774c27362ce1d0281ffc2338.log
...
[2021/11/30 09:07:41] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18354768 watch_fd=47 name=/var/log/containers/fluent-bit-xjmdk_monitoring_fluent-bit-f004961b80d7ac0bc5a58506270837eb5e34350abd0250478afee6925727283d.log
[2021/11/30 09:07:34] [ info] [input:tail:tail.1] inotify_fs_add(): inode=18089658 watch_fd=1 name=/var/log/syslog
[2021/11/30 09:07:41] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:41] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:42] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:42] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:42] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:42] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:42] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:42] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:42] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:42] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:43] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:43] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
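Assuming the es output is still configured with Buffer_Size 64KB as shown earlier, these warnings line up with that cap: the HTTP client refuses to grow its response buffer past 64000 bytes, so a ~96 KB Elasticsearch reply gets truncated. Truncation alone should not crash, but it does show the bulk responses are unusually large, likely full of per-item errors.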
[2021/11/30 09:08:38] [engine] caught signal (SIGSEGV)
#0 0x5566019fcb51 in destroy_chunk_list() at lib/msgpack-c/src/zone.c:39
#1 0x5566019fcf1f in msgpack_zone_destroy() at lib/msgpack-c/src/zone.c:175
#2 0x5566019fd042 in msgpack_zone_free() at lib/msgpack-c/src/zone.c:220
#3 0x5566019fa672 in msgpack_unpacked_destroy() at lib/msgpack-c/include/msgpack/unpack.h:260
#4 0x5566019fc9b2 in msgpack_unpack_next() at lib/msgpack-c/src/unpack.c:661
#5 0x5566015aaa88 in elasticsearch_format() at plugins/out_es/es.c:358
#6 0x5566015ab7e8 in cb_es_flush() at plugins/out_es/es.c:772
#7 0x55660151d28e in output_pre_cb_flush() at include/fluent-bit/flb_output.h:514
#8 0x556601a1ac06 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#9 0x7ffbf3c13234 in ???() at ???:0
First time I saw this line: Invalid configuration value at FILTER.Name
lol, but I think it is not related to the issue.
UPDATE: fixed the Invalid configuration value at FILTER.Name error,
and I still have the segfault:
[2021/11/30 09:54:38] [engine] caught signal (SIGSEGV)
#0 0x7f248b8e0820 in ???() at ???:0
#1 0x5571447fc2cd in msgpack_sbuffer_write() at lib/msgpack-c/include/msgpack/sbuffer.h:88
#2 0x5571447fc09f in msgpack_pack_str_body() at lib/msgpack-c/include/msgpack/pack_template.h:784
#3 0x5571447fe0cb in flb_ml_flush_stream_group() at src/multiline/flb_ml.c:981
#4 0x5571447fc5d1 in flb_ml_flush_parser_instance() at src/multiline/flb_ml.c:117
#5 0x5571447fd3bd in flb_ml_append() at src/multiline/flb_ml.c:617
#6 0x55714483a4c3 in process_content() at plugins/in_tail/tail_file.c:375
#7 0x55714483c9f1 in flb_tail_file_chunk() at plugins/in_tail/tail_file.c:1210
#8 0x557144834ea0 in in_tail_collect_event() at plugins/in_tail/tail.c:261
#9 0x557144836079 in tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:268
#10 0x5571447ddc16 in flb_input_collector_fd() at src/flb_input.c:1108
#11 0x5571447f03fa in flb_engine_handle_event() at src/flb_engine.c:411
#12 0x5571447f03fa in flb_engine_start() at src/flb_engine.c:696
#13 0x5571447d281e in flb_lib_worker() at src/flb_lib.c:627
#14 0x7f248c19cfa2 in ???() at ???:0
#15 0x7f248b87d4ce in ???() at ???:0
#16 0xffffffffffffffff in ???() at ???:0
As mentioned previously, I am logging to stackdriver and upon changing it to stdout I still see the same issue. I'm slowly removing some of the "extra" lines I have in the input block - multiline, skip long lines, and reduction of the buffer_max_size to see how it goes. Then I will set all of the buffers to default. After that, I will start investigating the regex.
thanks for the info, would you please provide a simple repro case so we can troubleshoot?
I am observing this running on all of our elasticsearch 7 clusters (ECK) in GKE - it happens a few times in the period of 24hrs. I am using the same configmap for other clusters not running elasticsearch and I do not see the issue.
I am not sure how to provide a "simple" repro case given an elasticsearch cluster is involved. I've included the relevant bits of the configmap. Let me know how else I can provide what you need. I can also extend an offer to help, as I do have a demo environment.
[INPUT]
Name tail
Path /var/log/containers/*_elasticsearch*.log
Exclude_Path /var/log/containers/*_kube-system_*.log,/var/log/containers/*_istio-system_*.log,/var/log/containers/*_knative-serving_*.log,/var/log/containers/*_gke-system_*.log,/var/log/containers/*_config-management-system_*.log,*_istio-proxy*.log,/var/log/containers/*_dd-agent*.log, /var/log/containers/*_fluent-bit-custom*.log
Tag kube.elasticsearch.*
DB /var/log/flb_elasticsearch.db
Read_from_head true
Multiline off
Parser_Firstline elasticsearch_multiline
Mem_Buf_Limit 20MB
Refresh_Interval 10
Buffer_Chunk_Size 256k
Buffer_Max_Size 1024k
[FILTER]
Name parser
Match kube.*
Key_Name log
Reserve_Data True
Parser docker
[FILTER]
Name modify
Match *
Hard_rename log message
[FILTER]
Name parser
Match kube.*
Key_Name message
Reserve_Data True
Parser glog
Parser json
[OUTPUT]
Name stackdriver
Match kube.*
resource k8s_container
k8s_cluster_name <redacted>
k8s_cluster_location <redacted>
severity_key severity
[PARSER]
Name elasticsearch_multiline
Format regex
Regex ^*\\"type\\": \\"(?<type>server)\\", \\"timestamp\\": \\"(?<time>[^\\]*)\\", \\"level\\": \\"(?<severity>[^\\]*)\\", \\"component\\": \\"(?<component>[^\\]*)\\", \\"cluster.name\\": \\"(?<cluster_name>[^\\]*)\\", \\"node.name\\": \\"(?<node_name>[^\\]*)\\", \\"message\\": \\"(?<message>[^\\]*)\\", \\"cluster.uuid\\": \\"(?<cluster_uuid>[^\\]*)\\", \\"node.id\\": \\"(?<node_id>[^\\]*)\\"[,]?[\s]?(?<stacktrace>[^\`]*$)
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S,%L%z
[PARSER]
Name json
Format json
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
[PARSER]
Name glog
Format regex
Regex ^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source_file>[^ \]]+)\:(?<source_line>\d+)\]\s(?<message>.*)$
Time_Key time
Time_Format %m%d %H:%M:%S.%L%z
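For reference, the glog parser above targets klog-style lines; a made-up example it would match:
I1130 09:07:34.123456    1234 controller.go:42] Reconciling deployment foo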
I did a new install of the latest chart for elasticsearch (docker image: docker.elastic.co/elasticsearch/elasticsearch:7.15.0).
I cleaned my configuration to keep fewer parsers/filters, and it still crashes with this configuration:
[SERVICE]
Config_Watch On
# Flush
# =====
# set an interval of seconds before to flush records to a destination
flush 1
# Daemon
# ======
# instruct Fluent Bit to run in foreground or background mode.
daemon Off
# Log_Level
# =========
# Set the verbosity level of the service, values can be:
#
# - error
# - warning
# - info
# - debug
# - trace
#
# by default 'info' is set, that means it includes 'error' and 'warning'.
log_level info
# Parsers File
# ============
# specify an optional 'Parsers' configuration file
parsers_file parsers.conf
parsers_file custom_parsers.conf
# HTTP Server
# ===========
# Enable/Disable the built-in HTTP Server for metrics
http_server On
http_listen 0.0.0.0
http_port 2020
# Storage
# =======
storage.sync full
storage.checksum on
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/*.log
# Except fluent-bit to avoid feedback loop
Exclude_Path ["*fluent-bit*.log"]
DB /var/log/fluentbit/flb_kube.db
Mem_Buf_Limit 64MB
Skip_Long_Lines On
Refresh_Interval 10
# Parser for containerd (Container Runtime Interface)
multiline.parser cri
[INPUT]
Name tail
Tag system.*
Path /var/log/syslog
Parser rsyslog
DB /var/log/fluentbit/flb_syslog.db
Mem_Buf_Limit 64MB
Skip_Long_Lines On
Refresh_Interval 10
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
Merge_Log On
Keep_Log Off
K8S-Logging.Parser On
K8S-Logging.Exclude Off
[OUTPUT]
Name es
Match *
Host elasticsearch-master
Port 9200
tls Off
tls.verify Off
Generate_ID On
Logstash_Format On
Time_Key_Format %Y-%m-%dT%H:%M:%S
Replace_Dots On
Time_Key ts
Retry_Limit False
Buffer_Size 64KB
[PARSER]
Name rsyslog
Format regex
Regex (?<syslog_time>^[A-Za-z]+ [ 1-3][0-9] [0-9]{2}:[0-9]{2}:[0-9]{2}) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<log>.*)$
Time_Key syslog_time
Time_Format %b %d %H:%M:%S
Time_Keep On
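(For reference, the rsyslog parser above matches classic syslog lines such as this made-up example: Nov 30 16:56:31 myhost systemd[1]: Started Daily apt upgrade and clean activities.)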
log:
Fluent Bit v1.8.10
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2021/11/30 16:56:31] [ info] [engine] started (pid=1)
[2021/11/30 16:56:31] [ info] [storage] version=1.1.5, initializing...
[2021/11/30 16:56:31] [ info] [storage] in-memory
[2021/11/30 16:56:31] [ info] [storage] full synchronization mode, checksum enabled, max_chunks_up=128
[2021/11/30 16:56:31] [ info] [cmetrics] version=0.2.2
[2021/11/30 16:56:31] [ info] [input:tail:tail.0] multiline core started
[2021/11/30 16:56:31] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2021/11/30 16:56:31] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2021/11/30 16:56:31] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2021/11/30 16:56:31] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
[2021/11/30 16:56:31] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2021/11/30 16:56:31] [ info] [sp] stream processor started
[2021/11/30 16:56:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=56885569 watch_fd=2 name=/var/log/containers/cilium-kvwhq_kube-system_cilium-agent-9f584588e4be9ae6ee66cfa45ad7458d2d891b6fb680d2f18bd2910f62bd142d.log
[2021/11/30 16:56:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=56885566 watch_fd=3 name=/var/log/containers/cilium-kvwhq_kube-system_clean-cilium-state-4949ccf208b74e3c36949068cec5bf4a8f22308e6ddf32a146294ee0940c2fe1.log
[2021/11/30 16:56:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=56885535 watch_fd=4 name=/var/log/containers/cilium-kvwhq_kube-system_mount-cgroup-b2939032ba97cc068dce2ac616d6a4929faff24f7304231ce62a4a1a5c24c1b3.log
[2021/11/30 16:56:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=56885563 watch_fd=5 name=/var/log/containers/cilium-kvwhq_kube-system_wait-for-node-init-c5d20e3c171339e307120e69fa64baa706b4d83a1514470666ad184ae1542ad8.log
[2021/11/30 16:56:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=56885560 watch_fd=6 name=/var/log/containers/cilium-node-init-2v4n6_kube-system_node-init-d5fab00554f962d3cea541308f6358e850638d0643f923ef5347c8c28a92c757.log
[2021/11/30 16:56:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=56885364 watch_fd=7 name=/var/log/containers/fluent-bit-9h2rs_monitoring_fluent-bit-ed4d4dc0f608994a6bcb613bc49a121b2fd74d4aebb5cace4fc6f3f316b6a660.log
...
[2021/11/30 16:56:31] [ info] [input:tail:tail.1] inotify_fs_add(): inode=56885366 watch_fd=1 name=/var/log/syslog
[2021/11/30 16:58:24] [engine] caught signal (SIGSEGV)
#0 0x55ff5827cb51 in destroy_chunk_list() at lib/msgpack-c/src/zone.c:39
#1 0x55ff5827cf1f in msgpack_zone_destroy() at lib/msgpack-c/src/zone.c:175
#2 0x55ff5827d042 in msgpack_zone_free() at lib/msgpack-c/src/zone.c:220
#3 0x55ff5827a672 in msgpack_unpacked_destroy() at lib/msgpack-c/include/msgpack/unpack.h:260
#4 0x55ff5827c9b2 in msgpack_unpack_next() at lib/msgpack-c/src/unpack.c:661
#5 0x55ff57e2aa88 in elasticsearch_format() at plugins/out_es/es.c:358
#6 0x55ff57e2b7e8 in cb_es_flush() at plugins/out_es/es.c:772
#7 0x55ff57d9d28e in output_pre_cb_flush() at include/fluent-bit/flb_output.h:514
#8 0x55ff5829ac06 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#9 0x7f19e864c04d in ???() at ???:0
Another log after the pod crashed:
[2021/11/30 17:02:06] [error] [output:es:es.0] HTTP status=400 URI=/_bulk, response:
{"error":{"root_cause":[{"type":"json_parse_exception","reason":"Illegal character ((CTRL-CHAR, code 0)): only regular white space (\\r, \\n, \\t) is allowed between tokens\n at [Source: (ByteArrayInputStream); line: 1, column: 2]"}],"type":"json_parse_exception","reason":"Illegal character ((CTRL-CHAR, code 0)): only regular white space (\\r, \\n, \\t) is allowed between tokens\n at [Source: (ByteArrayInputStream); line: 1, column: 2]"},"status":400}
[2021/11/30 17:02:06] [ warn] [engine] failed to flush chunk '\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u00�*ec', retry in 9 seconds: task_id=0, input=tail.0 > output=es.0 (out_id=0)
[2021/11/30 17:02:07] [ warn] [engine] failed to flush chunk '1-1638291726.807180974.flb', retry in 11 seconds: task_id=3, input=tail.0 > output=es.0 (out_id=0)
[2021/11/30 17:02:17] [error] [output:es:es.0] HTTP status=400 URI=/_bulk, response:
{"error":{"root_cause":[{"type":"json_parse_exception","reason":"Illegal character ((CTRL-CHAR, code 0)): only regular white space (\\r, \\n, \\t) is allowed between tokens\n at [Source: (ByteArrayInputStream); line: 1, column: 2]"}],"type":"json_parse_exception","reason":"Illegal character ((CTRL-CHAR, code 0)): only regular white space (\\r, \\n, \\t) is allowed between tokens\n at [Source: (ByteArrayInputStream); line: 1, column: 2]"},"status":400}
[2021/11/30 17:02:17] [ warn] [engine] failed to flush chunk '00\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u
', retry in 16 seconds: task_id=0, input=tail.0 > output=es.0 (out_id=0)
[2021/11/30 17:02:18] [ warn] [engine] failed to flush chunk '1-1638291726.807180974.flb', retry in 17 seconds: task_id=3, input=tail.0 > output=es.0 (out_id=0)
[2021/11/30 17:02:36] [ warn] [engine] failed to flush chunk '1-1638291726.807180974.flb', retry in 8 seconds: task_id=3, input=tail.0 > output=es.0 (out_id=0)
[2021/11/30 17:02:43] [ warn] [engine] failed to flush chunk '1-1638291726.807180974.flb', retry in 58 seconds: task_id=3, input=tail.0 > output=es.0 (out_id=0)
[2021/11/30 17:03:22] [engine] caught signal (SIGSEGV)
#0 0x7f6316a0c7b0 in ???() at ???:0
#1 0x55600e56093a in flb_sds_cat() at src/flb_sds.c:132
#2 0x55600e5609a4 in flb_sds_cat_safe() at src/flb_sds.c:145
#3 0x55600e583b82 in package_content() at src/multiline/flb_ml.c:324
#4 0x55600e584030 in process_append() at src/multiline/flb_ml.c:495
#5 0x55600e58424e in ml_append_try_parser() at src/multiline/flb_ml.c:564
#6 0x55600e584456 in flb_ml_append() at src/multiline/flb_ml.c:630
#7 0x55600e5c14c3 in process_content() at plugins/in_tail/tail_file.c:375
#8 0x55600e5c39f1 in flb_tail_file_chunk() at plugins/in_tail/tail_file.c:1210
#9 0x55600e5bbea0 in in_tail_collect_event() at plugins/in_tail/tail.c:261
#10 0x55600e5bd079 in tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:268
#11 0x55600e564c16 in flb_input_collector_fd() at src/flb_input.c:1108
#12 0x55600e5773fa in flb_engine_handle_event() at src/flb_engine.c:411
#13 0x55600e5773fa in flb_engine_start() at src/flb_engine.c:696
#14 0x55600e55981e in flb_lib_worker() at src/flb_lib.c:627
#15 0x7f63172c8fa2 in ???() at ???:0
#16 0x7f63169a94ce in ???() at ???:0
#17 0xffffffffffffffff in ???() at ???:0
Full log with trace enabled: fluent-bit-9h2rs.log.zip
I saw the CPU usage is high; I could allow more CPU, or remove checksum and full sync.
I added
storage.sync full
storage.checksum on
because I read an issue that recommended enabling them when there are too many warnings like: [ warn] [engine] failed to flush chunk '1-1638291872.61468128.flb', retry in 11 seconds: task_id=1, input=tail.0 > output=es.0 (out_id=0)
UPDATE: removed storage.sync full & storage.checksum on, and increased CPU to 2 (usage reaches 1 CPU): no change for the segfault.
I ran valgrind on a debug build of the master branch (9ccc6f89171d6792f9605b77b4efd27bc28f472a): valgrind-fluent-bit-w67h2-2.log
And another run with option --leak-check=full: valgrind-fluent-bit-w67h2-3.log
Both runs crash with the same stack:
#0 0x483c933 in ???() at ???:0
#1 0x2b0116 in es_bulk_append() at plugins/out_es/es_bulk.c:105
#2 0x2ada44 in elasticsearch_format() at plugins/out_es/es.c:537
#3 0x2ae828 in cb_es_flush() at plugins/out_es/es.c:772
#4 0x21f072 in output_pre_cb_flush() at include/fluent-bit/flb_output.h:511
#5 0x71c0c6 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
It seems output_pre_cb_flush() is present in all the SEGV stacks @bstolz and I got (https://github.com/fluent/fluent-bit/blob/master/include/fluent-bit/flb_output.h#L511)
@bstolz I notice you use
Parser_Firstline elasticsearch_multiline
and I am using:
multiline.parser cri
I think there is an issue in the flushing part of the multiline parser. With gdb, I got this backtrace:
#0 __memmove_avx_unaligned_erms ()
at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:384
#1 0x0000555555688943 in msgpack_sbuffer_write (data=0x7ffff61f4280,
buf=0x7ffff4aee010 "0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u00"..., len=1968975920)
at /tmpdir/fluent-bit/lib/msgpack-c/include/msgpack/sbuffer.h:88
#2 0x0000555555688715 in msgpack_pack_str_body (x=0x7ffff61f4270,
b=0x7ffff4aee010, l=1968975920)
at /tmpdir/fluent-bit/lib/msgpack-c/include/msgpack/pack_template.h:784
#3 0x000055555568a8e4 in flb_ml_flush_stream_group (ml_parser=0x7ffff663e070,
mst=0x7ffff543cf50, group=0x7ffff49e4180)
at /tmpdir/fluent-bit/src/multiline/flb_ml.c:1012
#4 0x0000555555688c47 in flb_ml_flush_parser_instance (ml=0x7ffff5452150,
parser_i=0x7ffff5441f00, stream_id=17521209660683586930)
at /tmpdir/fluent-bit/src/multiline/flb_ml.c:117
#5 0x0000555555689a33 in flb_ml_append (ml=0x7ffff5452150,
stream_id=951830318436279800, type=0, tm=0x7ffff61f4490,
buf=0x7ffff5527d40, size=182)
at /tmpdir/fluent-bit/src/multiline/flb_ml.c:617
#6 0x00005555556c7532 in process_content (file=0x7ffff542f080,
bytes=0x7ffff61f45e8) at /tmpdir/fluent-bit/plugins/in_tail/tail_file.c:375
#7 0x00005555556c9a60 in flb_tail_file_chunk (file=0x7ffff542f080)
at /tmpdir/fluent-bit/plugins/in_tail/tail_file.c:1210
#8 0x00005555556c1f0f in in_tail_collect_event (file=0x7ffff542f080,
config=0x7ffff6632f80) at /tmpdir/fluent-bit/plugins/in_tail/tail.c:261
#9 0x00005555556c30e8 in tail_fs_event (ins=0x7ffff6648f00,
config=0x7ffff6632f80, in_context=0x7ffff5456000)
at /tmpdir/fluent-bit/plugins/in_tail/tail_fs_inotify.c:268
#10 0x0000555555669c67 in flb_input_collector_fd (fd=27, config=0x7ffff6632f80)
at /tmpdir/fluent-bit/src/flb_input.c:1108
#11 0x000055555567c8bc in flb_engine_handle_event (config=0x7ffff6632f80,
mask=1, fd=27) at /tmpdir/fluent-bit/src/flb_engine.c:411
#12 flb_engine_start (config=0x7ffff6632f80)
at /tmpdir/fluent-bit/src/flb_engine.c:696
#13 0x000055555565e83f in flb_lib_worker (data=0x7ffff660b340)
at /tmpdir/fluent-bit/src/flb_lib.c:627
#14 0x00007ffff7fb1fa3 in start_thread (arg=<optimized out>)
at pthread_create.c:486
#15 0x00007ffff76924cf in clone ()
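One extra data point from that backtrace: the bogus length 1968975920 is 0x755C3030, whose little-endian bytes are '0', '0', '\', 'u', i.e. a fragment of the \u0000 escape text visible in buf. If that reading is right, the length field itself was overwritten by log content, which points to the same buffer corruption.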
I added one debug log in the es plugin: https://github.com/GuillaumeSmaha/fluent-bit/commit/5b908dd8edb0a6332718ba2e5460892e7929a9a1. That log should never fire, because from what I understand it means that even after resizing, the buffer is still smaller than the required size used in the memcpy at the end of the function: https://github.com/fluent/fluent-bit/blob/3197e975143b3982d5ce214996e3595f6f12bab3/plugins/out_es/es_bulk.c#L105 (a sketch of the pattern follows the gdb output below).
And I got a segv just after:
[2021/12/08 11:55:05] [debug] [out_es] converted_size is 0
[2021/12/08 11:55:05] [debug] [out_es] required size (197133, append: 4096) is still superior to available buffer (37999, previous available: 33903)
Thread 2 "flb-pipeline" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff61ff700 (LWP 45)]
__memmove_avx_unaligned_erms ()
at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:262
262 ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
(gdb) bt
#0 __memmove_avx_unaligned_erms ()
at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:262
#1 0x00005555556fc274 in es_bulk_append (bulk=0x7ffff5400d60,
index=0x7ffff4ac0a40 "{\"create\":{\"_index\":\"logstash-2021.12.08\",\"_type\":\"_doc\",\"_id\":\"a76df76b-d2c7-9d2b-e172-93e0ec3c3d4f\"}}\n", i_len=104,
json=0x7ffff253fad0 "{\"@timestamp\":\"2021-12-08T11:54:07.027Z\",\"time\":\"2021-12-08T11:54:07.027621077Z\",\"stream\":\"stdout\",\"_p\":\"P\",\"log\":\"\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\"..., j_len=196967, whole_size=33903, converted_size=0)
at /tmpdir/fluent-bit/plugins/out_es/es_bulk.c:111
#2 0x00005555556f9af9 in elasticsearch_format (config=0x7ffff6632f80,
ins=0x7ffff6648f00, plugin_context=0x7ffff54a9600, flush_ctx=0x0,
tag=0x7ffff54a2700 "kube.var.log.containers.toolkit-jobs-job-48bfef83-e663-4c37-b55f-50ea1db9a588_toolkit-jobs_toolkit-jobs-job-48bfef83-e663-4c37-b55f-50ea1db9a588-1-db65750898be03ec7ef665854d8036800d9609adeb19a0cc5b021"...,
tag_len=215, data=0x7ffff2341f00, bytes=33903, out_data=0x7ffff4ac0f60,
out_size=0x7ffff4ac0f58) at /tmpdir/fluent-bit/plugins/out_es/es.c:537
#3 0x00005555556fa8c5 in cb_es_flush (data=0x7ffff2341f00, bytes=33903,
tag=0x7ffff54a2700 "kube.var.log.containers.toolkit-jobs-job-48bfef83-e663-4c37-b55f-50ea1db9a588_toolkit-jobs_toolkit-jobs-job-48bfef83-e663-4c37-b55f-50ea1db9a588-1-db65750898be03ec7ef665854d8036800d9609adeb19a0cc5b021"...,
tag_len=215, ins=0x7ffff6648f00, out_context=0x7ffff54a9600,
config=0x7ffff6632f80) at /tmpdir/fluent-bit/plugins/out_es/es.c:772
--Type <RET> for more, q to quit, c to continue without paging--
#4 0x000055555566b2df in output_pre_cb_flush ()
at /tmpdir/fluent-bit/include/fluent-bit/flb_output.h:514
#5 0x0000555555b69d47 in co_init ()
at /tmpdir/fluent-bit/lib/monkey/deps/flb_libco/amd64.c:117
#6 0x0000000000000000 in ?? ()
(gdb)
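To make that failure mode concrete, here is a minimal sketch of the append/grow pattern involved; names and sizes are illustrative, not the actual es_bulk.c code (see the real function linked above):

#include <stdlib.h>
#include <string.h>

struct bulk {
    char   *ptr;   /* buffer */
    size_t  len;   /* bytes used */
    size_t  size;  /* bytes allocated */
};

/* Append one entry to the bulk buffer, growing it on demand. */
static int bulk_append(struct bulk *b, const char *json, size_t j_len)
{
    size_t required = b->len + j_len + 1;

    if (required > b->size) {
        /* Bug pattern: growing by a fixed chunk, or by an estimate based
         * on the wrong size variable, leaves size < required when j_len
         * is large. */
        size_t new_size = b->size + 4096;   /* broken */
        /* size_t new_size = required; */   /* fixed  */
        char *tmp = realloc(b->ptr, new_size);
        if (!tmp) {
            return -1;
        }
        b->ptr  = tmp;
        b->size = new_size;
    }

    /* If size is still < required here, this memcpy writes past the end of
     * the heap allocation, matching the __memmove_avx_unaligned_erms frame
     * above (j_len=196967 vs roughly 38K available). */
    memcpy(b->ptr + b->len, json, j_len);
    b->len += j_len;
    return 0;
}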
@GuillaumeSmaha great digging, thank you.
@bstolz I will create another issue, because the bug found is inside the elasticsearch output plugin. Reverting https://github.com/fluent/fluent-bit/pull/3788 seems to fix my issue (fluent-bit doesn't crash anymore). Maybe there is another issue with the multiline parser (see the backtrace I got here: https://github.com/fluent/fluent-bit/issues/4195#issuecomment-988730077).
I just created https://github.com/fluent/fluent-bit/issues/4412
@bstolz My fix on the es output was merged on master and branch 1.8. I am not sure it will fix your issue, but you can try.
@GuillaumeSmaha wonderful! Thank you!
That was quick ;-) I appreciate you diving in and posting that example.
I have tried fluent-bit versions 1.7.x and 1.8.x and they all crash using the tail input and elastic output with heavy JSON-formatted logs. I tried the change in commit 0a061fb. Fluent-bit was a bit more stable but still crashed. I have gone back to version 1.6.10, which "just runs".
I have noticed another issue with the 1.8.x series: it continually logs errors on writing to elastic, but the "status" element in the elastic response only ever shows status 201, which should mean "document created" and not an error. When I see errors reported with 1.6, I always see a real error status in the response, like a 429 for example. I think the elastic response handling has also been broken.
@fhitchen can you try on the latest 1.8.11 with this change: https://github.com/fluent/fluent-bit/pull/4361 (1.8.11 already includes a change like https://github.com/fluent/fluent-bit/commit/0a061fbb73922dee737bc50837ca7a3fc7b3001c)
So I took the nokute78:fix_4311_es_fix branch and it ran for a few minutes before SEGV-ing. I'm back to 1.6.20.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
Hi, I just built and tested the 1.8.13 fluent-bit release and it crashed after processing for about 5 minutes. Again, 1.6.20 works flawlessly, so something has been badly broken in the elastic plugin's memory management since 1.7. It does not appear to be related to the input data: I have tried reprocessing the same file and it does not crash in the same place. If I replace the output with stdout, it does not crash.
Fluent Bit v1.8.13
* Copyright (C) 2015-2021 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2022/03/21 14:59:12] [engine] caught signal (SIGSEGV)
#0 0x7ffff6647fbe in ???() at ???:0
#1 0x44dde2 in flb_free() at include/fluent-bit/flb_mem.h:122
#2 0x44ef11 in flb_sds_destroy() at src/flb_sds.c:408
#3 0x4e1dfb in elasticsearch_format() at plugins/out_es/es.c:540
#4 0x4e2b6b in cb_es_flush() at plugins/out_es/es.c:772
#5 0x456a08 in output_pre_cb_flush() at include/fluent-bit/flb_output.h:517
#6 0x787905 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
Segmentation fault
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.
Similar or related issue #3585:
$ kubectl get pods -n amazon-cloudwatch
NAME READY STATUS RESTARTS AGE
fluent-bit-8t7fk 0/1 CrashLoopBackOff 6 (2m42s ago) 8m29s
fluent-bit-98w2b 0/1 CrashLoopBackOff 6 (2m34s ago) 8m29s
fluent-bit-chzjh 0/1 CrashLoopBackOff 6 (2m45s ago) 8m29s
fluent-bit-gfs7x 0/1 CrashLoopBackOff 6 (2m32s ago) 8m29s
fluent-bit-szn69 0/1 CrashLoopBackOff 6 (2m38s ago) 8m29s
fluent-bit-wxbkm 0/1 CrashLoopBackOff 6 (2m46s ago) 8m29s
$ kubectl logs fluent-bit-wxbkm -n amazon-cloudwatch
Fluent Bit v1.9.10
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2023/09/06 16:58:31] [ info] [fluent bit] version=1.9.10, commit=a2eaf59628, pid=1
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693618299.930665963.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693618299.938087318.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693655545.974952011.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693681220.89964544.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693681223.693057732.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693723157.518142203.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693731509.684250129.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693744367.53739593.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693771245.718195391.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693864172.706175934.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693866941.88886208.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693914041.952195655.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693972809.563850435.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1694017319.593944428.flb
[2023/09/06 16:58:31] [ info] [storage] version=1.4.0, type=memory+filesystem, sync=normal, checksum=disabled, max_chunks_up=128
[2023/09/06 16:58:31] [ info] [storage] backlog input plugin: storage_backlog.7
[2023/09/06 16:58:31] [ info] [cmetrics] version=0.3.7
[2023/09/06 16:58:31] [ info] [input:tail:tail.0] multiline core started
[2023/09/06 16:58:31] [ info] [input:tail:tail.1] multiline core started
[2023/09/06 16:58:31] [ info] [input:systemd:systemd.2] seek_cursor=s=2866bded7659498a96e0a8ad2beccf0a;i=7bf... OK
[2023/09/06 16:58:31] [ info] [input:tail:tail.3] multiline core started
[2023/09/06 16:58:31] [ info] [input:storage_backlog:storage_backlog.7] queue memory limit: 4.8M
[2023/09/06 16:58:31] [ info] [filter:kubernetes:kubernetes.0] https=1 host=127.0.0.1 port=10250
[2023/09/06 16:58:31] [ info] [filter:kubernetes:kubernetes.0] token updated
[2023/09/06 16:58:31] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2023/09/06 16:58:31] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with Kubelet...
[2023/09/06 16:58:31] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
[2023/09/06 16:58:31] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] worker #0 started
[2023/09/06 16:58:31] [ info] [output:cloudwatch_logs:cloudwatch_logs.1] worker #0 started
[2023/09/06 16:58:31] [ info] [output:cloudwatch_logs:cloudwatch_logs.2] worker #0 started
[2023/09/06 16:58:31] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2023/09/06 16:58:31] [ info] [sp] stream processor started
[2023/09/06 16:58:31] [ info] [input:storage_backlog:storage_backlog.7] register tail.0/1-1694019169.378350213.flb
[2023/09/06 16:58:31] [ info] [input:storage_backlog:storage_backlog.7] register tail.0/1-1694019170.248838268.flb
[2023/09/06 16:58:31] [ info] [input:storage_backlog:storage_backlog.7] register tail.0/1-1694019187.497863650.flb
[2023/09/06 16:58:31] [ info] [input:storage_backlog:storage_backlog.7] register tail.0/1-1694019215.460809884.flb
[2023/09/06 16:58:31] [ info] [input:storage_backlog:storage_backlog.7] register tail.0/1-1694019256.480162041.flb
[2023/09/06 16:58:31] [ info] [input:storage_backlog:storage_backlog.7] register tail.0/1-1694019339.487434952.flb
[2023/09/06 16:58:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=14152341 watch_fd=3 name=/var/log/containers/coredns-6ff9c46cd8-c2h4p_kube-system_coredns-3c90a2ab2fae2141af415dc72026004233e8de1f17edd6bbc2df9e68f607ed92.log
[2023/09/06 16:58:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=14152350 watch_fd=4 name=/var/log/containers/coredns-6ff9c46cd8-c2h4p_kube-system_coredns-4dec2ea945a6ee553ffd5ff906c33358f5bef079cf13809111db2f95cdc082cd.log
[2023/09/06 16:58:31] [engine] caught signal (SIGSEGV)
AWS for Fluent Bit Container Image Version 2.31.12.20230821
$ kubectl describe pod fluent-bit-8t7fk -n amazon-cloudwatch
Name: fluent-bit-8t7fk
Namespace: amazon-cloudwatch
Priority: 0
Service Account: fluent-bit
Node: ip-11-0-156-115.us-west-2.compute.internal/11.0.156.115
Start Time: Wed, 06 Sep 2023 09:52:48 -0700
Labels: controller-revision-hash=6cb6db756b
k8s-app=fluent-bit
kubernetes.io/cluster-service=true
pod-template-generation=1
version=v1
Annotations: <none>
Status: Running
IP: 11.0.156.115
IPs:
IP: 11.0.156.115
Controlled By: DaemonSet/fluent-bit
Containers:
fluent-bit:
Container ID: containerd://69bacf029e31a28f243e051c72d49edf24416173b9e9c88e67d2d393cd775af4
Image: public.ecr.aws/aws-observability/aws-for-fluent-bit:latest
Image ID: public.ecr.aws/aws-observability/aws-for-fluent-bit@sha256:2e52d17a2f34197707dcfe27626cfa45c30ced37999004638768bf533cd7f444
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 139
Started: Wed, 06 Sep 2023 10:14:01 -0700
Finished: Wed, 06 Sep 2023 10:14:01 -0700
Ready: False
Restart Count: 9
Limits:
memory: 200Mi
Requests:
cpu: 500m
memory: 100Mi
Environment:
AWS_REGION: <set to the key 'logs.region' of config map 'fluent-bit-cluster-info'> Optional: false
CLUSTER_NAME: <set to the key 'cluster.name' of config map 'fluent-bit-cluster-info'> Optional: false
HTTP_SERVER: <set to the key 'http.server' of config map 'fluent-bit-cluster-info'> Optional: false
HTTP_PORT: <set to the key 'http.port' of config map 'fluent-bit-cluster-info'> Optional: false
READ_FROM_HEAD: <set to the key 'read.head' of config map 'fluent-bit-cluster-info'> Optional: false
READ_FROM_TAIL: <set to the key 'read.tail' of config map 'fluent-bit-cluster-info'> Optional: false
HOST_NAME: (v1:spec.nodeName)
HOSTNAME: fluent-bit-8t7fk (v1:metadata.name)
CI_VERSION: k8s/1.3.16
Mounts:
/fluent-bit/etc/ from fluent-bit-config (rw)
/run/log/journal from runlogjournal (ro)
/var/fluent-bit/state from fluentbitstate (rw)
/var/lib/docker/containers from varlibdockercontainers (ro)
/var/log from varlog (ro)
/var/log/dmesg from dmesg (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-h2lml (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
fluentbitstate:
Type: HostPath (bare host directory volume)
Path: /var/fluent-bit/state
HostPathType:
varlog:
Type: HostPath (bare host directory volume)
Path: /var/log
HostPathType:
varlibdockercontainers:
Type: HostPath (bare host directory volume)
Path: /var/lib/docker/containers
HostPathType:
fluent-bit-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: fluent-bit-config
Optional: false
runlogjournal:
Type: HostPath (bare host directory volume)
Path: /run/log/journal
HostPathType:
dmesg:
Type: HostPath (bare host directory volume)
Path: /var/log/dmesg
HostPathType:
kube-api-access-h2lml:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: :NoExecute op=Exists
:NoSchedule op=Exists
node-role.kubernetes.io/master:NoSchedule op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 22m default-scheduler Successfully assigned amazon-cloudwatch/fluent-bit-8t7fk to ip-11-0-156-115.us-west-2.compute.internal
Normal Pulled 22m kubelet Successfully pulled image "public.ecr.aws/aws-observability/aws-for-fluent-bit:latest" in 160.524711ms (160.534769ms including waiting)
Normal Pulled 22m kubelet Successfully pulled image "public.ecr.aws/aws-observability/aws-for-fluent-bit:latest" in 180.919554ms (180.968511ms including waiting)
Normal Pulled 22m kubelet Successfully pulled image "public.ecr.aws/aws-observability/aws-for-fluent-bit:latest" in 166.16697ms (166.176944ms including waiting)
Normal Started 21m (x4 over 22m) kubelet Started container fluent-bit
Normal Pulled 21m kubelet Successfully pulled image "public.ecr.aws/aws-observability/aws-for-fluent-bit:latest" in 145.097853ms (145.109643ms including waiting)
Normal Pulling 21m (x5 over 22m) kubelet Pulling image "public.ecr.aws/aws-observability/aws-for-fluent-bit:latest"
Normal Created 21m (x5 over 22m) kubelet Created container fluent-bit
Normal Pulled 21m kubelet Successfully pulled image "public.ecr.aws/aws-observability/aws-for-fluent-bit:latest" in 153.296922ms (153.322186ms including waiting)
Warning BackOff 2m29s (x94 over 22m) kubelet
# fluent-bit.conf
fluent_bit_conf = """
[SERVICE]
Flush 5
Grace 30
Log_Level info
Daemon off
Parsers_File parsers.conf
HTTP_Server ${HTTP_SERVER}
HTTP_Listen 0.0.0.0
HTTP_Port ${HTTP_PORT}
storage.path /var/fluent-bit/state/flb-storage/
storage.sync normal
storage.checksum off
storage.backlog.mem_limit 5M
@INCLUDE application-log.conf
@INCLUDE dataplane-log.conf
@INCLUDE host-log.conf
"""
# application-log.conf
application_log_conf = """
[INPUT]
Name tail
Tag application.<namespace_name>.<deployment_name>.log
Tag_Regex ^\/var\/log\/containers\/(?<pod_name>[^_]+)_(?<namespace_name>[^_]+)_(?<deployment_name>[^_]+)-(?<docker_id>[a-z0-9]{64})\.log$
Exclude_Path /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
Path /var/log/containers/*.log
multiline.parser docker, cri
DB /var/fluent-bit/state/flb_container.db
Mem_Buf_Limit 50MB
Skip_Long_Lines On
Refresh_Interval 10
Rotate_Wait 30
storage.type filesystem
Read_from_Head ${READ_FROM_HEAD}
[INPUT]
Name tail
Tag application.*
Path /var/log/containers/cloudwatch-agent*
multiline.parser docker, cri
DB /var/fluent-bit/state/flb_cwagent.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
Read_from_Head ${READ_FROM_HEAD}
[FILTER]
Name kubernetes
Match application.*
Kube_URL https://kubernetes.default.svc:443
Kube_Tag_Prefix application.
Regex_Parser k8s-custom-tag
Merge_Log On
Merge_Log_Key log_processed
K8S-Logging.Parser On
K8S-Logging.Exclude Off
Labels Off
Annotations Off
Use_Kubelet On
Kubelet_Port 10250
Buffer_Size 0
[OUTPUT]
Name cloudwatch_logs
Match application.*
region ${AWS_REGION}
log_group_name /${CLUSTER_NAME}/application
log_stream_prefix log-
auto_create_group true
extra_user_agent container-insights
"""
# dataplane-log.conf
dataplane_log_conf = """
[INPUT]
Name systemd
Tag dataplane.systemd.*
Systemd_Filter _SYSTEMD_UNIT=docker.service
Systemd_Filter _SYSTEMD_UNIT=containerd.service
Systemd_Filter _SYSTEMD_UNIT=kubelet.service
DB /var/fluent-bit/state/systemd.db
Path /var/log/journal
Read_From_Tail ${READ_FROM_TAIL}
[INPUT]
Name tail
Tag dataplane.tail.*
Path /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
multiline.parser docker, cri
DB /var/fluent-bit/state/flb_dataplane_tail.db
Mem_Buf_Limit 50MB
Skip_Long_Lines On
Refresh_Interval 10
Rotate_Wait 30
storage.type filesystem
Read_from_Head ${READ_FROM_HEAD}
[FILTER]
Name modify
Match dataplane.systemd.*
Rename _HOSTNAME hostname
Rename _SYSTEMD_UNIT systemd_unit
Rename MESSAGE message
Remove_regex ^((?!hostname|systemd_unit|message).)*$
[FILTER]
Name aws
Match dataplane.*
imds_version v1
[OUTPUT]
Name cloudwatch_logs
Match dataplane.*
region ${AWS_REGION}
log_group_name /${CLUSTER_NAME}/dataplane
log_stream_prefix log-
auto_create_group true
extra_user_agent container-insights
"""
# host-log.conf
host_log_conf = """
[INPUT]
Name tail
Tag host.dmesg
Path /var/log/dmesg
Key message
DB /var/fluent-bit/state/flb_dmesg.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
Read_from_Head ${READ_FROM_HEAD}
[INPUT]
Name tail
Tag host.messages
Path /var/log/messages
Parser syslog
DB /var/fluent-bit/state/flb_messages.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
Read_from_Head ${READ_FROM_HEAD}
[INPUT]
Name tail
Tag host.secure
Path /var/log/secure
Parser syslog
DB /var/fluent-bit/state/flb_secure.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
Read_from_Head ${READ_FROM_HEAD}
[FILTER]
Name aws
Match host.*
imds_version v1
[OUTPUT]
Name cloudwatch_logs
Match host.*
region ${AWS_REGION}
log_group_name /${CLUSTER_NAME}/host
log_stream_prefix log-
auto_create_group true
extra_user_agent container-insights
"""
# parsers.conf
parsers_conf = r"""
[PARSER]
Name syslog
Format regex
Regex ^(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
Time_Key time
Time_Format %b %d %H:%M:%S
[PARSER]
Name container_firstline
Format regex
Regex (?<log>(?<="log":")\S(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%LZ
[PARSER]
Name cwagent_firstline
Format regex
Regex (?<log>(?<="log":")\d{4}[\/-]\d{1,2}[\/-]\d{1,2}[ T]\d{2}:\d{2}:\d{2}(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%LZ
[PARSER]
Name k8s-custom-tag
Format regex
Regex ^(?<namespace_name>[^.]+)\.(?<deployment_name>[^.]+)\.log$
"""
https://github.com/omidraha/pulumi_example/blob/main/cw/cw.py https://github.com/omidraha/pulumi_example/blob/main/cw/setup.py
Bug Report
Describe the bug
Over time I have observed that our fluent bit pods running 1.8.4 occasionally show pod restarts due to termination (kubernetes exit code 139). Interestingly, the restarts seem to only happen on a few pods in the daemonset:
To Reproduce
There doesn't seem to be anything specific that we do other than just running for a period of time.
Expected behavior
I suppose in an ideal world this error would not happen.
Your Environment
Additional context
It doesn't seem to affect us other than tripping our restart monitors. We're not sure why it's restarting (eg: is it something we're doing?)