fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0
5.65k stars 1.54k forks source link

Fluent Bit v1.5.0 stops outputting logs to Elasticsearch #2416

Closed huanggze closed 3 years ago

huanggze commented 4 years ago

Bug Report

Describe the bug Fluent Bit stops outputting logs to Elasticsearch. By the last log record, it seems to stop at 2020/08/03 10:31:01.

[2020/08/03 07:08:24] [ warn] [input] tail.0 paused (mem buf overlimit)
[2020/08/03 07:08:25] [ info] [engine] flush chunk '13-1596438486.926616801.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:25] [ info] [engine] flush chunk '13-1596438487.744638715.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:25] [ info] [engine] flush chunk '13-1596438488.159657954.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:25] [ info] [input] tail.0 resume (mem buf overlimit)
[2020/08/03 07:08:25] [ warn] [input] tail.0 paused (mem buf overlimit)
[2020/08/03 07:08:25] [ info] [engine] flush chunk '13-1596438487.762963871.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:25] [ info] [engine] flush chunk '13-1596438487.980200994.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:25] [ info] [engine] flush chunk '13-1596438487.884800120.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:25] [ info] [engine] flush chunk '13-1596438488.605624992.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:25] [ info] [input] tail.0 resume (mem buf overlimit)
[2020/08/03 07:08:25] [ warn] [input] tail.0 paused (mem buf overlimit)
[2020/08/03 07:08:25] [ info] [engine] flush chunk '13-1596438487.959545222.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:25] [ info] [engine] flush chunk '13-1596438487.867642032.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:25] [ info] [engine] flush chunk '13-1596438488.387166083.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:25] [ info] [input] tail.0 resume (mem buf overlimit)
[2020/08/03 07:08:25] [ warn] [input] tail.0 paused (mem buf overlimit)
[2020/08/03 07:08:25] [ info] [engine] flush chunk '13-1596438488.481110928.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:25] [ info] [input] tail.0 resume (mem buf overlimit)
[2020/08/03 07:08:25] [ warn] [input] tail.0 paused (mem buf overlimit)
[2020/08/03 07:08:25] [ info] [engine] flush chunk '13-1596438488.270473808.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:25] [ info] [input] tail.0 resume (mem buf overlimit)
[2020/08/03 07:08:25] [ warn] [input] tail.0 paused (mem buf overlimit)
[2020/08/03 07:08:25] [ info] [engine] flush chunk '13-1596438487.502466470.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:25] [ info] [input] tail.0 resume (mem buf overlimit)
[2020/08/03 07:08:25] [ warn] [input] tail.0 paused (mem buf overlimit)
[2020/08/03 07:08:25] [ info] [engine] flush chunk '13-1596438488.322912790.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:25] [ info] [input] tail.0 resume (mem buf overlimit)
[2020/08/03 07:08:25] [ warn] [input] tail.0 paused (mem buf overlimit)
[2020/08/03 07:08:26] [ info] [engine] flush chunk '13-1596438488.203691480.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:26] [ info] [engine] flush chunk '13-1596438487.100592593.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:26] [ info] [engine] flush chunk '13-1596438486.905915646.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:26] [ info] [engine] flush chunk '13-1596438486.962462551.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:26] [ info] [input] tail.0 resume (mem buf overlimit)
[2020/08/03 07:08:26] [ info] [engine] flush chunk '13-1596438488.694434043.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:26] [ warn] [input] tail.0 paused (mem buf overlimit)
[2020/08/03 07:08:26] [ info] [engine] flush chunk '13-1596438488.509999400.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:26] [ info] [input] tail.0 resume (mem buf overlimit)
[2020/08/03 07:08:26] [ info] [engine] flush chunk '13-1596438488.412243440.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:26] [ warn] [input] tail.0 paused (mem buf overlimit)
[2020/08/03 07:08:26] [ info] [engine] flush chunk '13-1596438486.813759156.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:26] [ info] [input] tail.0 resume (mem buf overlimit)
[2020/08/03 07:08:26] [ warn] [input] tail.0 paused (mem buf overlimit)
[2020/08/03 07:08:26] [ info] [input] tail.0 resume (mem buf overlimit)
[2020/08/03 07:08:26] [ warn] [input] tail.0 paused (mem buf overlimit)
[2020/08/03 07:08:26] [ info] [engine] flush chunk '13-1596438488.450451272.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:26] [ info] [input] tail.0 resume (mem buf overlimit)
[2020/08/03 07:08:26] [ info] [engine] flush chunk '13-1596438486.990441151.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:26] [ info] inotify_fs_add(): inode=3545357 watch_fd=109 name=/var/log/containers/istio-pilot-67fd55d974-hz9vd_istio-system_discovery-2290792dd6b85fff351db8bbd7cae240f525d1a6b6db534b1c8541aa10f08a64.log
[2020/08/03 07:08:26] [ info] [engine] flush chunk '13-1596438487.471473555.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.1
[2020/08/03 07:08:26] [ info] inotify_fs_add(): inode=1706454 watch_fd=110 name=/var/log/containers/calico-node-x6jcr_kube-system_calico-node-0846ae859249e037ec63bdac1b81fdaf2bf00050c07cdc0608aed8d1d4399a19.log
[2020/08/03 07:08:36] [ info] inotify_fs_add(): inode=3545389 watch_fd=111 name=/var/log/containers/istio-telemetry-57fc886bf8-vzdhv_istio-system_mixer-5aee042c6495dd405d161b7a207a3339091705376ae52641416820a0ec988c30.log

...A bunch of inotify_fs_add logs...

[2020/08/03 08:46:26] [ info] [input:tail:tail.2] inode=3412842 handle rotation(): /var/log/containers/ks-events-exporter-f477d7595-7q7rt_kubesphere-logging-system_events-exporter-d75a30109785cb0e2544854027721e2bd461ce17bd9e5907d2fc0a86bbd95789.log => /var/lib/docker/containers/d75a30109785cb0e2544854027721e2bd461ce17bd9e5907d2fc0a86bbd95789/d75a30109785cb0e2544854027721e2bd461ce17bd9e5907d2fc0a86bbd95789-json.log.1
[2020/08/03 08:46:26] [ info] inotify_fs_remove(): inode=3412842 watch_fd=1
[2020/08/03 08:46:26] [ info] inotify_fs_add(): inode=3412842 watch_fd=2 name=/var/lib/docker/containers/d75a30109785cb0e2544854027721e2bd461ce17bd9e5907d2fc0a86bbd95789/d75a30109785cb0e2544854027721e2bd461ce17bd9e5907d2fc0a86bbd95789-json.log.1
[2020/08/03 08:46:26] [ info] inotify_fs_add(): inode=3412812 watch_fd=3 name=/var/log/containers/ks-events-exporter-f477d7595-7q7rt_kubesphere-logging-system_events-exporter-d75a30109785cb0e2544854027721e2bd461ce17bd9e5907d2fc0a86bbd95789.log
[2020/08/03 08:46:31] [ info] inotify_fs_remove(): inode=3412842 watch_fd=2
[2020/08/03 08:51:46] [error] [src/flb_http_client.c:1077 errno=32] Broken pipe
[2020/08/03 08:51:46] [ warn] [output:es:es.0] http_do=-1 URI=/_bulk
[2020/08/03 08:51:46] [ warn] [engine] failed to flush chunk '13-1596444702.499778635.flb', retry in 10 seconds: task_id=1, input=tail.2 > output=es.0
[2020/08/03 08:51:56] [ info] [engine] flush chunk '13-1596444702.499778635.flb' succeeded at retry 1: task_id=1, input=tail.2 > output=es.0
[engine] caught signal (SIGSEGV)
[2020/08/03 10:31:01] [error] [src/flb_http_client.c:1077 errno=32] Broken pipe
[2020/08/03 10:31:01] [ warn] [output:es:es.0] http_do=-1 URI=/_bulk

Expected behavior

Screenshots

Your Environment

[Service] Parsers_File parsers.conf [Input] Name tail Path /var/log/containers/.log Exclude_Path /var/log/containers/_kubesphere-logging-system_events-exporter.log,/var/log/containers/kube-auditing-webhook_kubesphere-logging-system_kube-auditing-webhook.log Refresh_Interval 10 Skip_Long_Lines true DB /fluent-bit/tail/pos.db DB.Sync Normal Mem_Buf_Limit 5MB Parser docker Tag kube. [Input] Name tail Path /var/log/containers/kube-auditing-webhook_kubesphere-logging-system_kube-auditing-webhook.log Refresh_Interval 10 Skip_Long_Lines true DB /fluent-bit/tail/pos-auditing.db DB.Sync Normal Mem_Buf_Limit 5MB Parser docker Tag kube_auditing [Input] Name tail Path /var/log/containers/_kubesphere-logging-system_events-exporter.log Refresh_Interval 10 Skip_Long_Lines true DB /fluent-bit/tail/pos-events.db DB.Sync Normal Mem_Buf_Limit 5MB Parser docker Tag kube_events [Filter] Name kubernetes Match kube. Kube_URL https://kubernetes.default.svc:443 Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token Labels false Annotations false [Filter] Name nest Match kube. Operation lift Nested_under kubernetes Addprefix kubernetes [Filter] Name modify Match kube. Remove stream Remove kubernetes_pod_id Remove kubernetes_host Remove kubernetes_container_hash [Filter] Name nest Match kube. Operation nest Wildcard kubernetes_ Nest_under kubernetes Removeprefix kubernetes [Filter] Name parser Match kube_auditing Key_Name log Parser json [Filter] Name modify Match kube_auditing Condition Key_does_not_exist AuditID
Add ignore true [Filter] Name modify Match kube_auditing Remove Trace [Filter] Name grep Match kube_auditing Exclude ignore true [Filter] Name parser Match kube_events Key_Name log Parser json [Output] Name es Match kube_events Host elasticsearch-logging-data.kubesphere-logging-system.svc Port 9200 Logstash_Format true Logstash_Prefix ks-logstash-events [Output] Name es Match kube.
Host elasticsearch-logging-data.kubesphere-logging-system.svc Port 9200 Logstash_Format true Logstash_Prefix ks-logstash-log Time_Key @timestamp [Output] Name es Match kube_auditing Host elasticsearch-logging-data.kubesphere-logging-system.svc Port 9200 Logstash_Format true Logstash_Prefix ks-logstash-auditing


* Environment name and version (e.g. Kubernetes? What version?): 
* Server type and version: Kubernetes v1.17.9
* Operating System and version: Linux ks-allinone 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
* Filters and plugins:

**Additional context**
<!--- How has this issue affected you? What are you trying to accomplish? -->
<!--- Providing context helps us come up with a solution that is most useful in the real world -->
rumanzo commented 4 years ago

The same with version 1.5.2 Build options:

-DFLB_JEMALLOC=Yes \
-DFLB_TLS=Yes \
-DFLB_STREAM_PROCESSOR=No \
-DFLB_EXAMPLES=No \
-DFLB_SIGNV4=No \
-DFLB_PROXY_GO=No \
-DFLB_RECORD_ACCESSOR=Off \
-DFLB_AWS=No \
-DFLB_DEV=Yes \
-DFLB_METRICS=Yes \
-DFLB_HTTP_SERVER=Yes \
-DFLB_INOTIFY=Yes \
-DFLB_IN_COLLECTD=No \
-DFLB_IN_CPU=No \
-DFLB_IN_DISK=No \
-DFLB_IN_DOCKER=No \
-DFLB_IN_DOCKER_EVENTS=No \
-DFLB_IN_DUMMY=No \
-DFLB_IN_EMITTER=No \
-DFLB_IN_EXEC=No \
-DFLB_IN_HEAD=No \
-DFLB_IN_HEALTH=No \
-DFLB_IN_HTTP=No \
-DFLB_IN_KMSG=No \
-DFLB_IN_LIB=No \
-DFLB_IN_MQTT=No \
-DFLB_IN_NETIF=No \
-DFLB_IN_PROC=No \
-DFLB_IN_RANDOM=No \
-DFLB_IN_SERIAL=No \
-DFLB_IN_STATSD=No \
-DFLB_IN_STDIN=No \
-DFLB_IN_STORAGE_BACKLOG=No \
-DFLB_IN_SYSTEMD=No \
-DFLB_IN_TAIL=Yes \
-DFLB_IN_TCP=No \
-DFLB_IN_THERMAL=No \
-DFLB_IN_WINLOG=No \
-DFLB_OUT_ES=Yes \
-DFLB_OUT_FORWARD=Yes \
-DFLB_OUT_AZURE=No \
-DFLB_OUT_BIGQUERY=No \
-DFLB_OUT_COUNTER=No \
-DFLB_OUT_DATADOG=No \
-DFLB_OUT_EXIT=No \
-DFLB_OUT_GELF=No \
-DFLB_OUT_HTTP=No \
-DFLB_OUT_INFLUXDB=No \
-DFLB_OUT_NATS=No \
-DFLB_OUT_NRLOGS=No \
-DFLB_OUT_TCP=No \
-DFLB_OUT_PLOT=No \
-DFLB_OUT_FILE=No \
-DFLB_OUT_TD=No \
-DFLB_OUT_RETRY=No \
-DFLB_OUT_PGSQL=No \
-DFLB_OUT_SLACK=No \
-DFLB_OUT_SPLUNK=No \
-DFLB_OUT_STACKDRIVER=No \
-DFLB_OUT_STDOUT=No \
-DFLB_OUT_SYSLOG=No \
-DFLB_OUT_LIB=No \
-DFLB_OUT_NULL=No \
-DFLB_OUT_FLOWCOUNTER=No \
-DFLB_OUT_LOGDNA=No \
-DFLB_OUT_KAFKA=No \
-DFLB_OUT_KAFKA_REST=No \
-DFLB_OUT_CLOUDWATCH_LOGS=No \
-DFLB_FILTER_KUBERNETES=Yes \
-DFLB_FILTER_ALTER_SIZE=No \
-DFLB_FILTER_AWS=No \
-DFLB_FILTER_EXPECT=No \
-DFLB_FILTER_GREP=No \
-DFLB_FILTER_MODIFY=No \
-DFLB_FILTER_STDOUT=No \
-DFLB_FILTER_PARSER=No \
-DFLB_FILTER_REWRITE_TAG=No \
-DFLB_FILTER_THROTTLE=No \
-DFLB_FILTER_THROTTLE_SIZE=No \
-DFLB_FILTER_NEST=No \
-DFLB_FILTER_LUA=No \
-DFLB_FILTER_RECORD_MODIFIER=No \

Last log with log_level trace:

[2020/08/06 17:09:13] [debug] [upstream] KA connection #233 to elkserver:9200 has been assigned (recycled)
[2020/08/06 17:09:13] [trace] [io thread=0x7fb45bc7c0a0] [net_write] trying 130 bytes
[2020/08/06 17:09:13] [trace] [io thread=0x7fb45bc7c0a0] [fd 233] write_async(2)=-1 (0/130)
[2020/08/06 17:09:13] [trace] [io thread=0x7fb45bc7c0a0] [net_write] ret=-1 total=0/130
[2020/08/06 17:09:13] [error] [src/flb_http_client.c:1077 errno=25] Inappropriate ioctl for device
[2020/08/06 17:09:13] [ warn] [output:es:es.0] http_do=-1 URI=/_bulk
[2020/08/06 17:09:13] [trace] [upstream] destroy connection #-1 to elkserver:9200
[2020/08/06 17:09:13] [trace] [engine] resuming thread=0x7fb45bc7a6c0
[2020/08/06 17:09:13] [trace] [io thread=0x7fb45bc7a6c0] [fd 228] write_async(2)=-1 (99912/3403780)
[2020/08/06 17:09:13] [trace] [io thread=0x7fb45bc7a6c0] [net_write] ret=-1 total=0/3403780
[2020/08/06 17:09:13] [error] [src/flb_http_client.c:1086 errno=25] Inappropriate ioctl for device
[2020/08/06 17:09:13] [ warn] [output:es:es.0] http_do=-1 URI=/_bulk
[2020/08/06 17:09:13] [trace] [upstream] destroy connection #-1 to elkserver:9200
[engine] caught signal ([2020/08/06 17:09:13] [debug] [upstream] KA connection #-1 to elkserver:9200 has been disconnected by the remote service
SIGSEGV)
[2020/08/06 17:09:13] [trace] [upstream] destroy connection #-1 to elkserver:9200
ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)
kenjapkowalczyk commented 4 years ago

Same issue on Fluent Bit v1.5.2

edsiper commented 3 years ago

I think the sigsegv reported is a known bug that was already fixed in the latest version, please upgrade to v1.5.6:

https://fluentbit.io/announcements/v1.5.6/

rumanzo commented 3 years ago

Not everything is fixed. We updated to the latest version (1.5.6), fluentbit stops sending logs suddenly and without any logs with log_level debug. Just stops the activity and that's it

edsiper commented 3 years ago

just got some info that this is hitting the news. The original root cause of the issue was fixed long time ago. For any other ingestion issue let's open a new ticket so we can troubleshoot.

(my bad for not closing the ticket while back in Sep 2020)