Hmm, looks like this is related to the Elasticsearch output plugin.
Hi. I have this "error" too:
$ uname -a
Linux ott-1 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/issue
Ubuntu 16.04.1 LTS
Fluent-Bit v0.12.8
Copyright (C) Treasure Data
[2017/11/28 10:07:57] [ info] Configuration
[2017/11/28 10:07:57] [ info] flush time : 3 seconds
[2017/11/28 10:07:57] [ info] input plugins :
[2017/11/28 10:07:57] [ info] tail
[2017/11/28 10:07:57] [ info] tail
[2017/11/28 10:07:57] [ info] tail
[2017/11/28 10:07:57] [ info] collectors :
"took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors"....
This happens on shutdown and every 15 minutes.
Hi, I have the same problem a few minutes after a restart, and part of the logs were not imported into Elasticsearch. No errors in the Elasticsearch logs.
3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Dec 05 14:19:14 newhope-LOCUST2 td-agent-bit[13024]: Fluent-Bit v0.12.9
Dec 05 14:19:14 newhope-LOCUST2 td-agent-bit[13024]: Copyright (C) Treasure Data
Dec 05 14:19:14 newhope-LOCUST2 td-agent-bit[13024]: "took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""er
Dec 05 14:24:18 newhope-LOCUST2 td-agent-bit[13024]: k""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors
Dec 05 14:24:18 newhope-LOCUST2 td-agent-bit[13024]: rrors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""to
Config:
[SERVICE]
Flush 1
Log_Level info
Daemon off
Parsers_File parsers.conf
[INPUT]
Name systemd
Tag services.*
Systemd_Filter _SYSTEMD_UNIT=locust-master.service
Systemd_Filter _SYSTEMD_UNIT=locust-slave@1.service
Buffer_Chunk_Size 400k
[OUTPUT]
Name es-services
Match services.*
Host 10.91.118.85
Port 9200
Logstash_Format On
Logstash_Prefix logstash
    Retry_Limit False
Hello, I'm also facing this issue using the ES output.
Is this garbage reproducible with the latest v0.12.13?
v0.12.14 also produces the messages. Details are below.
kubectl logs fluentbit-mzs9k
+ exec /fluent-bit/bin/fluent-bit -c /fluent-bit/etc/fluent-bit.conf
[2018/02/19 02:37:35] [ info] [engine] started
[2018/02/19 02:37:35] [ info] [filter_kube] https=1 host=kubernetes.default.svc.cluster.local port=443
[2018/02/19 02:37:35] [ info] [filter_kube] local POD info OK
[2018/02/19 02:37:35] [ info] [filter_kube] testing connectivity with API server...
[2018/02/19 02:37:35] [error] [filter_kube] could not get meta for POD taco-aio
Fluent-Bit v0.12.14
Copyright (C) Treasure Data
"took""errors"............."took""errors""too2018-02-19T02:47:44.134022547Z k""errors""took"........."errors""took""e2018-02-19T02:51:50.062630753Z rrors""took"........"errors""took""error2018-02-19T02:55:57.010085357Z
We faced the same issue. I built fluent-bit with debugging symbols, and here is what happened:
"took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors"init vars finishing
fluent-bit: /tmp/src/lib/monkey/deps/flb_libco/amd64.c:121: crash: Assertion `0' failed.
Aborted (core dumped)
core file analysis:
(gdb) thread apply all bt
Thread 6 (Thread 0x7ff5e09ff700 (LWP 26)):
#0 0x00007ff5e3a1824d in nanosleep () at ../sysdeps/unix/syscall-template.S:84
#1 0x00007ff5e3a1819a in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#2 0x0000559ba72c0cd1 in mk_clock_worker_init (data=0x7ff5e261c000) at /tmp/src/lib/monkey/mk_server/mk_clock.c:112
#3 0x00007ff5e4ae3494 in start_thread (arg=0x7ff5e09ff700) at pthread_create.c:333
#4 0x00007ff5e3a48aff in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
Thread 5 (Thread 0x7ff5e01fe700 (LWP 27)):
#0 0x00007ff5e3a490f3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84
#1 0x0000559ba72c7e2b in _mk_event_wait (loop=0x7ff5df622000) at /tmp/src/lib/monkey/mk_core/mk_event_epoll.c:340
#2 0x0000559ba72c811b in mk_event_wait (loop=0x7ff5df622000) at /tmp/src/lib/monkey/mk_core/mk_event.c:163
#3 0x0000559ba72c1d86 in mk_server_worker_loop (server=0x7ff5e261c000) at /tmp/src/lib/monkey/mk_server/mk_server.c:435
#4 0x0000559ba72b93ba in mk_sched_launch_worker_loop (data=0x7ff5e0a0d008) at /tmp/src/lib/monkey/mk_server/mk_scheduler.c:411
#5 0x00007ff5e4ae3494 in start_thread (arg=0x7ff5e01fe700) at pthread_create.c:333
#6 0x00007ff5e3a48aff in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
Thread 4 (Thread 0x7ff5e1dfe700 (LWP 24)):
#0 0x00007ff5e3a1824d in nanosleep () at ../sysdeps/unix/syscall-template.S:84
#1 0x00007ff5e3a1819a in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#2 0x0000559ba70a21c6 in time_ticker (args=0x7ff5e262c030) at /tmp/src/plugins/filter_throttle/throttle.c:83
#3 0x00007ff5e4ae3494 in start_thread (arg=0x7ff5e1dfe700) at pthread_create.c:333
#4 0x00007ff5e3a48aff in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
Thread 3 (Thread 0x7ff5e15fd700 (LWP 25)):
#0 0x00007ff5e3a490f3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84
#1 0x0000559ba72c7e2b in _mk_event_wait (loop=0x7ff5e262c318) at /tmp/src/lib/monkey/mk_core/mk_event_epoll.c:340
#2 0x0000559ba72c811b in mk_event_wait (loop=0x7ff5e262c318) at /tmp/src/lib/monkey/mk_core/mk_event.c:163
#3 0x0000559ba72b13cc in mk_lib_worker (data=0x7ff5e262c330) at /tmp/src/lib/monkey/mk_server/mk_lib.c:150
#4 0x00007ff5e4ae3494 in start_thread (arg=0x7ff5e15fd700) at pthread_create.c:333
#5 0x00007ff5e3a48aff in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
Thread 2 (Thread 0x7ff5e25ff700 (LWP 23)):
#0 0x00007ff5e3a490f3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84
#1 0x0000559ba72c7e2b in _mk_event_wait (loop=0x7ff5e262c018) at /tmp/src/lib/monkey/mk_core/mk_event_epoll.c:340
#2 0x0000559ba72c811b in mk_event_wait (loop=0x7ff5e262c018) at /tmp/src/lib/monkey/mk_core/mk_event.c:163
#3 0x0000559ba70430f4 in log_worker_collector (data=0x7ff5e266e060) at /tmp/src/src/flb_log.c:122
#4 0x0000559ba7053039 in step_callback (data=0x7ff5e2677070) at /tmp/src/src/flb_worker.c:50
#5 0x00007ff5e4ae3494 in start_thread (arg=0x7ff5e25ff700) at pthread_create.c:333
#6 0x00007ff5e3a48aff in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
Thread 1 (Thread 0x7ff5e4e82d00 (LWP 22)):
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ff5e39943fa in __GI_abort () at abort.c:89
#2 0x00007ff5e398be37 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x559ba7370fa3 "0",
file=file@entry=0x559ba7370f78 "/tmp/src/lib/monkey/deps/flb_libco/amd64.c", line=line@entry=121,
function=function@entry=0x559ba7370fa5 <__PRETTY_FUNCTION__.2983> "crash") at assert.c:92
#3 0x00007ff5e398bee2 in __GI___assert_fail (assertion=0x559ba7370fa3 "0", file=0x559ba7370f78 "/tmp/src/lib/monkey/deps/flb_libco/amd64.c", line=121,
function=0x559ba7370fa5 <__PRETTY_FUNCTION__.2983> "crash") at assert.c:101
#4 0x0000559ba72ca32a in crash () at /tmp/src/lib/monkey/deps/flb_libco/amd64.c:121
#5 0x0000000000000000 in ?? ()
Core file attached: https://www.dropbox.com/s/e844qewbr5s1i3q/core-fluent-bit?dl=0
fluent-bit was compiled from master with the ES index strftime patch introduced in PR https://github.com/fluent/fluent-bit/pull/512.
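For reference on the assertion itself: libco installs a crash() guard as the return address of every coroutine, so if a coroutine entry point ever returns instead of switching back with co_switch(), control lands in crash() and the assert(0) fires, which matches frames #4/#5 in the backtrace above. Below is a hedged sketch of that guard, modeled on upstream libco's amd64 backend rather than the exact vendored flb_libco sources.

```c
/* Hedged sketch of the guard behind "amd64.c:121: crash: Assertion `0' failed",
 * modeled on upstream libco's amd64 backend; the vendored flb_libco copy may
 * differ in detail. */
#include <assert.h>
#include <stdlib.h>

typedef void *cothread_t;

static void crash(void)
{
    assert(0); /* reached only if a cothread entry point returns */
}

/* co_create() plants crash() as the fake return address on top of the new
 * coroutine's stack, so a plain "return" from the entry point jumps into it. */
static cothread_t co_create_sketch(unsigned int size, void (*entrypoint)(void))
{
    cothread_t handle = malloc(size);
    if (handle) {
        long *p = (long *)((char *)handle + size); /* top of the new stack        */
        *--p = (long)crash;                        /* crash if entrypoint returns */
        *--p = (long)entrypoint;                   /* start of function           */
        *(long *)handle = (long)p;                 /* saved stack pointer         */
    }
    return handle;
}

static void dummy_entry(void) { /* never actually run in this sketch */ }

int main(void)
{
    /* Illustration only: build the context, never switch into it. */
    cothread_t co = co_create_sketch(64 * 1024, dummy_entry);
    free(co);
    return 0;
}
```

Whether that crash is related to the garbage output is discussed further down in the thread.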
I've tried to reproduce the "garbage on exit" issue without success.
Would you please provide the exact steps to reproduce it?
Unfortunately, I can't reproduce it reliably either. It happens with a reasonably high message rate (more than 2000/sec in our case) and the es output plugin, on average about once per 24 hours. We have a bunch of other fluent-bit instances with a lower message rate, and it doesn't happen there.
I can try to gather whatever information is needed if you tell me what you are interested in. I can provide a core file or anything else you request, if possible.
We see this error with fluent-bit 0.13 from the official docker image fluent/fluent-bit-0.13-dev:0.13. We have the Elasticsearch output plugin in use.
I can't find any evidence that this occurs with throughput spikes. CPU, memory, and throughput have been at baseline levels: about 230 log entries per minute, coming from 5 fluent-bit instances.
I have the same issue with both v0.12.14 and v0.12.15, running on OpenShift with an external Elasticsearch 6.2. All pods in the cluster show the same output, e.g.:
oc logs fluent-bit-dbscm
[2018/03/31 03:29:11] [ info] [engine] started
[2018/03/31 03:29:11] [ info] [filter_kube] https=1 host=kubernetes.default.svc port=443
[2018/03/31 03:29:11] [ info] [filter_kube] local POD info OK
[2018/03/31 03:29:11] [ info] [filter_kube] testing connectivity with API server...
[2018/03/31 03:29:12] [ info] [filter_kube] API server connectivity OK
Fluent-Bit v0.12.5
Copyright (C) Treasure Data
"took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took"
This affects all pods
I've tried to reproduce this problem without success.
When this happens, is there any special condition happening on the remote Elasticsearch server, like a premature TCP close, a network outage or similar?
My guess is that it happens when there are too many retries on sending logs, e.g. due to a wrong time format.
Does it happen only on shutdown, or also at runtime?
I don't know whether it is a shutdown or whether this message results in the pod being killed, but after this message there is always a restart.
I run more than 10 fluent-bit instances on k8s clusters, but only one (pod name = fluentbit-x8ddc) generates this message. I'm not sure what is different. To sum up the situation: for some reason fluent-bit generates garbage and records it as a log entry every three minutes.
@edsiper Using fluent/fluent-bit-0.13-dev:0.14, on certain pods only, I am getting the same error (e.g. 344 restarts over two days):
fluent-bit: /tmp/src/lib/monkey/deps/flb_libco/amd64.c:121: crash: Assertion `0' failed.
Fluent-Bit v0.13.0
Copyright (C) Treasure Data
I am using Fluent Bit without the Elasticsearch output, but with two HTTP outputs configured. For instance, from my fluent-bit-configmap.yaml:
[OUTPUT]
Name http
Match kube.*
Host 127.0.0.1
Format json
Port 3000
URI /
[OUTPUT]
Name http
Match kube.*
Host 127.0.0.1
Format json
Port 4000
URI /
I made HTTP web services that produce to Kafka and Elasticsearch in order to rule out those output plugins' execution paths, since I could not get meaningful error messages.
The logs aren't helpful, but the timing between startup and last success is very close to the three minutes mentioned by @intelliguy:
[2018/04/05 14:46:18] [ info] [engine] started
[2018/04/05 14:46:18] [ info] [filter_kube] https=1 host=kubernetes.default.svc.cluster.local port=443
[2018/04/05 14:46:18] [ info] [filter_kube] local POD info OK
[2018/04/05 14:46:18] [ info] [filter_kube] testing connectivity with API server...
[2018/04/05 14:46:23] [ info] [filter_kube] API server connectivity OK
[2018/04/05 14:46:23] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
... skipping successes ...
[2018/04/05 14:49:02] [ info] [out_http] HTTP STATUS=200
@StevenACoffman @intelliguy @jgsqware
When connecting to Elasticsearch, are you using plain HTTP communication or HTTPS (TLS)? And which specific Elasticsearch version is each of you using?
I am not connecting to ElasticSearch at all, and still get the error. I am using plain HTTP to talk to another container I added to the pod. One HTTP port is a container that writes log files to AWS S3 storage. The other processes the logs and sends them on to Kafka.
Mine may be an unrelated problem, but the error message was very similar, so I thought I would mention it here in the hope that there was a common root cause and this was a clue you could use.
Is there any TLS involved or just plain HTTP?
For us, it's plain http
Plain HTTP for us, no TLS.
Update: I am finally able to reproduce the garbage without stopping the service (kill -15 PID). Work in progress.
I've found the guilty code:
https://github.com/fluent/fluent-bit/blob/master/plugins/out_es/es.c#L409
and its author:
$ git blame ../plugins/out_es/es.c -L409,+1
81320ba8a (Eduardo Silva 2017-07-13 22:41:51 -0600 409) msgpack_object_print(stdout, key);
It was a debug line that was not removed. It's not easy to catch since it does an fprintf() to the buffered stdout, so the message only becomes visible once the buffer is flushed.
Fixed by 02d9505e
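To illustrate why this was hard to spot: the stray msgpack_object_print(stdout, key) writes the top-level keys of the Elasticsearch bulk response ("took", "errors", ...) to the buffered stdout stream with no newline and no flush, so the text only surfaces much later, interleaved with other log lines. Below is a minimal standalone sketch of that effect, assuming the msgpack-c API; it is not Fluent Bit's actual out_es response-handling code.

```c
/* Minimal sketch (assumes msgpack-c, link with -lmsgpackc); not Fluent Bit's
 * actual out_es code. It prints only the keys of an Elasticsearch-bulk-style
 * response map to buffered stdout, which is what produced the delayed
 * "took""errors" garbage reported in this issue. */
#include <msgpack.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Pack a map that mimics the top-level ES bulk response: {"took": 3, "errors": false} */
    msgpack_sbuffer sbuf;
    msgpack_packer pk;
    msgpack_sbuffer_init(&sbuf);
    msgpack_packer_init(&pk, &sbuf, msgpack_sbuffer_write);
    msgpack_pack_map(&pk, 2);
    msgpack_pack_str(&pk, 4);
    msgpack_pack_str_body(&pk, "took", 4);
    msgpack_pack_int(&pk, 3);
    msgpack_pack_str(&pk, 6);
    msgpack_pack_str_body(&pk, "errors", 6);
    msgpack_pack_false(&pk);

    /* Unpack it again and walk the map, roughly like the plugin walks the parsed response */
    msgpack_unpacked result;
    msgpack_unpacked_init(&result);
    size_t off = 0;
    if (msgpack_unpack_next(&result, sbuf.data, sbuf.size, &off) == MSGPACK_UNPACK_SUCCESS) {
        msgpack_object map = result.data;
        for (uint32_t i = 0; i < map.via.map.size; i++) {
            /* The leftover debug call: each key goes to buffered stdout, no newline, no flush */
            msgpack_object_print(stdout, map.via.map.ptr[i].key);
        }
    }

    /* Without an explicit fflush(stdout) here, the "took""errors" text may only
     * appear when the stdio buffer finally drains. */
    msgpack_unpacked_destroy(&result);
    msgpack_sbuffer_destroy(&sbuf);
    return 0;
}
```

Since stdio buffers are also flushed at process exit or once they fill up, this would explain why several reports above only see the garbage at shutdown or at long intervals.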
All: if you see the message amd64.c:121: crash: Assertion `0' failed somewhere after this fix (in the new images that will be available), please let me know ASAP. I am not confident that error is associated with the garbage message recently fixed.
@onorua
regarding the core file provided via Dropbox, do you have the exact fluent-bit binary that you built and that was used to generate the core file?
@edsiper In my configuration I am not using the ElasticSearch output, and with the image fluent/fluent-bit-0.13-dev:0.15 (aka fluent/fluent-bit-0.13-dev@sha256:a4f22b6de839b5ca88e7092f8c238ecd983c1fc73288bdaabdf0ce67e6c3252c) configured (as described above) to use only the http output plugin (twice) over plain HTTP (no TLS) to a container in the same pod, I just now got:
[2018/04/07 01:44:14] [ info] [out_http] HTTP STATUS=200
[2018/04/07 01:44:15] [ info] [out_http] HTTP STATUS=200
fluent-bit: /tmp/src/lib/monkey/deps/flb_libco/amd64.c:121: crash: Assertion `0' failed.
Fluent-Bit v0.13.0
Copyright (C) Treasure Data
[engine] caught signal
#0 0x7f6defdfa529 in ???() at ???:0
#1 0x7f6defdf1e66 in ???() at ???:0
#2 0x7f6defdf1f11 in ???() at ???:0
#3 0x55994d80c389 in ???() at ???:0
#4 0xffffffffffffffff in ???() at ???:0
Again, this may be a separate issue.
By the way, I am using this to receive the termination message:
kubectl logs -n logging fluent-bit-k9phx -c fluent-bit --previous
Awesome for the garbage log fix.
We received this message too:
/tmp/src/lib/monkey/deps/flb_libco/amd64.c:121: crash: Assertion `0' failed.
@jgsqware To clarify, you get that error message with fluent/fluent-bit-0.13-dev:0.15, correct?
Also, are you using the ElasticSearch output plugin? Are you also using any other output plugins (such as HTTP)?
I've filed #557 to troubleshoot/fix the crash issue. Since the garbage problem is fixed and releases are available, I am closing this ticket:
Whenever I send a SIGTERM to fluent-bit, I get this output:
Seems harmless otherwise, but looks like some sort of buffer overrun or something.