Open hankwallace opened 2 years ago
[INPUT]
Name forward
unix_path /var/run/fluent.sock
Mem_Buf_Limit 2MB
Just checking- so you are using this blog/example? https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/oomkill-prevention
Also, are any logs successfully making it to Loki? Here it mostly looks like every request fails, which suggests this is mainly a network failure. Can you curl the Loki endpoint from an instance inside the same subnet?
Since you indicated it's crashing/stopping unexpectedly, this is the technique for getting a stack trace so we can take it upstream and fix it: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#segfaults-and-crashes-sigsegv
Yes, I looked at that blog/example along with many (many, many) others.
Yes, some logs are successfully making it to Loki. In the example I pasted above, the service was sending logs for about 90 minutes before it failed. Curling the Loki endpoint still works. That service is one of more than 10 on the same subnet, and all of the others continued working even after this one failed. Neither the time it takes to fail nor the service it fails on is repeatable.
I'll see what I can do re: getting the stack trace.
I'll post more of the log when it crashes/exits, but on startup I see a bunch of "invalid read" and "invalid write" now.
-----------------------------------------------------------------------------------------------------------------------------------
| timestamp | message |
|---------------|-----------------------------------------------------------------------------------------------------------------|
| 1660676815625 | ==1== Memcheck, a memory error detector |
| 1660676815625 | ==1== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al. |
| 1660676815625 | ==1== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info |
| 1660676815625 | ==1== Command: /fluent-bit/bin/fluent-bit -c /fluent-bit/etc/fluent-bit.conf |
| 1660676815625 | ==1== |
| 1660676816903 | Fluent Bit v1.8.8 |
| 1660676816906 | * Copyright (C) 2019-2021 The Fluent Bit Authors |
| 1660676816908 | * Copyright (C) 2015-2018 Treasure Data |
| 1660676816908 | * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd |
| 1660676816908 | * https://fluentbit.io |
| 1660676817077 | [2022/08/16 19:06:57] [ info] Configuration: |
| 1660676817092 | [2022/08/16 19:06:57] [ info] flush time | 5.000000 seconds |
| 1660676817093 | [2022/08/16 19:06:57] [ info] grace | 30 seconds |
| 1660676817093 | [2022/08/16 19:06:57] [ info] daemon | 0 |
| 1660676817094 | [2022/08/16 19:06:57] [ info] ___________ |
| 1660676817095 | [2022/08/16 19:06:57] [ info] inputs: |
| 1660676817096 | [2022/08/16 19:06:57] [ info] forward |
| 1660676817096 | [2022/08/16 19:06:57] [ info] forward |
| 1660676817096 | [2022/08/16 19:06:57] [ info] tcp |
| 1660676817096 | [2022/08/16 19:06:57] [ info] forward |
| 1660676817097 | [2022/08/16 19:06:57] [ info] ___________ |
| 1660676817098 | [2022/08/16 19:06:57] [ info] filters: |
| 1660676817099 | [2022/08/16 19:06:57] [ info] record_modifier.0 |
| 1660676817101 | [2022/08/16 19:06:57] [ info] ___________ |
| 1660676817101 | [2022/08/16 19:06:57] [ info] outputs: |
| 1660676817101 | [2022/08/16 19:06:57] [ info] null.0 |
| 1660676817101 | [2022/08/16 19:06:57] [ info] loki.1 |
| 1660676817102 | [2022/08/16 19:06:57] [ info] ___________ |
| 1660676817102 | [2022/08/16 19:06:57] [ info] collectors: |
| 1660676817255 | [2022/08/16 19:06:57] [ info] [engine] started (pid=1) |
| 1660676817257 | [2022/08/16 19:06:57] [debug] [engine] coroutine stack size: 24576 bytes (24.0K) |
| 1660676817259 | [2022/08/16 19:06:57] [debug] [storage] [cio stream] new stream registered: forward.0 |
| 1660676817318 | [2022/08/16 19:06:57] [debug] [storage] [cio stream] new stream registered: forward.1 |
| 1660676817319 | [2022/08/16 19:06:57] [debug] [storage] [cio stream] new stream registered: tcp.2 |
| 1660676817340 | [2022/08/16 19:06:57] [debug] [storage] [cio stream] new stream registered: forward.3 |
| 1660676817459 | [2022/08/16 19:06:57] [ info] [storage] version=1.1.4, initializing... |
| 1660676817523 | [2022/08/16 19:06:57] [ info] [storage] in-memory |
| 1660676817524 | [2022/08/16 19:06:57] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128 |
| 1660676817524 | [2022/08/16 19:06:57] [ info] [cmetrics] version=0.2.2 |
| 1660676817524 | [2022/08/16 19:06:57] [ info] [input:forward:forward.0] listening on unix:///var/run/fluent.sock |
| 1660676817524 | [2022/08/16 19:06:57] [debug] [in_fw] Listen='0.0.0.0' TCP_Port=24224 |
| 1660676817524 | [2022/08/16 19:06:57] [ info] [input:forward:forward.1] listening on 0.0.0.0:24224 |
| 1660676817524 | [2022/08/16 19:06:57] [ info] [input:tcp:tcp.2] listening on 127.0.0.1:8877 |
| 1660676817524 | [2022/08/16 19:06:57] [ info] [input:forward:forward.3] listening on unix:///var/run/fluent.sock |
| 1660676817524 | [2022/08/16 19:06:57] [debug] [null:null.0] created event channels: read=23 write=24 |
| 1660676817525 | [2022/08/16 19:06:57] [debug] [loki:loki.1] created event channels: read=25 write=26 |
| 1660676818223 | [2022/08/16 19:06:58] [debug] [output:loki:loki.1] remove_mpa size: 5 |
| 1660676818285 | [2022/08/16 19:06:58] [ info] [output:loki:loki.1] configured, hostname=loki-qa.elimuinformatics.com:443 |
| 1660676818320 | [2022/08/16 19:06:58] [debug] [router] match rule tcp.2:null.0 |
| 1660676818320 | [2022/08/16 19:06:58] [ info] [output:loki:loki.1] worker #0 started |
| 1660676818529 | [2022/08/16 19:06:58] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020 |
| 1660676818529 | [2022/08/16 19:06:58] [ info] [sp] stream processor started |
| 1660676825921 | ==1== Warning: client switching stacks? SP change: 0xdffa378 --> 0xb514e50 |
| 1660676825921 | ==1== to suppress, use: --max-stackframe=44979496 or greater |
| 1660676825922 | ==1== Warning: client switching stacks? SP change: 0xb514db8 --> 0xdffa378 |
| 1660676825922 | ==1== to suppress, use: --max-stackframe=44979648 or greater |
| 1660676825923 | ==1== Warning: client switching stacks? SP change: 0xdffa428 --> 0xb514db8 |
| 1660676825923 | ==1== to suppress, use: --max-stackframe=44979824 or greater |
| 1660676825923 | ==1== further instances of this message will not be shown. |
| 1660676852938 | [2022/08/16 19:07:32] [debug] [task] created task=0xb7a68b0 id=0 OK |
| 1660676852988 | [2022/08/16 19:07:32] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0 |
| 1660676854071 | [2022/08/16 19:07:34] [debug] [http_client] not using http_proxy for header |
| 1660676854144 | [2022/08/16 19:07:34] [debug] [output:loki:loki.1] loki-qa.elimuinformatics.com:443, HTTP status=204 |
| 1660676854144 | [2022/08/16 19:07:34] [debug] [upstream] KA connection #50 to loki-qa.elimuinformatics.com:443 is now available |
| 1660676854144 | [2022/08/16 19:07:34] [debug] [task] destroy task=0xb7a68b0 (task_id=0) |
| 1660676854145 | [2022/08/16 19:07:34] [debug] [out coro] cb_destroy coro_id=0 |
| 1660676856800 | ==1== Thread 7 monkey: wrk/0: |
| 1660676856800 | ==1== Invalid write of size 8 |
| 1660676856800 | ==1== at 0x9D16E7: co_swap_function (in /fluent-bit/bin/fluent-bit) |
| 1660676856800 | ==1== Address 0xe0054f8 is in a rw- anonymous segment |
| 1660676856800 | ==1== |
| 1660676856801 | ==1== Invalid write of size 8 |
| 1660676856801 | ==1== at 0x9D16EB: co_swap_function (in /fluent-bit/bin/fluent-bit) |
| 1660676856801 | ==1== Address 0xe005500 is in a rw- anonymous segment |
| 1660676856801 | ==1== |
| 1660676856801 | ==1== Invalid write of size 8 |
| 1660676856801 | ==1== at 0x9D16EF: co_swap_function (in /fluent-bit/bin/fluent-bit) |
| 1660676856801 | ==1== Address 0xe005508 is in a rw- anonymous segment |
| 1660676856801 | ==1== |
| 1660676856802 | ==1== Invalid write of size 8 |
| 1660676856802 | ==1== at 0x9D16F3: co_swap_function (in /fluent-bit/bin/fluent-bit) |
| 1660676856802 | ==1== Address 0xe005510 is in a rw- anonymous segment |
| 1660676856802 | ==1== |
| 1660676856802 | ==1== Invalid write of size 8 |
| 1660676856802 | ==1== at 0x9D16F7: co_swap_function (in /fluent-bit/bin/fluent-bit) |
| 1660676856802 | ==1== Address 0xe005518 is in a rw- anonymous segment |
| 1660676856802 | ==1== |
| 1660676856802 | ==1== Invalid write of size 8 |
| 1660676856802 | ==1== at 0x9D16FB: co_swap_function (in /fluent-bit/bin/fluent-bit) |
| 1660676856802 | ==1== Address 0xe005520 is in a rw- anonymous segment |
| 1660676856802 | ==1== |
| 1660676856803 | ==1== Invalid read of size 8 |
| 1660676856803 | ==1== at 0x9D16FF: co_swap_function (in /fluent-bit/bin/fluent-bit) |
| 1660676856803 | ==1== Address 0xe17f618 is 8 bytes inside a block of size 25,088 alloc'd |
| 1660676856803 | ==1== at 0x4C2D065: malloc (vg_replace_malloc.c:381) |
| 1660676856803 | ==1== by 0x9D2800: co_create (amd64.c:142) |
| 1660676856803 | ==1== by 0x9C76CD: mk_http_thread_create (mk_http_thread.c:217) |
| 1660676856803 | ==1== by 0x9C3B31: mk_http_init (mk_http.c:748) |
| 1660676856803 | ==1== by 0x9C29FF: mk_http_request_prepare (mk_http.c:232) |
| 1660676856803 | ==1== by 0x9C5773: mk_http_sched_read (mk_http.c:1568) |
| 1660676856803 | ==1== by 0x9C1647: mk_sched_event_read (mk_scheduler.c:693) |
| 1660676856803 | ==1== by 0x9C9D32: mk_server_worker_loop (mk_server.c:487) |
| 1660676856803 | ==1== by 0x9C100F: mk_sched_launch_worker_loop (mk_scheduler.c:416) |
| 1660676856803 | ==1== by 0x4E4444A: start_thread (in /usr/lib64/libpthread-2.26.so) |
| 1660676856803 | ==1== by 0x686C56E: clone (in /usr/lib64/libc-2.26.so) |
| 1660676856803 | ==1== |
| 1660676856803 | ==1== Invalid read of size 8 |
| 1660676856803 | ==1== at 0x9D1703: co_swap_function (in /fluent-bit/bin/fluent-bit) |
| 1660676856803 | ==1== Address 0xe17f620 is 16 bytes inside a block of size 25,088 alloc'd |
| 1660676856803 | ==1== at 0x4C2D065: malloc (vg_replace_malloc.c:381) |
| 1660676856803 | ==1== by 0x9D2800: co_create (amd64.c:142) |
| 1660676856803 | ==1== by 0x9C76CD: mk_http_thread_create (mk_http_thread.c:217) |
| 1660676856803 | ==1== by 0x9C3B31: mk_http_init (mk_http.c:748) |
| 1660676856803 | ==1== by 0x9C29FF: mk_http_request_prepare (mk_http.c:232) |
| 1660676856803 | ==1== by 0x9C5773: mk_http_sched_read (mk_http.c:1568) |
| 1660676856803 | ==1== by 0x9C1647: mk_sched_event_read (mk_scheduler.c:693) |
| 1660676856803 | ==1== by 0x9C9D32: mk_server_worker_loop (mk_server.c:487) |
| 1660676856803 | ==1== by 0x9C100F: mk_sched_launch_worker_loop (mk_scheduler.c:416) |
| 1660676856803 | ==1== by 0x4E4444A: start_thread (in /usr/lib64/libpthread-2.26.so) |
| 1660676856803 | ==1== by 0x686C56E: clone (in /usr/lib64/libc-2.26.so) |
| 1660676856803 | ==1== |
| 1660676856804 | ==1== Invalid read of size 8 |
| 1660676856804 | ==1== at 0x9D1707: co_swap_function (in /fluent-bit/bin/fluent-bit) |
| 1660676856804 | ==1== Address 0xe17f628 is 24 bytes inside a block of size 25,088 alloc'd |
| 1660676856804 | ==1== at 0x4C2D065: malloc (vg_replace_malloc.c:381) |
| 1660676856804 | ==1== by 0x9D2800: co_create (amd64.c:142) |
| 1660676856804 | ==1== by 0x9C76CD: mk_http_thread_create (mk_http_thread.c:217) |
| 1660676856804 | ==1== by 0x9C3B31: mk_http_init (mk_http.c:748) |
| 1660676856804 | ==1== by 0x9C29FF: mk_http_request_prepare (mk_http.c:232) |
| 1660676856804 | ==1== by 0x9C5773: mk_http_sched_read (mk_http.c:1568) |
| 1660676856804 | ==1== by 0x9C1647: mk_sched_event_read (mk_scheduler.c:693) |
| 1660676856804 | ==1== by 0x9C9D32: mk_server_worker_loop (mk_server.c:487) |
| 1660676856804 | ==1== by 0x9C100F: mk_sched_launch_worker_loop (mk_scheduler.c:416) |
| 1660676856804 | ==1== by 0x4E4444A: start_thread (in /usr/lib64/libpthread-2.26.so) |
| 1660676856804 | ==1== by 0x686C56E: clone (in /usr/lib64/libc-2.26.so) |
| 1660676856804 | ==1== |
| 1660676856804 | ==1== Invalid read of size 8 |
| 1660676856804 | ==1== at 0x9D170B: co_swap_function (in /fluent-bit/bin/fluent-bit) |
| 1660676856804 | ==1== Address 0xe17f630 is 32 bytes inside a block of size 25,088 alloc'd |
| 1660676856804 | ==1== at 0x4C2D065: malloc (vg_replace_malloc.c:381) |
| 1660676856804 | ==1== by 0x9D2800: co_create (amd64.c:142) |
| 1660676856804 | ==1== by 0x9C76CD: mk_http_thread_create (mk_http_thread.c:217) |
| 1660676856804 | ==1== by 0x9C3B31: mk_http_init (mk_http.c:748) |
| 1660676856804 | ==1== by 0x9C29FF: mk_http_request_prepare (mk_http.c:232) |
| 1660676856804 | ==1== by 0x9C5773: mk_http_sched_read (mk_http.c:1568) |
| 1660676856804 | ==1== by 0x9C1647: mk_sched_event_read (mk_scheduler.c:693) |
| 1660676856804 | ==1== by 0x9C9D32: mk_server_worker_loop (mk_server.c:487) |
| 1660676856804 | ==1== by 0x9C100F: mk_sched_launch_worker_loop (mk_scheduler.c:416) |
| 1660676856804 | ==1== by 0x4E4444A: start_thread (in /usr/lib64/libpthread-2.26.so) |
| 1660676856804 | ==1== by 0x686C56E: clone (in /usr/lib64/libc-2.26.so) |
| 1660676856804 | ==1== |
| 1660676856804 | ==1== Invalid read of size 8 |
| 1660676856804 | ==1== at 0x9D170F: co_swap_function (in /fluent-bit/bin/fluent-bit) |
| 1660676856804 | ==1== Address 0xe17f638 is 40 bytes inside a block of size 25,088 alloc'd |
| 1660676856804 | ==1== at 0x4C2D065: malloc (vg_replace_malloc.c:381) |
| 1660676856804 | ==1== by 0x9D2800: co_create (amd64.c:142) |
| 1660676856804 | ==1== by 0x9C76CD: mk_http_thread_create (mk_http_thread.c:217) |
| 1660676856804 | ==1== by 0x9C3B31: mk_http_init (mk_http.c:748) |
| 1660676856804 | ==1== by 0x9C29FF: mk_http_request_prepare (mk_http.c:232) |
| 1660676856804 | ==1== by 0x9C5773: mk_http_sched_read (mk_http.c:1568) |
| 1660676856804 | ==1== by 0x9C1647: mk_sched_event_read (mk_scheduler.c:693) |
| 1660676856804 | ==1== by 0x9C9D32: mk_server_worker_loop (mk_server.c:487) |
| 1660676856804 | ==1== by 0x9C100F: mk_sched_launch_worker_loop (mk_scheduler.c:416) |
| 1660676856804 | ==1== by 0x4E4444A: start_thread (in /usr/lib64/libpthread-2.26.so) |
| 1660676856804 | ==1== by 0x686C56E: clone (in /usr/lib64/libc-2.26.so) |
| 1660676856804 | ==1== |
| 1660676856805 | ==1== Invalid read of size 8 |
| 1660676856805 | ==1== at 0x9D1713: co_swap_function (in /fluent-bit/bin/fluent-bit) |
| 1660676856805 | ==1== Address 0xe17f640 is 48 bytes inside a block of size 25,088 alloc'd |
| 1660676856805 | ==1== at 0x4C2D065: malloc (vg_replace_malloc.c:381) |
| 1660676856805 | ==1== by 0x9D2800: co_create (amd64.c:142) |
| 1660676856805 | ==1== by 0x9C76CD: mk_http_thread_create (mk_http_thread.c:217) |
| 1660676856805 | ==1== by 0x9C3B31: mk_http_init (mk_http.c:748) |
| 1660676856805 | ==1== by 0x9C29FF: mk_http_request_prepare (mk_http.c:232) |
| 1660676856805 | ==1== by 0x9C5773: mk_http_sched_read (mk_http.c:1568) |
| 1660676856805 | ==1== by 0x9C1647: mk_sched_event_read (mk_scheduler.c:693) |
| 1660676856805 | ==1== by 0x9C9D32: mk_server_worker_loop (mk_server.c:487) |
| 1660676856805 | ==1== by 0x9C100F: mk_sched_launch_worker_loop (mk_scheduler.c:416) |
| 1660676856805 | ==1== by 0x4E4444A: start_thread (in /usr/lib64/libpthread-2.26.so) |
| 1660676856805 | ==1== by 0x686C56E: clone (in /usr/lib64/libc-2.26.so) |
| 1660676856805 | ==1== |
| 1660676856805 | ==1== Invalid read of size 8 |
| 1660676856805 | ==1== at 0x9C7319: thread_get_libco_params (mk_http_thread.c:54) |
| 1660676856805 | ==1== Address 0xe005700 is in a rw- anonymous segment |
| 1660676856805 | ==1== |
| 1660676856806 | ==1== Invalid read of size 8 |
| 1660676856806 | ==1== at 0x9C7329: thread_get_libco_params (mk_http_thread.c:57) |
| 1660676856806 | ==1== Address 0xe0054b8 is in a rw- anonymous segment |
| 1660676856806 | ==1== |
| 1660676856806 | ==1== Invalid read of size 8 |
| 1660676856806 | ==1== at 0x9C73F9: thread_cb_init_vars (mk_http_thread.c:108) |
| 1660676856806 | ==1== Address 0xe17f580 is 0 bytes inside a block of size 80 alloc'd |
| 1660676856806 | ==1== at 0x4C2D065: malloc (vg_replace_malloc.c:381) |
| 1660676856806 | ==1== by 0x9C716F: mk_mem_alloc (mk_memory.h:53) |
| 1660676856806 | ==1== by 0x9C75EA: mk_thread_new (mk_thread_libco.h:108) |
| 1660676856806 | ==1== by 0x9C75EA: mk_http_thread_create (mk_http_thread.c:199) |
| 1660676856806 | ==1== by 0x9C3B31: mk_http_init (mk_http.c:748) |
| 1660676856806 | ==1== by 0x9C29FF: mk_http_request_prepare (mk_http.c:232) |
| 1660676856806 | ==1== by 0x9C5773: mk_http_sched_read (mk_http.c:1568) |
| 1660676856806 | ==1== by 0x9C1647: mk_sched_event_read (mk_scheduler.c:693) |
| 1660676856806 | ==1== by 0x9C9D32: mk_server_worker_loop (mk_server.c:487) |
| 1660676856806 | ==1== by 0x9C100F: mk_sched_launch_worker_loop (mk_scheduler.c:416) |
| 1660676856806 | ==1== by 0x4E4444A: start_thread (in /usr/lib64/libpthread-2.26.so) |
| 1660676856806 | ==1== by 0x686C56E: clone (in /usr/lib64/libc-2.26.so) |
| 1660676856806 | ==1== |
| 1660676856807 | ==1== Invalid read of size 8 |
| 1660676856807 | ==1== at 0x9D287D: co_switch (amd64.c:156) |
| 1660676856807 | ==1== Address 0xe005700 is in a rw- anonymous segment |
| 1660676856807 | ==1== |
| 1660676856807 | ==1== Invalid read of size 8 |
| 1660676856807 | ==1== at 0x9D288D: co_switch (amd64.c:157) |
| 1660676856807 | ==1== Address 0xe0056f0 is in a rw- anonymous segment |
| 1660676856807 | ==1== |
| 1660676856808 | ==1== Invalid read of size 8 |
| 1660676856808 | ==1== at 0x9D2897: co_switch (amd64.c:158) |
| 1660676856808 | ==1== Address 0xe005700 is in a rw- anonymous segment |
| 1660676856808 | ==1== |
| 1660676856809 | ==1== Invalid write of size 8 |
| 1660676856809 | ==1== at 0x9D28AE: co_switch (amd64.c:158) |
| 1660676856809 | ==1== Address 0xe0056f0 is in a rw- anonymous segment |
| 1660676856809 | ==1== |
| 1660676856809 | ==1== Invalid read of size 8 |
| 1660676856809 | ==1== at 0x9D28B1: co_switch (amd64.c:158) |
| 1660676856809 | ==1== Address 0xe005700 is in a rw- anonymous segment |
| 1660676856809 | ==1== |
| 1660676856810 | ==1== Invalid read of size 8 |
| 1660676856810 | ==1== at 0x9D28C1: co_switch (amd64.c:158) |
| 1660676856810 | ==1== Address 0xe0056f0 is in a rw- anonymous segment |
| 1660676856810 | ==1== |
| 1660676856811 | ==1== Invalid write of size 8 |
| 1660676856811 | ==1== at 0x9D16E0: co_swap_function (in /fluent-bit/bin/fluent-bit) |
| 1660676856811 | ==1== Address 0xe17f610 is 0 bytes inside a block of size 25,088 alloc'd |
| 1660676856811 | ==1== at 0x4C2D065: malloc (vg_replace_malloc.c:381) |
| 1660676856811 | ==1== by 0x9D2800: co_create (amd64.c:142) |
| 1660676856811 | ==1== by 0x9C76CD: mk_http_thread_create (mk_http_thread.c:217) |
| 1660676856811 | ==1== by 0x9C3B31: mk_http_init (mk_http.c:748) |
| 1660676856811 | ==1== by 0x9C29FF: mk_http_request_prepare (mk_http.c:232) |
| 1660676856811 | ==1== by 0x9C5773: mk_http_sched_read (mk_http.c:1568) |
| 1660676856811 | ==1== by 0x9C1647: mk_sched_event_read (mk_scheduler.c:693) |
| 1660676856811 | ==1== by 0x9C9D32: mk_server_worker_loop (mk_server.c:487) |
| 1660676856811 | ==1== by 0x9C100F: mk_sched_launch_worker_loop (mk_scheduler.c:416) |
| 1660676856811 | ==1== by 0x4E4444A: start_thread (in /usr/lib64/libpthread-2.26.so) |
| 1660676856811 | ==1== by 0x686C56E: clone (in /usr/lib64/libc-2.26.so) |
-----------------------------------------------------------------------------------------------------------------------------------
I just noticed that the example to add valgrind had an old version of aws-for-fluent-bit - version 2.21.0. The log above was from running that version. That ran successfully for nearly two days without exiting/crashing. I'm going to try with version 2.28.0 now.
> I just noticed that the example to add valgrind had an old version of aws-for-fluent-bit - version 2.21.0.
@hankwallace Where did you see this? Our Dockerfile.debug is always supposed to be on the same version as the prod release: https://github.com/aws/aws-for-fluent-bit/blob/mainline/Dockerfile.debug#L4
@PettitWesley I saw it here: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md
Also, I added the --error-limit=no option after seeing valgrind complain that there were so many errors in the log that it wasn't going to show any more.
@PettitWesley there's another change needed to get it to work with 2.28. Want me to submit a PR for it?
@hankwallace Sure
@PettitWesley I just submitted https://github.com/aws/aws-for-fluent-bit/pull/413
I have not seen a full crash in the last 2 days, but there are still lots of errors in the logs now that I'm using a build with valgrind. Is it possible that valgrind is preventing the crashes? Would uploading a log file just with the errors be helpful?
@hankwallace Please do, more data is usually helpful
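For anyone who wants to share only the errors from a Valgrind run, here is a minimal Python sketch that tallies each reported error by kind and by the function in its first stack frame. It assumes the log looks like the CloudWatch export earlier in this thread (an "==1== Invalid write of size 8" line followed by an "==1== at 0x...: function" line). The names summarize_valgrind and SAMPLE are illustrative only and are not part of any tool mentioned here.

```python
import re
from collections import Counter

# Matches the error headline, e.g. "==1== Invalid write of size 8".
ERROR_RE = re.compile(r"==\d+==\s+(Invalid (?:read|write) of size \d+)")
# Matches a stack frame that starts with "at", e.g. "==1== at 0x9D16E7: co_swap_function (...)".
FRAME_RE = re.compile(r"==\d+==\s+at 0x[0-9A-Fa-f]+: (\w+)")


def summarize_valgrind(lines):
    """Count (error kind, faulting function) pairs in Valgrind output lines."""
    counts = Counter()
    pending = None  # error headline still waiting for its first "at" frame
    for line in lines:
        err = ERROR_RE.search(line)
        if err:
            pending = err.group(1)
            continue
        frame = FRAME_RE.search(line)
        if frame and pending is not None:
            counts[(pending, frame.group(1))] += 1
            pending = None  # later "at" frames (e.g. the malloc site) are ignored
    return counts


# A few lines copied from the log in this thread, used as sample input.
SAMPLE = [
    "| 1660676856800 | ==1== Invalid write of size 8 |",
    "| 1660676856800 | ==1== at 0x9D16E7: co_swap_function (in /fluent-bit/bin/fluent-bit) |",
    "| 1660676856800 | ==1== Address 0xe0054f8 is in a rw- anonymous segment |",
    "| 1660676856803 | ==1== Invalid read of size 8 |",
    "| 1660676856803 | ==1== at 0x9D16FF: co_swap_function (in /fluent-bit/bin/fluent-bit) |",
    "| 1660676856803 | ==1== Address 0xe17f618 is 8 bytes inside a block of size 25,088 alloc'd |",
    "| 1660676856803 | ==1== at 0x4C2D065: malloc (vg_replace_malloc.c:381) |",
]

if __name__ == "__main__":
    for (kind, func), n in summarize_valgrind(SAMPLE).most_common():
        print(f"{n:5d}  {kind}  in {func}")
```

A summary like this may be easier to attach to an issue than the full log, though the raw stack traces are still what a maintainer would need to debug further.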
@PettitWesley Here are two log files. It's very interesting that I haven't seen a crash/exit since running it with valgrind. The error that usually occurs a little while before a crash/exit is this one:
[error] [tls] error: unexpected EOF
That often signals a retry loop until it reaches the retry limit and then shortly afterwards a crash/exit. There is at least one EOF error in each log file, but the retries succeeded in these examples.
We're also running into this issue with a similar setup.
Our ECS Fargate tasks would successfully send some logs to Grafana Cloud's Loki before encountering a TLS issue, retry, then hit [error] [tls] error: unexpected EOF before exiting and crashing the cluster. I don't have detailed logs, since we ran terraform destroy to revert the changes during an outage and the logs went with them. But I wanted to chime in and provide our configuration. Not all of our tasks using this configuration experience the issue, but this particular one had about 30 workers, fwiw. We tried with and without the healthcheck server running, with no difference.
Problem happened with Fluentbit v1.9.3 (amazon/aws-for-fluent-bit:stable) and v1.9.7
2022-08-30T12:46:08.885-07:00 [2022/08/30 19:46:08] [error] [tls] error: unexpected EOF
2022-08-30T12:46:08.885-07:00 [2022/08/30 19:46:08] [error] [output:loki:loki.0] no upstream connections available
2022-08-30T12:45:17.884-07:00 [2022/08/30 19:45:17] [ warn] [engine] failed to flush chunk '1-1661888707.868562226.flb', retry in 9 seconds: task_id=0, input=forward.1 > output=loki.0 (out_id=0)
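The log lines above can be tallied to see how many times each chunk is retried and how many TLS EOF errors occur before a task exits. This is a rough Python sketch that relies only on the two message formats shown in this thread ("[tls] error: unexpected EOF" and "failed to flush chunk '...', retry in N seconds"); the function name tally_failures and the SAMPLE list are made up for illustration.

```python
import re
from collections import Counter

# Message formats taken from the log excerpts in this thread.
TLS_EOF = "[tls] error: unexpected EOF"
RETRY_RE = re.compile(r"failed to flush chunk '([^']+)', retry in (\d+) seconds")


def tally_failures(lines):
    """Return (number of TLS EOF errors, Counter of retries per chunk name)."""
    tls_eof = 0
    retries = Counter()
    for line in lines:
        if TLS_EOF in line:
            tls_eof += 1
        match = RETRY_RE.search(line)
        if match:
            retries[match.group(1)] += 1
    return tls_eof, retries


# The three log lines quoted above, used as sample input.
SAMPLE = [
    "2022-08-30T12:46:08.885-07:00 [2022/08/30 19:46:08] [error] [tls] error: unexpected EOF",
    "2022-08-30T12:46:08.885-07:00 [2022/08/30 19:46:08] [error] [output:loki:loki.0] no upstream connections available",
    "2022-08-30T12:45:17.884-07:00 [2022/08/30 19:45:17] [ warn] [engine] failed to flush chunk "
    "'1-1661888707.868562226.flb', retry in 9 seconds: task_id=0, input=forward.1 > output=loki.0 (out_id=0)",
]

if __name__ == "__main__":
    eof_count, per_chunk = tally_failures(SAMPLE)
    print("TLS EOF errors:", eof_count)
    for chunk, n in per_chunk.items():
        print(f"{chunk}: {n} retry warning(s)")
```

A chunk whose retry count keeps climbing toward the configured retry limit would be one to look at closely just before an exit.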
service.conf
[SERVICE]
Flush ${FLUSH_INTERVAL}
[OUTPUT]
Name loki
Match *
host ${LOKI_HOST}
port 443
http_user ${LOKI_USER}
http_passwd ${SECRET_LOKI_WRITE_APIKEY}
labels agent=fluent-bit, service={SERVICE}, region=${LOG_REGION}, environment=${ENV_NAME}
tls on
tls.verify on
remove_keys container_id, ecs_task_arn
label_keys $container_name, $ecs_task_definition, $source, $ecs_cluster
line_format key_value
[OUTPUT]
Name datadog
Host http-intake.logs.datadoghq.com
Match *
TLS on
compress gzip
provider ecs
apikey ${SECRET_DD_API_KEY}
dd_service ${SERVICE_ID}
dd_source ${DD_SOURCE}
dd_message_key ${DD_MESSAGE_KEY}
dd_tags ${DD_TAGS}
task definition:
{
"name": "firelens-fluentbit",
"image": "${firelens_fluentbit_image}:${firelens_fluentbit_tag}",
"essential": true,
"cpu": 10,
"memoryReservation": 256,
"firelensConfiguration": {
"type": "fluentbit",
"options": {
"config-file-type": "file",
"config-file-value": "/service.conf",
"enable-ecs-log-metadata": "${enable_ecs_log_metadata}"
}
},
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-region": "${log_region}",
"awslogs-group": "${firelens_log_group}",
"awslogs-stream-prefix": "firelens"
}
},
"environment": ${jsonencode([for env, val in envs[
"observability"
] : {
name = tostring(env)
value = tostring(val)
}
])
},
"secrets": ${jsonencode([for env, val in secrets[
"observability"
] : {
name = tostring(env)
valueFrom = tostring(val)
}
])
},
"mountPoints": [],
"portMappings": [],
"user": "0",
"volumesFrom": [],
"healthCheck": {
"command": [ "CMD-SHELL", "pgrep fluent-bit || exit 1" ],
"interval": 10,
"retries": 3,
"timeout": 5
}
},
{
"name": "${service}",
"image": "${image}:${image_tag}",
"cpu": 0,
"essential": true,
"mountPoints": [],
"portMappings": [],
"networkMode": "awsvpc",
"logConfiguration": {
"logDriver": "awsfirelens"
},
"ulimits": [
{
"name": "nofile",
"softLimit": 65536,
"hardLimit": 65536
}
],
"environment": ${jsonencode([for env, val in envs[
"service"
] : {
name = tostring(env)
value = tostring(val)
}
])
},
"secrets": ${jsonencode([for env, val in secrets[
"service"
] : {
name = tostring(env)
valueFrom = tostring(val)
}
])
},
"volumesFrom": []
}
@PettitWesley any update on this? I am still running all containers in debug mode to prevent them from failing.
Please see:
@PettitWesley we are not currently losing logs, but ONLY because we are running the debug configuration via valgrind. The issues mentioned in the known issues don't appear to include the issue here.
Hey all,
I'm a fellow logging enthusiast and wanted to chime in on the Loki issue here, and to offer a potential solution to a problem I've had when sending ECS logs to both Loki and CloudWatch with firelens.
TL;DR: I created a firelens docker image built from the aws-for-fluent-bit base image, added the out_grafana_loki.so plugin (built from the source provided by the grafana/loki team), and then let aws-for-fluent-bit handle the log streams. This solution was made to send logs to two different endpoints (one an internal self-hosted Loki backend for log consumption, the other CloudWatch with 1 year of log retention in case Loki goes down internally).
It appears that @hankwallace has set up aws-for-fluent-bit with an extra.conf that configures buffer and flush, and that uses the loki output described in the fluentbit docs
Mem_Buf_Limit 2MB
[SERVICE]
Flush 5
Grace 30
While this can work, I believe @PettitWesley has suggested other ways to push logs using external plugin .so files, and the Grafana team provides one (you build the plugin by cloning the loki repo and making the .so file).
I got set up by looking at this comment about using the newrelic plugin and creating my own image from the base aws-for-fluent-bit
; curl -L -o /fluent-bit/newrelic.so "https://github.com/newrelic/newrelic-fluent-bit-output/releases/download/v1.7.0/out_newrelic-linux-amd64-1.7.0.so"
; chmod +x /fluent-bit/newrelic.so
; printf "[PLUGINS]\n Path /fluent-bit/newrelic.so\n" > /fluent-bit/etc/plugins.conf
; echo "${FLUENTBIT_CONFIG}" > /opt/fluentbit.conf
I too was trying to find a working solution, because I was able to get CloudWatch logs working properly with aws-for-fluent-bit, but it wasn't working for Loki. I was losing logs after the initial startup when using just the fluentbit output, because I don't believe that the Loki output on the fluent-bit side uses the aws-for-fluent-bit buffer and flush mechanism. I was seeing logs not showing up in Loki at all. (I tested this on Loki's cloud solution too; it wasn't a problem with my internal Loki backend. I can provide additional context to prove this if necessary.)
Then I tried Loki's firelens solution, where they build the dynamic plugin library and place it into an image to be used as a firelens image. When I used their proposed image, grafana/fluent-bit-plugin-loki:2.7.4-amd64, I wasn't able to get the CloudWatch endpoint to work, and I wasn't given the flexibility to change the fluent.conf file. It felt like the image was also probably using its own flush/buffer system. I haven't really pursued this angle, so my knowledge is limited.
I also share @hankwallace 's concern on not being able to find something within the grafana or fluentbit communities, it seemed like there wasn't a lot of tutorial support on how to diagnose or fix an issue like this. But you left a lot of great nuggets in your comments @PettitWesley and I appreciate all AWS blog posts around firelens.
So I wanted to combine the two solutions.
First create the loki plugin from source.
ssh ECSBOX-AMD64-CHIPSET # to make sure whatever ECS system can use that plugin
git clone https://github.com/grafana/loki
cd loki
make fluent-bit-plugin # makefile in there to create the binary for grafana/loki for firelens
exit
scp EC2BOX-AMD64-CHIPSET:/clonepath/loki/clients/cmd/fluent-bit/out_grafana_loki.so .
Then I created my own docker image from the init image and loaded in the plugin.
FROM public.ecr.aws/aws-observability/aws-for-fluent-bit:init-latest
ADD out_grafana_loki.so /fluent-bit/
ADD fluent-bit.conf /fluent-bit/alt/fluent-bit.conf
CMD ["/fluent-bit/bin/fluent-bit", "-e", "/fluent-bit/cloudwatch.so", "-e", "/fluent-bit/out_grafana_loki.so", "-c", "/fluent-bit/alt/fluent-bit.conf"]
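The CMD above loads each external plugin with its own -e flag and then points Fluent Bit at the config file with -c. As a small illustration of that pattern, here is a hypothetical Python helper (build_command is not part of aws-for-fluent-bit or Fluent Bit) that assembles the same argument list from a list of plugin paths.

```python
def build_command(plugins, config_path=None, binary="/fluent-bit/bin/fluent-bit"):
    """Build a Fluent Bit argument list: one -e flag per plugin, then an optional -c config."""
    args = [binary]
    for plugin in plugins:
        args += ["-e", plugin]  # -e loads an external plugin shared object
    if config_path:
        args += ["-c", config_path]  # -c selects the configuration file
    return args


if __name__ == "__main__":
    cmd = build_command(
        ["/fluent-bit/cloudwatch.so", "/fluent-bit/out_grafana_loki.so"],
        "/fluent-bit/alt/fluent-bit.conf",
    )
    print(cmd)
```

Called with the two plugin paths and the config path from the Dockerfile, it produces the same list as the CMD line, which makes it easy to see that adding another plugin only means adding another -e pair.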
I created my own alt fluentbit conf to be used (altered to just use samples; I have the task def configured to create the log groups and other components necessary to make this work):
grafana-loki config options can be found here
[INPUT]
Name forward
unix_path /var/run/fluent.sock
[SERVICE]
Grace 30
[OUTPUT]
Name cloudwatch
Match *
region us-west-2
log_group_name /ecs/service
log_key log
log_stream_prefix ecs-
auto_create_group true
[Output]
Name grafana-loki
Match *
Url http://LOKIENDPOINT:3100/loki/api/v1/push
BatchWait 1s
TenantID tenantsample
BatchSize 30720
# (30KiB)
Labels {job="fluenttest"}
LineFormat key_value
This is how I'm attempting to handle having two stream endpoints, and I hope it might be useful as firelens documentation that @PettitWesley could use in the future to show how to have firelens send logs to two different endpoints, each using its own plugin.
I believe using the out_grafana_loki.so plugin might resolve your TLS connection drops, @hankwallace. It is closer to what the Grafana/Loki team will support for fluentbit, while staying with what @PettitWesley and the aws-for-fluent-bit team will provide in the future for ongoing ECS firelens support, instead of trying to use the fluentbit output.
I think fluentbit probably does have fixes but it’s all in 2.X.X versions of fluentbit and while firelens is working, it’s still on fluentbit 1.9.X. There’s a lot more to dive into and figure out but I hope this has helped.
So my work wasn't complete, and I didn't fully grasp the init concepts, but I was able to work around that and get it running, so I'm making a forked init image. I want all the ECS metadata from the init process, but I also want to pass that info to Loki.
So add in the plugin line to the init image: https://github.com/aws/aws-for-fluent-bit/blob/v2.31.4/Dockerfile.init#L28
FROM amazon/aws-for-fluent-bit:latest
+ADD out_grafana_loki.so /fluent-bit/
RUN mkdir -p /init
then tell the init process to include that plugin in the base command:
https://github.com/aws/aws-for-fluent-bit/blob/v2.31.4/init/fluent_bit_init_process.go#L32
// default Fluent Bit command
- baseCommand = "exec /fluent-bit/bin/fluent-bit -e /fluent-bit/firehose.so -e /fluent-bit/cloudwatch.so -e /fluent-bit/kinesis.so"
+ baseCommand = "exec /fluent-bit/bin/fluent-bit -e /fluent-bit/firehose.so -e /fluent-bit/cloudwatch.so -e /fluent-bit/kinesis.so -e /fluent-bit/out_grafana_loki.so"
// global s3 client and flag
Sorry to extend this and if you need to move this to a separate issue or if there is a way to include this in a new feature to add in more plugins, please do so, just wanted to get this information to a fellow AWS user who wants to use loki and to see if this can help resolve their issue from a better maintained angle.
@wick02 Yea I think a different issue is needed for this. Also explain in it why the upstream loki doesn't work for you: https://docs.fluentbit.io/manual/pipeline/outputs/loki
Any updates on this issue? We are still running the debug container (using valgrind) to limit the crashes/failures.
Any updates? How can I help to move this forward? We are still using the debug container because the non-debug one fails intermittently.
Describe the question/issue
The aws-for-fluent-bit log router stops sending logs to Loki through a HTTPS proxy after a connection/tls failure. The container sometimes exits shortly after and doesn't have anything in its log to indicate why. This causes the entire ECS task to restart because I have the log router container essential=true so that we don't lose logs for a long period of time.
I have searched the issues here and in the fluent-bit repo. I have also searched the Grafana and Fluent slack communities.
Configuration
Deployment:
awsfirelens to route logs to the aws-for-fluent-bit container
aws-for-fluent-bit container is routing logs to a loki task in the same cluster
Relevant parts of ECS task definition. The first container is the web app and the second is the log router:
The extra.conf file contains:
Fluent Bit Log Output
Here's a partial log file where the error starts, the container fails to send any more logs (even on the retries), and then exits - killing the entire task because I have essential=true.
Fluent Bit Version Info
We are running the current stable version.
Cluster Details