aws / aws-for-fluent-bit

The source of the amazon/aws-for-fluent-bit container image
Apache License 2.0

Loki Output to Grafana Cloud - Consistent SIGSEGV #771

Open conbon opened 11 months ago

conbon commented 11 months ago
### Describe the question/issue

I am running several containers in ECS Fargate with the logConfiguration set to "awsfirelens". I am also configuring a fluent-bit container (using the aws-for-fluent-bit) to override/template configuration - mainly to allow us to set the Mem_Buf_Limit. I can see logs coming through in Grafana Cloud as normal, but as I bump the load, I very quickly get a SIGSEGV in the cloudwatch logs for fluent-bit container and the whole task exits.

### Configuration

Dockerfile:

```
ARG UPSTREAM_IMAGE_TAG
FROM amazon/aws-for-fluent-bit:${UPSTREAM_IMAGE_TAG}
ADD fluent-bit.conf /fluent-bit/alt/fluent-bit.conf
CMD ["/fluent-bit/bin/fluent-bit", "-c", "/fluent-bit/alt/fluent-bit.conf"]
```

fluent-bit.conf:

```
[SERVICE]
    Grace 5
    Flush 5

# This is the required input to receive container stdout & stderr logs
# with FireLens
[INPUT]
    Name forward
    unix_path /var/run/fluent.sock
    # default memory buffer only for logs collected by this input
    storage.type memory
    # Total Max Memory Usage <= 2 * SUM(Each input Mem_Buf_Limit)
    Mem_Buf_Limit ${mem_buf_limit}

[Output]
    Name loki
    Match *
    tls on
    tls.verify on
    host ${loki_host}
    port ${loki_port}
    http_user ${loki_user}
    http_passwd ${loki_passwd}
    labels ${loki_labels}
    label_keys $container_name
    line_format key_value
    remove_keys ecs_cluster, ecs_task_definition, container_id
```

Partial ECS Task Definition (there are more containers present):

```
{
  "name": "redis",
  "image": "redis:6.2.13-alpine",
  "repositoryCredentials": {
    "credentialsParameter": "xxx"
  },
  "cpu": 0,
  "portMappings": [],
  "essential": true,
  "command": [
    "redis-server",
    "--port",
    "6379",
    "--protected-mode",
    "no",
    "--tcp-backlog",
    "128",
    "--loglevel",
    "notice",
    "--save",
    "",
    "--maxclients",
    "6144",
    "--maxmemory",
    "256mb"
  ],
  "environment": [],
  "mountPoints": [],
  "volumesFrom": [],
  "linuxParameters": {
    "capabilities": {
      "add": [],
      "drop": []
    },
    "devices": [],
    "initProcessEnabled": true,
    "tmpfs": []
  },
  "readonlyRootFilesystem": false,
  "ulimits": [
    {
      "name": "nofile",
      "softLimit": 8192,
      "hardLimit": 8192
    }
  ],
  "logConfiguration": {
    "logDriver": "awsfirelens"
  },
  "healthCheck": {
    "command": [
      "CMD-SHELL",
      "redis-cli ping | grep -Eq '^PONG\\s*$' || exit 1"
    ],
    "interval": 30,
    "timeout": 5,
    "retries": 3,
    "startPeriod": 0
  }
},
{
  "name": "fluent-bit",
  "image": "fluent-bit:pr-stable",
  "repositoryCredentials": {
    "credentialsParameter": "xxx"
  },
  "cpu": 0,
  "memory": 75,
  "portMappings": [],
  "essential": true,
  "environment": [
    { "name": "FLB_LOG_LEVEL", "value": "debug" },
    { "name": "mem_buf_limit", "value": "30MB" },
    { "name": "loki_host", "value": "logs-prod-eu-west-0.grafana.net" },
    { "name": "loki_port", "value": "443" },
    { "name": "loki_user", "value": "xxx" },
    { "name": "loki_passwd", "value": "xxx" },
    { "name": "loki_labels", "value": "env=dev,network=test" }
  ],
  "mountPoints": [],
  "volumesFrom": [],
  "linuxParameters": {
    "capabilities": {
      "add": [],
      "drop": []
    },
    "devices": [],
    "initProcessEnabled": true,
    "tmpfs": []
  },
  "user": "0",
  "readonlyRootFilesystem": false,
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-create-group": "true",
      "awslogs-group": "/log-router",
      "awslogs-region": "eu-west-1",
      "awslogs-stream-prefix": "ecs"
    }
  },
  "firelensConfiguration": {
    "type": "fluentbit",
    "options": {
      "enable-ecs-log-metadata": "true",
      "config-file-type": "file",
      "config-file-value": "/fluent-bit/alt/fluent-bit.conf"
    }
  }
},
```

### Fluent Bit Log Output

```
20 November 2023 at 16:02 (UTC) #13 0x4e2ef7 in output_pre_cb_flush() at include/fluent-bit/flb_output.h:522 adf33da5fc3c4937979427eff533b933 fluent-bit
20 November 2023 at 16:02 (UTC) #14 0xa4fea6 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117 adf33da5fc3c4937979427eff533b933 fluent-bit
20 November 2023 at 16:02 (UTC) #15 0xffffffffffffffff in ???() at ???:0 adf33da5fc3c4937979427eff533b933 fluent-bit
20 November 2023 at 16:02 (UTC) #3 0x45f6da in arena_dalloc_large() at lib/jemalloc-5.2.1/include/jemalloc/internal/arena_inlines_b.h:281 adf33da5fc3c4937979427eff533b933 fluent-bit
20 November 2023 at 16:02 (UTC) #4 0x45f6da in arena_dalloc() at lib/jemalloc-5.2.1/include/jemalloc/internal/arena_inlines_b.h:323 adf33da5fc3c4937979427eff533b933 fluent-bit
20 November 2023 at 16:02 (UTC) #5 0x45f6da in idalloctm() at lib/jemalloc-5.2.1/include/jemalloc/internal/jemalloc_internal_inlines_c.h:118 adf33da5fc3c4937979427eff533b933 fluent-bit
20 November 2023 at 16:02 (UTC) #6 0x45f6da in ifree() at lib/jemalloc-5.2.1/src/jemalloc.c:2589 adf33da5fc3c4937979427eff533b933 fluent-bit
20 November 2023 at 16:02 (UTC) #7 0x45f6da in je_free_default() at lib/jemalloc-5.2.1/src/jemalloc.c:2799 adf33da5fc3c4937979427eff533b933 fluent-bit
20 November 2023 at 16:02 (UTC) #8 0x4dbd22 in flb_free() at include/fluent-bit/flb_mem.h:120 adf33da5fc3c4937979427eff533b933 fluent-bit
20 November 2023 at 16:02 (UTC) #9 0x4dd014 in flb_sds_destroy() at src/flb_sds.c:470 adf33da5fc3c4937979427eff533b933 fluent-bit
20 November 2023 at 16:02 (UTC) #10 0x5b41ea in pack_record() at plugins/out_loki/loki.c:992 adf33da5fc3c4937979427eff533b933 fluent-bit
20 November 2023 at 16:02 (UTC) #11 0x5b4659 in loki_compose_payload() at plugins/out_loki/loki.c:1140 adf33da5fc3c4937979427eff533b933 fluent-bit
20 November 2023 at 16:02 (UTC) #12 0x5b4738 in cb_loki_flush() at plugins/out_loki/loki.c:1167 adf33da5fc3c4937979427eff533b933 fluent-bit
20 November 2023 at 16:02 (UTC) #0 0x49e372 in atomic_load_p() at lib/jemalloc-5.2.1/include/jemalloc/internal/atomic.h:62 adf33da5fc3c4937979427eff533b933 fluent-bit
20 November 2023 at 16:02 (UTC) #1 0x49e372 in extent_arena_get() at lib/jemalloc-5.2.1/include/jemalloc/internal/extent_inlines.h:51 adf33da5fc3c4937979427eff533b933 fluent-bit
20 November 2023 at 16:02 (UTC) #2 0x49e372 in je_large_dalloc() at lib/jemalloc-5.2.1/src/large.c:361 adf33da5fc3c4937979427eff533b933 fluent-bit
20 November 2023 at 16:02 (UTC) [2023/11/20 16:02:21] [debug] [task] created task=0x7efcd9a41c50 id=2 OK adf33da5fc3c4937979427eff533b933 fluent-bit
20 November 2023 at 16:02 (UTC) [2023/11/20 16:02:21] [debug] [task] created task=0x7efcd9a427b0 id=17 OK adf33da5fc3c4937979427eff533b933 fluent-bit
20 November 2023 at 16:02 (UTC) [2023/11/20 16:02:21] [engine] caught signal (SIGSEGV)
```

### Fluent Bit Version Info

Which AWS for Fluent Bit Versions have you tried?

I have tried a whole list of versions:

* 2.23.0
* 2.32.0
* stable
* latest
* + more

### Cluster Details

* fargate with OOTB service discovery
* 10 containers per task (including fluent-bit)

### Application Details

We are attempting to keep the fluent-bit container under 100MB hard docker limit & therefore need to configure the Mem_Buf_Limit. We have set this at 30MB currently due to info indicating: `Total Max Memory Usage <= 2 * SUM(Each input Mem_Buf_Limit)`

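
As a rough sanity check of the sizing rule quoted above, the arithmetic can be sketched in Python. This is only an illustration: the function name `max_memory_bound_mb` is hypothetical, the values come from the task definition in this report (one input with a 30MB `Mem_Buf_Limit`, container `memory` of 75), and MB and MiB are treated as equal for simplicity. The rule is an estimate taken from the config comment, not a guarantee, and it says nothing about the SIGSEGV itself.

```python
def max_memory_bound_mb(mem_buf_limits_mb):
    """Upper-bound estimate from the quoted rule:
    Total Max Memory Usage <= 2 * SUM(Each input Mem_Buf_Limit)."""
    return 2 * sum(mem_buf_limits_mb)


# Values taken from the report: a single forward input limited to 30MB,
# and a fluent-bit container with "memory": 75 in the task definition.
input_limits_mb = [30]
container_limit_mb = 75

bound_mb = max_memory_bound_mb(input_limits_mb)
fits = bound_mb <= container_limit_mb

print(f"estimated bound: {bound_mb}MB, container limit: {container_limit_mb}MB, fits: {fits}")
```

Under these assumptions the estimated bound stays below the container limit, which suggests the buffer sizing alone does not explain the task exiting.
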
medampudi commented 9 months ago

We seem to have the same issue on our end: the log router fails when we try to route the data to multiple destinations, one being CloudWatch and the other our OSS Loki implementation.

jakegut commented 8 months ago

This behavior is fixed in more recent versions of Fluent Bit (after v2.0.7, I believe); see https://github.com/fluent/fluent-bit/commit/a93117c0556ae78706499ee218e7eab1efdc40df

The latest version of AWS for Fluent Bit (2.32.0) only includes Fluent Bit 1.9.10.
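
The version comparison implied here can be sketched as follows. This is a minimal illustration, assuming the fix landed after v2.0.7 as tentatively stated above; `parse_version` is a hypothetical helper, not part of any Fluent Bit tooling. Comparing numeric tuples avoids the trap of comparing version strings, where "1.9.10" would sort before "1.9.9".

```python
def parse_version(text):
    """Turn a dotted version such as 'v2.0.7' or '1.9.10' into a tuple of ints."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


# Versions quoted in this thread.
bundled = parse_version("1.9.10")   # Fluent Bit inside AWS for Fluent Bit 2.32.0
fix_after = parse_version("v2.0.7")  # fix believed to be in releases after this one

# True only if the bundled release is newer than the release the fix follows.
likely_has_fix = bundled > fix_after

print(f"bundled={bundled}, fix after={fix_after}, likely has fix: {likely_has_fix}")
```

A check like this only compares upstream version numbers; a distribution could carry backported patches, which it cannot see.
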