Stops sending logs after connection/tls failure

hankwallace commented 2 years ago

Describe the question/issue

The aws-for-fluent-bit log router stops sending logs to Loki through a HTTPS proxy after a connection/tls failure. The container sometimes exits shortly after and doesn't have anything in its log to indicate why. This causes the entire ECS task to restart because I have the log router container essential=true so that we don't lose logs for a long period of time.

I have searched the issues here and in the fluent-bit repo. I have also searched the Grafana and Fluent slack communities.

Configuration

Deployment:

ECS cluster (not Fargate)
Primary container is using awsfirelens to route logs to the aws-for-fluent-bit container
The aws-for-fluent-bit container is routing logs to a loki task in the same cluster
The loki task is using S3 for storage

Relevant parts of ECS task definition. The first container is the web app and the second is the log router:

{
  "containerDefinitions": [
    {
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "options": {
          "Name": "loki",
          "host": "loki-qa.elimuinformatics.com",
          "port": "443",
          "tls": "On",
          "http_user": "loki",
          "http_passwd": "<hidden>",
          "net.keepalive": "false",
          "workers": "1",
          "Retry_Limit": "5",
          "labels": "env=qa,service=sapphire-web",
          "label_keys": "$ecs_task_definition,$ec2_instance_id",
          "remove_keys": "container_id,container_name,ecs_task_arn",
          "line_format": "key_value"
        }
      },
      "memory": 4096,
      "memoryReservation": 256,
      "dependsOn": [
        {
          "containerName": "sapphire-web-log-router",
          "condition": "HEALTHY"
        }
      ],
      "essential": true,
      "name": "sapphire-web-qa"
    },
    {
      "logConfiguration": {
        "logDriver": "awslogs",
        "secretOptions": null,
        "options": {
          "awslogs-group": "/aws/ecs/firelens/sapphire-qa",
          "awslogs-region": "us-west-2",
          "awslogs-create-group": "true",
          "awslogs-stream-prefix": "firelens"
        }
      },
      "environment": [
        {
          "name": "FLB_LOG_LEVEL",
          "value": "debug"
        }
      ],
      "memoryReservation": 50,
      "firelensConfiguration": {
        "type": "fluentbit",
        "options": {
          "config-file-type": "file",
          "enable-ecs-log-metadata": "true",
          "config-file-value": "/extra.conf"
        }
      },
      "healthCheck": {
        "retries": 2,
        "command": [
          "CMD-SHELL",
          "curl -f http://127.0.0.1:2020/api/v1/uptime || exit 1"
        ],
        "timeout": 5,
        "interval": 10,
        "startPeriod": 30
      },
      "essential": true,
      "name": "sapphire-web-log-router"
    }
  ]
}

The extra.conf file contains:

[INPUT]
    Name forward
    unix_path /var/run/fluent.sock
    Mem_Buf_Limit 2MB

[SERVICE]
    Flush 5
    Grace 30
    # Healh check setup
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_PORT 2020
    Health_Check On 
    HC_Errors_Count 5 
    HC_Retry_Failure_Count 5 
    HC_Period 30

Fluent Bit Log Output

Here's a partial log file where the error starts, the container fails to send any more logs (even on the retries), and then exits - killing the entire task because I have essential=true.

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   timestamp   |                                                                                message                                                                                 |
|---------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <snip>                                                                                                                                                                                 |
| 1660308990289 | [2022/08/12 12:56:30] [debug] [task] created task=0x7f35e0a4a200 id=0 OK                                                                                               |
| 1660308990289 | [2022/08/12 12:56:30] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0                                                                                     |
| 1660308990296 | [2022/08/12 12:56:30] [error] [tls] error: unexpected EOF                                                                                                              |
| 1660308990296 | [2022/08/12 12:56:30] [debug] [upstream] connection #75 failed to loki-qa.elimuinformatics.com:443                                                                     |
| 1660308990296 | [2022/08/12 12:56:30] [error] [output:loki:loki.1] no upstream connections available                                                                                   |
| 1660308990296 | [2022/08/12 12:56:30] [debug] [retry] new retry created for task_id=0 attempts=1                                                                                       |
| 1660308990296 | [2022/08/12 12:56:30] [debug] [out flush] cb_destroy coro_id=326                                                                                                       |
| 1660308990296 | [2022/08/12 12:56:30] [ warn] [engine] failed to flush chunk '1-1660308985.805729984.flb', retry in 6 seconds: task_id=0, input=forward.3 > output=loki.1 (out_id=1)   |
| 1660308991749 | [2022/08/12 12:56:31] [debug] [input chunk] update output instances with new chunk size diff=638                                                                       |
| 1660308992333 | [2022/08/12 12:56:32] [debug] [input chunk] update output instances with new chunk size diff=477                                                                       |
| 1660308992689 | [2022/08/12 12:56:32] [debug] [input chunk] update output instances with new chunk size diff=672                                                                       |
| 1660308993279 | [2022/08/12 12:56:33] [debug] [input chunk] update output instances with new chunk size diff=695                                                                       |
| 1660308995289 | [2022/08/12 12:56:35] [debug] [task] created task=0x7f35e0a4a430 id=1 OK                                                                                               |
| 1660308996289 | [2022/08/12 12:56:36] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0                                                                                     |
| 1660308996295 | [2022/08/12 12:56:36] [error] [tls] error: unexpected EOF                                                                                                              |
| 1660308996295 | [2022/08/12 12:56:36] [debug] [upstream] connection #75 failed to loki-qa.elimuinformatics.com:443                                                                     |
| 1660308996295 | [2022/08/12 12:56:36] [error] [output:loki:loki.1] no upstream connections available                                                                                   |
| 1660308996295 | [2022/08/12 12:56:36] [debug] [out flush] cb_destroy coro_id=327                                                                                                       |
| 1660308996295 | [2022/08/12 12:56:36] [debug] [retry] re-using retry for task_id=0 attempts=2                                                                                          |
| 1660308996295 | [2022/08/12 12:56:36] [ warn] [engine] failed to flush chunk '1-1660308985.805729984.flb', retry in 15 seconds: task_id=0, input=forward.3 > output=loki.1 (out_id=1)  |
| 1660308999947 | [2022/08/12 12:56:39] [debug] [input chunk] update output instances with new chunk size diff=657                                                                       |
| 1660308999978 | [2022/08/12 12:56:39] [debug] [input chunk] update output instances with new chunk size diff=651                                                                       |
| 1660309000289 | [2022/08/12 12:56:40] [debug] [task] created task=0x7f35e0a4a4a0 id=2 OK                                                                                               |
| 1660309011289 | [2022/08/12 12:56:51] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0                                                                                     |
| 1660309011300 | [2022/08/12 12:56:51] [error] [tls] error: unexpected EOF                                                                                                              |
| 1660309011300 | [2022/08/12 12:56:51] [debug] [upstream] connection #75 failed to loki-qa.elimuinformatics.com:443                                                                     |
| 1660309011301 | [2022/08/12 12:56:51] [error] [output:loki:loki.1] no upstream connections available                                                                                   |
| 1660309011301 | [2022/08/12 12:56:51] [debug] [out flush] cb_destroy coro_id=328                                                                                                       |
| 1660309011301 | [2022/08/12 12:56:51] [debug] [retry] re-using retry for task_id=0 attempts=3                                                                                          |
| 1660309011301 | [2022/08/12 12:56:51] [ warn] [engine] failed to flush chunk '1-1660308985.805729984.flb', retry in 11 seconds: task_id=0, input=forward.3 > output=loki.1 (out_id=1)  |
| 1660309016156 | [2022/08/12 12:56:56] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309016364 | [2022/08/12 12:56:56] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309020289 | [2022/08/12 12:57:00] [debug] [task] created task=0x7f35e0a4a510 id=3 OK                                                                                               |
| 1660309022289 | [2022/08/12 12:57:02] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0                                                                                     |
| 1660309022299 | [2022/08/12 12:57:02] [error] [tls] error: unexpected EOF                                                                                                              |
| 1660309022299 | [2022/08/12 12:57:02] [debug] [upstream] connection #75 failed to loki-qa.elimuinformatics.com:443                                                                     |
| 1660309022299 | [2022/08/12 12:57:02] [error] [output:loki:loki.1] no upstream connections available                                                                                   |
| 1660309022299 | [2022/08/12 12:57:02] [debug] [out flush] cb_destroy coro_id=329                                                                                                       |
| 1660309022299 | [2022/08/12 12:57:02] [debug] [retry] re-using retry for task_id=0 attempts=4                                                                                          |
| 1660309022299 | [2022/08/12 12:57:02] [ warn] [engine] failed to flush chunk '1-1660308985.805729984.flb', retry in 66 seconds: task_id=0, input=forward.3 > output=loki.1 (out_id=1)  |
| 1660309022809 | [2022/08/12 12:57:02] [debug] [input chunk] update output instances with new chunk size diff=477                                                                       |
| 1660309025290 | [2022/08/12 12:57:05] [debug] [task] created task=0x7f35e0a4a580 id=4 OK                                                                                               |
| 1660309046143 | [2022/08/12 12:57:26] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309046357 | [2022/08/12 12:57:26] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309050289 | [2022/08/12 12:57:30] [debug] [task] created task=0x7f35e0a4a5f0 id=5 OK                                                                                               |
| 1660309053057 | [2022/08/12 12:57:33] [debug] [input chunk] update output instances with new chunk size diff=477                                                                       |
| 1660309055289 | [2022/08/12 12:57:35] [debug] [task] created task=0x7f35e0a4a660 id=6 OK                                                                                               |
| 1660309076181 | [2022/08/12 12:57:56] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309076364 | [2022/08/12 12:57:56] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309080289 | [2022/08/12 12:58:00] [debug] [task] created task=0x7f35e0a4a6d0 id=7 OK                                                                                               |
| 1660309080963 | [2022/08/12 12:58:00] [debug] [input chunk] update output instances with new chunk size diff=546                                                                       |
| 1660309082603 | [2022/08/12 12:58:02] [debug] [input chunk] update output instances with new chunk size diff=651                                                                       |
| 1660309083323 | [2022/08/12 12:58:03] [debug] [input chunk] update output instances with new chunk size diff=477                                                                       |
| 1660309085289 | [2022/08/12 12:58:05] [debug] [task] created task=0x7f35e0a4a740 id=8 OK                                                                                               |
| 1660309088289 | [2022/08/12 12:58:08] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0                                                                                     |
| 1660309088298 | [2022/08/12 12:58:08] [error] [tls] error: unexpected EOF                                                                                                              |
| 1660309088298 | [2022/08/12 12:58:08] [debug] [upstream] connection #75 failed to loki-qa.elimuinformatics.com:443                                                                     |
| 1660309088298 | [2022/08/12 12:58:08] [error] [output:loki:loki.1] no upstream connections available                                                                                   |
| 1660309088298 | [2022/08/12 12:58:08] [debug] [out flush] cb_destroy coro_id=330                                                                                                       |
| 1660309088298 | [2022/08/12 12:58:08] [debug] [retry] re-using retry for task_id=0 attempts=5                                                                                          |
| 1660309088298 | [2022/08/12 12:58:08] [ warn] [engine] failed to flush chunk '1-1660308985.805729984.flb', retry in 131 seconds: task_id=0, input=forward.3 > output=loki.1 (out_id=1) |
| 1660309088644 | [2022/08/12 12:58:08] [debug] [input chunk] update output instances with new chunk size diff=972                                                                       |
| 1660309090290 | [2022/08/12 12:58:10] [debug] [task] created task=0x7f35e0a4a7b0 id=9 OK                                                                                               |
| 1660309106169 | [2022/08/12 12:58:26] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309106381 | [2022/08/12 12:58:26] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309110289 | [2022/08/12 12:58:30] [debug] [task] created task=0x7f35e0a4a820 id=10 OK                                                                                              |
| 1660309113599 | [2022/08/12 12:58:33] [debug] [input chunk] update output instances with new chunk size diff=477                                                                       |
| 1660309115289 | [2022/08/12 12:58:35] [debug] [task] created task=0x7f35e0a4a890 id=11 OK                                                                                              |
| 1660309136181 | [2022/08/12 12:58:56] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309136423 | [2022/08/12 12:58:56] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309140289 | [2022/08/12 12:59:00] [debug] [task] created task=0x7f35e0a4a900 id=12 OK                                                                                              |
| 1660309144188 | [2022/08/12 12:59:04] [debug] [input chunk] update output instances with new chunk size diff=477                                                                       |
| 1660309145290 | [2022/08/12 12:59:05] [debug] [task] created task=0x7f35e0a4a970 id=13 OK                                                                                              |
| 1660309166174 | [2022/08/12 12:59:26] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309166398 | [2022/08/12 12:59:26] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309170289 | [2022/08/12 12:59:30] [debug] [task] created task=0x7f35e0a4a9e0 id=14 OK                                                                                              |
| 1660309174493 | [2022/08/12 12:59:34] [debug] [input chunk] update output instances with new chunk size diff=477                                                                       |
| 1660309175290 | [2022/08/12 12:59:35] [debug] [task] created task=0x7f35e0a4aa50 id=15 OK                                                                                              |
| 1660309196186 | [2022/08/12 12:59:56] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309196409 | [2022/08/12 12:59:56] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309200289 | [2022/08/12 13:00:00] [debug] [task] created task=0x7f35e0a4aac0 id=16 OK                                                                                              |
| 1660309204764 | [2022/08/12 13:00:04] [debug] [input chunk] update output instances with new chunk size diff=477                                                                       |
| 1660309205289 | [2022/08/12 13:00:05] [debug] [task] created task=0x7f35e0a4ab30 id=17 OK                                                                                              |
| 1660309219293 | [2022/08/12 13:00:19] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0                                                                                     |
| 1660309219298 | [2022/08/12 13:00:19] [error] [tls] error: unexpected EOF                                                                                                              |
| 1660309219298 | [2022/08/12 13:00:19] [debug] [upstream] connection #75 failed to loki-qa.elimuinformatics.com:443                                                                     |
| 1660309219298 | [2022/08/12 13:00:19] [error] [output:loki:loki.1] no upstream connections available                                                                                   |
| 1660309219298 | [2022/08/12 13:00:19] [debug] [out flush] cb_destroy coro_id=331                                                                                                       |
| 1660309219298 | [2022/08/12 13:00:19] [debug] [task] task_id=0 reached retry-attempts limit 5/5                                                                                        |
| 1660309219298 | [2022/08/12 13:00:19] [ warn] [engine] chunk '1-1660308985.805729984.flb' cannot be retried: task_id=0, input=forward.3 > output=loki.1                                |
| 1660309219298 | [2022/08/12 13:00:19] [debug] [task] destroy task=0x7f35e0a4a200 (task_id=0)                                                                                           |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=1 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=2 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=3 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=4 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=5 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=6 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=7 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=8 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=9 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=10 assigned to thread #0                                                                                    |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=11 assigned to thread #0                                                                                    |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=12 assigned to thread #0                                                                                    |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=13 assigned to thread #0                                                                                    |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=14 assigned to thread #0                                                                                    |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=15 assigned to thread #0                                                                                    |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=16 assigned to thread #0                                                                                    |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=17 assigned to thread #0                                                                                    |
| 1660309220465 | AWS for Fluent Bit Container Image Version 2.27.0                                                                                                                      |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Fluent Bit Version Info

We are running the current stable version.

Cluster Details

ECS cluster using EC2 (not Fargate)
Load balancer
Sidecar deployment

PettitWesley commented 2 years ago

[INPUT] Name forward unix_path /var/run/fluent.sock Mem_Buf_Limit 2MB

Just checking- so you are using this blog/example? https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/oomkill-prevention

Also are any logs successfully making it to Loki? Here it mostly looks like every request fails? This seems to be mainly failing network. Can you curl the loki endpoint from an instance inside the same subnet?

Since you indicated its crashing/stopping unexpectedly, this is the technique for getting a stack trace so we can take it to upstream and fix it: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#segfaults-and-crashes-sigsegv

hankwallace commented 2 years ago

Yes, I looked at that blog/example along with many (many, many) others.

Yes, some logs successfully are making it to Loki. In the example I pasted above the service was sending logs for about 90 minutes before it failed. Curling the loki endpoint still works. That service is one of > 10 on the same subnet and all of the others continued working even after this one failed. The time it takes to fail and the service that it fails on is not repeatable.

I'll see what I can do re: getting the stack trace.

hankwallace commented 2 years ago

I'll post more of the log when it crashes/exits, but on startup I see a bunch of "invalid read" and "invalid write" now.

-----------------------------------------------------------------------------------------------------------------------------------
|   timestamp   |                                                     message                                                     |
|---------------|-----------------------------------------------------------------------------------------------------------------|
| 1660676815625 | ==1== Memcheck, a memory error detector                                                                         |
| 1660676815625 | ==1== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.                                           |
| 1660676815625 | ==1== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info                                        |
| 1660676815625 | ==1== Command: /fluent-bit/bin/fluent-bit -c /fluent-bit/etc/fluent-bit.conf                                    |
| 1660676815625 | ==1==                                                                                                           |
| 1660676816903 | [1mFluent Bit v1.8.8[0m                                                                                       |
| 1660676816906 | * [1m[93mCopyright (C) 2019-2021 The Fluent Bit Authors[0m                                                   |
| 1660676816908 | * [1m[93mCopyright (C) 2015-2018 Treasure Data[0m                                                            |
| 1660676816908 | * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd                                                |
| 1660676816908 | * https://fluentbit.io                                                                                          |
| 1660676817077 | [2022/08/16 19:06:57] [ info] Configuration:                                                                    |
| 1660676817092 | [2022/08/16 19:06:57] [ info]  flush time     | 5.000000 seconds                                                |
| 1660676817093 | [2022/08/16 19:06:57] [ info]  grace          | 30 seconds                                                      |
| 1660676817093 | [2022/08/16 19:06:57] [ info]  daemon         | 0                                                               |
| 1660676817094 | [2022/08/16 19:06:57] [ info] ___________                                                                       |
| 1660676817095 | [2022/08/16 19:06:57] [ info]  inputs:                                                                          |
| 1660676817096 | [2022/08/16 19:06:57] [ info]      forward                                                                      |
| 1660676817096 | [2022/08/16 19:06:57] [ info]      forward                                                                      |
| 1660676817096 | [2022/08/16 19:06:57] [ info]      tcp                                                                          |
| 1660676817096 | [2022/08/16 19:06:57] [ info]      forward                                                                      |
| 1660676817097 | [2022/08/16 19:06:57] [ info] ___________                                                                       |
| 1660676817098 | [2022/08/16 19:06:57] [ info]  filters:                                                                         |
| 1660676817099 | [2022/08/16 19:06:57] [ info]      record_modifier.0                                                            |
| 1660676817101 | [2022/08/16 19:06:57] [ info] ___________                                                                       |
| 1660676817101 | [2022/08/16 19:06:57] [ info]  outputs:                                                                         |
| 1660676817101 | [2022/08/16 19:06:57] [ info]      null.0                                                                       |
| 1660676817101 | [2022/08/16 19:06:57] [ info]      loki.1                                                                       |
| 1660676817102 | [2022/08/16 19:06:57] [ info] ___________                                                                       |
| 1660676817102 | [2022/08/16 19:06:57] [ info]  collectors:                                                                      |
| 1660676817255 | [2022/08/16 19:06:57] [ info] [engine] started (pid=1)                                                          |
| 1660676817257 | [2022/08/16 19:06:57] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)                                |
| 1660676817259 | [2022/08/16 19:06:57] [debug] [storage] [cio stream] new stream registered: forward.0                           |
| 1660676817318 | [2022/08/16 19:06:57] [debug] [storage] [cio stream] new stream registered: forward.1                           |
| 1660676817319 | [2022/08/16 19:06:57] [debug] [storage] [cio stream] new stream registered: tcp.2                               |
| 1660676817340 | [2022/08/16 19:06:57] [debug] [storage] [cio stream] new stream registered: forward.3                           |
| 1660676817459 | [2022/08/16 19:06:57] [ info] [storage] version=1.1.4, initializing...                                          |
| 1660676817523 | [2022/08/16 19:06:57] [ info] [storage] in-memory                                                               |
| 1660676817524 | [2022/08/16 19:06:57] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128       |
| 1660676817524 | [2022/08/16 19:06:57] [ info] [cmetrics] version=0.2.2                                                          |
| 1660676817524 | [2022/08/16 19:06:57] [ info] [input:forward:forward.0] listening on unix:///var/run/fluent.sock                |
| 1660676817524 | [2022/08/16 19:06:57] [debug] [in_fw] Listen='0.0.0.0' TCP_Port=24224                                           |
| 1660676817524 | [2022/08/16 19:06:57] [ info] [input:forward:forward.1] listening on 0.0.0.0:24224                              |
| 1660676817524 | [2022/08/16 19:06:57] [ info] [input:tcp:tcp.2] listening on 127.0.0.1:8877                                     |
| 1660676817524 | [2022/08/16 19:06:57] [ info] [input:forward:forward.3] listening on unix:///var/run/fluent.sock                |
| 1660676817524 | [2022/08/16 19:06:57] [debug] [null:null.0] created event channels: read=23 write=24                            |
| 1660676817525 | [2022/08/16 19:06:57] [debug] [loki:loki.1] created event channels: read=25 write=26                            |
| 1660676818223 | [2022/08/16 19:06:58] [debug] [output:loki:loki.1] remove_mpa size: 5                                           |
| 1660676818285 | [2022/08/16 19:06:58] [ info] [output:loki:loki.1] configured, hostname=loki-qa.elimuinformatics.com:443        |
| 1660676818320 | [2022/08/16 19:06:58] [debug] [router] match rule tcp.2:null.0                                                  |
| 1660676818320 | [2022/08/16 19:06:58] [ info] [output:loki:loki.1] worker #0 started                                            |
| 1660676818529 | [2022/08/16 19:06:58] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020                                  |
| 1660676818529 | [2022/08/16 19:06:58] [ info] [sp] stream processor started                                                     |
| 1660676825921 | ==1== Warning: client switching stacks?  SP change: 0xdffa378 --> 0xb514e50                                     |
| 1660676825921 | ==1==          to suppress, use: --max-stackframe=44979496 or greater                                           |
| 1660676825922 | ==1== Warning: client switching stacks?  SP change: 0xb514db8 --> 0xdffa378                                     |
| 1660676825922 | ==1==          to suppress, use: --max-stackframe=44979648 or greater                                           |
| 1660676825923 | ==1== Warning: client switching stacks?  SP change: 0xdffa428 --> 0xb514db8                                     |
| 1660676825923 | ==1==          to suppress, use: --max-stackframe=44979824 or greater                                           |
| 1660676825923 | ==1==          further instances of this message will not be shown.                                             |
| 1660676852938 | [2022/08/16 19:07:32] [debug] [task] created task=0xb7a68b0 id=0 OK                                             |
| 1660676852988 | [2022/08/16 19:07:32] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0                              |
| 1660676854071 | [2022/08/16 19:07:34] [debug] [http_client] not using http_proxy for header                                     |
| 1660676854144 | [2022/08/16 19:07:34] [debug] [output:loki:loki.1] loki-qa.elimuinformatics.com:443, HTTP status=204            |
| 1660676854144 | [2022/08/16 19:07:34] [debug] [upstream] KA connection #50 to loki-qa.elimuinformatics.com:443 is now available |
| 1660676854144 | [2022/08/16 19:07:34] [debug] [task] destroy task=0xb7a68b0 (task_id=0)                                         |
| 1660676854145 | [2022/08/16 19:07:34] [debug] [out coro] cb_destroy coro_id=0                                                   |
| 1660676856800 | ==1== Thread 7 monkey: wrk/0:                                                                                   |
| 1660676856800 | ==1== Invalid write of size 8                                                                                   |
| 1660676856800 | ==1==    at 0x9D16E7: co_swap_function (in /fluent-bit/bin/fluent-bit)                                          |
| 1660676856800 | ==1==  Address 0xe0054f8 is in a rw- anonymous segment                                                          |
| 1660676856800 | ==1==                                                                                                           |
| 1660676856801 | ==1== Invalid write of size 8                                                                                   |
| 1660676856801 | ==1==    at 0x9D16EB: co_swap_function (in /fluent-bit/bin/fluent-bit)                                          |
| 1660676856801 | ==1==  Address 0xe005500 is in a rw- anonymous segment                                                          |
| 1660676856801 | ==1==                                                                                                           |
| 1660676856801 | ==1== Invalid write of size 8                                                                                   |
| 1660676856801 | ==1==    at 0x9D16EF: co_swap_function (in /fluent-bit/bin/fluent-bit)                                          |
| 1660676856801 | ==1==  Address 0xe005508 is in a rw- anonymous segment                                                          |
| 1660676856801 | ==1==                                                                                                           |
| 1660676856802 | ==1== Invalid write of size 8                                                                                   |
| 1660676856802 | ==1==    at 0x9D16F3: co_swap_function (in /fluent-bit/bin/fluent-bit)                                          |
| 1660676856802 | ==1==  Address 0xe005510 is in a rw- anonymous segment                                                          |
| 1660676856802 | ==1==                                                                                                           |
| 1660676856802 | ==1== Invalid write of size 8                                                                                   |
| 1660676856802 | ==1==    at 0x9D16F7: co_swap_function (in /fluent-bit/bin/fluent-bit)                                          |
| 1660676856802 | ==1==  Address 0xe005518 is in a rw- anonymous segment                                                          |
| 1660676856802 | ==1==                                                                                                           |
| 1660676856802 | ==1== Invalid write of size 8                                                                                   |
| 1660676856802 | ==1==    at 0x9D16FB: co_swap_function (in /fluent-bit/bin/fluent-bit)                                          |
| 1660676856802 | ==1==  Address 0xe005520 is in a rw- anonymous segment                                                          |
| 1660676856802 | ==1==                                                                                                           |
| 1660676856803 | ==1== Invalid read of size 8                                                                                    |
| 1660676856803 | ==1==    at 0x9D16FF: co_swap_function (in /fluent-bit/bin/fluent-bit)                                          |
| 1660676856803 | ==1==  Address 0xe17f618 is 8 bytes inside a block of size 25,088 alloc'd                                       |
| 1660676856803 | ==1==    at 0x4C2D065: malloc (vg_replace_malloc.c:381)                                                         |
| 1660676856803 | ==1==    by 0x9D2800: co_create (amd64.c:142)                                                                   |
| 1660676856803 | ==1==    by 0x9C76CD: mk_http_thread_create (mk_http_thread.c:217)                                              |
| 1660676856803 | ==1==    by 0x9C3B31: mk_http_init (mk_http.c:748)                                                              |
| 1660676856803 | ==1==    by 0x9C29FF: mk_http_request_prepare (mk_http.c:232)                                                   |
| 1660676856803 | ==1==    by 0x9C5773: mk_http_sched_read (mk_http.c:1568)                                                       |
| 1660676856803 | ==1==    by 0x9C1647: mk_sched_event_read (mk_scheduler.c:693)                                                  |
| 1660676856803 | ==1==    by 0x9C9D32: mk_server_worker_loop (mk_server.c:487)                                                   |
| 1660676856803 | ==1==    by 0x9C100F: mk_sched_launch_worker_loop (mk_scheduler.c:416)                                          |
| 1660676856803 | ==1==    by 0x4E4444A: start_thread (in /usr/lib64/libpthread-2.26.so)                                          |
| 1660676856803 | ==1==    by 0x686C56E: clone (in /usr/lib64/libc-2.26.so)                                                       |
| 1660676856803 | ==1==                                                                                                           |
| 1660676856803 | ==1== Invalid read of size 8                                                                                    |
| 1660676856803 | ==1==    at 0x9D1703: co_swap_function (in /fluent-bit/bin/fluent-bit)                                          |
| 1660676856803 | ==1==  Address 0xe17f620 is 16 bytes inside a block of size 25,088 alloc'd                                      |
| 1660676856803 | ==1==    at 0x4C2D065: malloc (vg_replace_malloc.c:381)                                                         |
| 1660676856803 | ==1==    by 0x9D2800: co_create (amd64.c:142)                                                                   |
| 1660676856803 | ==1==    by 0x9C76CD: mk_http_thread_create (mk_http_thread.c:217)                                              |
| 1660676856803 | ==1==    by 0x9C3B31: mk_http_init (mk_http.c:748)                                                              |
| 1660676856803 | ==1==    by 0x9C29FF: mk_http_request_prepare (mk_http.c:232)                                                   |
| 1660676856803 | ==1==    by 0x9C5773: mk_http_sched_read (mk_http.c:1568)                                                       |
| 1660676856803 | ==1==    by 0x9C1647: mk_sched_event_read (mk_scheduler.c:693)                                                  |
| 1660676856803 | ==1==    by 0x9C9D32: mk_server_worker_loop (mk_server.c:487)                                                   |
| 1660676856803 | ==1==    by 0x9C100F: mk_sched_launch_worker_loop (mk_scheduler.c:416)                                          |
| 1660676856803 | ==1==    by 0x4E4444A: start_thread (in /usr/lib64/libpthread-2.26.so)                                          |
| 1660676856803 | ==1==    by 0x686C56E: clone (in /usr/lib64/libc-2.26.so)                                                       |
| 1660676856803 | ==1==                                                                                                           |
| 1660676856804 | ==1== Invalid read of size 8                                                                                    |
| 1660676856804 | ==1==    at 0x9D1707: co_swap_function (in /fluent-bit/bin/fluent-bit)                                          |
| 1660676856804 | ==1==  Address 0xe17f628 is 24 bytes inside a block of size 25,088 alloc'd                                      |
| 1660676856804 | ==1==    at 0x4C2D065: malloc (vg_replace_malloc.c:381)                                                         |
| 1660676856804 | ==1==    by 0x9D2800: co_create (amd64.c:142)                                                                   |
| 1660676856804 | ==1==    by 0x9C76CD: mk_http_thread_create (mk_http_thread.c:217)                                              |
| 1660676856804 | ==1==    by 0x9C3B31: mk_http_init (mk_http.c:748)                                                              |
| 1660676856804 | ==1==    by 0x9C29FF: mk_http_request_prepare (mk_http.c:232)                                                   |
| 1660676856804 | ==1==    by 0x9C5773: mk_http_sched_read (mk_http.c:1568)                                                       |
| 1660676856804 | ==1==    by 0x9C1647: mk_sched_event_read (mk_scheduler.c:693)                                                  |
| 1660676856804 | ==1==    by 0x9C9D32: mk_server_worker_loop (mk_server.c:487)                                                   |
| 1660676856804 | ==1==    by 0x9C100F: mk_sched_launch_worker_loop (mk_scheduler.c:416)                                          |
| 1660676856804 | ==1==    by 0x4E4444A: start_thread (in /usr/lib64/libpthread-2.26.so)                                          |
| 1660676856804 | ==1==    by 0x686C56E: clone (in /usr/lib64/libc-2.26.so)                                                       |
| 1660676856804 | ==1==                                                                                                           |
| 1660676856804 | ==1== Invalid read of size 8                                                                                    |
| 1660676856804 | ==1==    at 0x9D170B: co_swap_function (in /fluent-bit/bin/fluent-bit)                                          |
| 1660676856804 | ==1==  Address 0xe17f630 is 32 bytes inside a block of size 25,088 alloc'd                                      |
| 1660676856804 | ==1==    at 0x4C2D065: malloc (vg_replace_malloc.c:381)                                                         |
| 1660676856804 | ==1==    by 0x9D2800: co_create (amd64.c:142)                                                                   |
| 1660676856804 | ==1==    by 0x9C76CD: mk_http_thread_create (mk_http_thread.c:217)                                              |
| 1660676856804 | ==1==    by 0x9C3B31: mk_http_init (mk_http.c:748)                                                              |
| 1660676856804 | ==1==    by 0x9C29FF: mk_http_request_prepare (mk_http.c:232)                                                   |
| 1660676856804 | ==1==    by 0x9C5773: mk_http_sched_read (mk_http.c:1568)                                                       |
| 1660676856804 | ==1==    by 0x9C1647: mk_sched_event_read (mk_scheduler.c:693)                                                  |
| 1660676856804 | ==1==    by 0x9C9D32: mk_server_worker_loop (mk_server.c:487)                                                   |
| 1660676856804 | ==1==    by 0x9C100F: mk_sched_launch_worker_loop (mk_scheduler.c:416)                                          |
| 1660676856804 | ==1==    by 0x4E4444A: start_thread (in /usr/lib64/libpthread-2.26.so)                                          |
| 1660676856804 | ==1==    by 0x686C56E: clone (in /usr/lib64/libc-2.26.so)                                                       |
| 1660676856804 | ==1==                                                                                                           |
| 1660676856804 | ==1== Invalid read of size 8                                                                                    |
| 1660676856804 | ==1==    at 0x9D170F: co_swap_function (in /fluent-bit/bin/fluent-bit)                                          |
| 1660676856804 | ==1==  Address 0xe17f638 is 40 bytes inside a block of size 25,088 alloc'd                                      |
| 1660676856804 | ==1==    at 0x4C2D065: malloc (vg_replace_malloc.c:381)                                                         |
| 1660676856804 | ==1==    by 0x9D2800: co_create (amd64.c:142)                                                                   |
| 1660676856804 | ==1==    by 0x9C76CD: mk_http_thread_create (mk_http_thread.c:217)                                              |
| 1660676856804 | ==1==    by 0x9C3B31: mk_http_init (mk_http.c:748)                                                              |
| 1660676856804 | ==1==    by 0x9C29FF: mk_http_request_prepare (mk_http.c:232)                                                   |
| 1660676856804 | ==1==    by 0x9C5773: mk_http_sched_read (mk_http.c:1568)                                                       |
| 1660676856804 | ==1==    by 0x9C1647: mk_sched_event_read (mk_scheduler.c:693)                                                  |
| 1660676856804 | ==1==    by 0x9C9D32: mk_server_worker_loop (mk_server.c:487)                                                   |
| 1660676856804 | ==1==    by 0x9C100F: mk_sched_launch_worker_loop (mk_scheduler.c:416)                                          |
| 1660676856804 | ==1==    by 0x4E4444A: start_thread (in /usr/lib64/libpthread-2.26.so)                                          |
| 1660676856804 | ==1==    by 0x686C56E: clone (in /usr/lib64/libc-2.26.so)                                                       |
| 1660676856804 | ==1==                                                                                                           |
| 1660676856805 | ==1== Invalid read of size 8                                                                                    |
| 1660676856805 | ==1==    at 0x9D1713: co_swap_function (in /fluent-bit/bin/fluent-bit)                                          |
| 1660676856805 | ==1==  Address 0xe17f640 is 48 bytes inside a block of size 25,088 alloc'd                                      |
| 1660676856805 | ==1==    at 0x4C2D065: malloc (vg_replace_malloc.c:381)                                                         |
| 1660676856805 | ==1==    by 0x9D2800: co_create (amd64.c:142)                                                                   |
| 1660676856805 | ==1==    by 0x9C76CD: mk_http_thread_create (mk_http_thread.c:217)                                              |
| 1660676856805 | ==1==    by 0x9C3B31: mk_http_init (mk_http.c:748)                                                              |
| 1660676856805 | ==1==    by 0x9C29FF: mk_http_request_prepare (mk_http.c:232)                                                   |
| 1660676856805 | ==1==    by 0x9C5773: mk_http_sched_read (mk_http.c:1568)                                                       |
| 1660676856805 | ==1==    by 0x9C1647: mk_sched_event_read (mk_scheduler.c:693)                                                  |
| 1660676856805 | ==1==    by 0x9C9D32: mk_server_worker_loop (mk_server.c:487)                                                   |
| 1660676856805 | ==1==    by 0x9C100F: mk_sched_launch_worker_loop (mk_scheduler.c:416)                                          |
| 1660676856805 | ==1==    by 0x4E4444A: start_thread (in /usr/lib64/libpthread-2.26.so)                                          |
| 1660676856805 | ==1==    by 0x686C56E: clone (in /usr/lib64/libc-2.26.so)                                                       |
| 1660676856805 | ==1==                                                                                                           |
| 1660676856805 | ==1== Invalid read of size 8                                                                                    |
| 1660676856805 | ==1==    at 0x9C7319: thread_get_libco_params (mk_http_thread.c:54)                                             |
| 1660676856805 | ==1==  Address 0xe005700 is in a rw- anonymous segment                                                          |
| 1660676856805 | ==1==                                                                                                           |
| 1660676856806 | ==1== Invalid read of size 8                                                                                    |
| 1660676856806 | ==1==    at 0x9C7329: thread_get_libco_params (mk_http_thread.c:57)                                             |
| 1660676856806 | ==1==  Address 0xe0054b8 is in a rw- anonymous segment                                                          |
| 1660676856806 | ==1==                                                                                                           |
| 1660676856806 | ==1== Invalid read of size 8                                                                                    |
| 1660676856806 | ==1==    at 0x9C73F9: thread_cb_init_vars (mk_http_thread.c:108)                                                |
| 1660676856806 | ==1==  Address 0xe17f580 is 0 bytes inside a block of size 80 alloc'd                                           |
| 1660676856806 | ==1==    at 0x4C2D065: malloc (vg_replace_malloc.c:381)                                                         |
| 1660676856806 | ==1==    by 0x9C716F: mk_mem_alloc (mk_memory.h:53)                                                             |
| 1660676856806 | ==1==    by 0x9C75EA: mk_thread_new (mk_thread_libco.h:108)                                                     |
| 1660676856806 | ==1==    by 0x9C75EA: mk_http_thread_create (mk_http_thread.c:199)                                              |
| 1660676856806 | ==1==    by 0x9C3B31: mk_http_init (mk_http.c:748)                                                              |
| 1660676856806 | ==1==    by 0x9C29FF: mk_http_request_prepare (mk_http.c:232)                                                   |
| 1660676856806 | ==1==    by 0x9C5773: mk_http_sched_read (mk_http.c:1568)                                                       |
| 1660676856806 | ==1==    by 0x9C1647: mk_sched_event_read (mk_scheduler.c:693)                                                  |
| 1660676856806 | ==1==    by 0x9C9D32: mk_server_worker_loop (mk_server.c:487)                                                   |
| 1660676856806 | ==1==    by 0x9C100F: mk_sched_launch_worker_loop (mk_scheduler.c:416)                                          |
| 1660676856806 | ==1==    by 0x4E4444A: start_thread (in /usr/lib64/libpthread-2.26.so)                                          |
| 1660676856806 | ==1==    by 0x686C56E: clone (in /usr/lib64/libc-2.26.so)                                                       |
| 1660676856806 | ==1==                                                                                                           |
| 1660676856807 | ==1== Invalid read of size 8                                                                                    |
| 1660676856807 | ==1==    at 0x9D287D: co_switch (amd64.c:156)                                                                   |
| 1660676856807 | ==1==  Address 0xe005700 is in a rw- anonymous segment                                                          |
| 1660676856807 | ==1==                                                                                                           |
| 1660676856807 | ==1== Invalid read of size 8                                                                                    |
| 1660676856807 | ==1==    at 0x9D288D: co_switch (amd64.c:157)                                                                   |
| 1660676856807 | ==1==  Address 0xe0056f0 is in a rw- anonymous segment                                                          |
| 1660676856807 | ==1==                                                                                                           |
| 1660676856808 | ==1== Invalid read of size 8                                                                                    |
| 1660676856808 | ==1==    at 0x9D2897: co_switch (amd64.c:158)                                                                   |
| 1660676856808 | ==1==  Address 0xe005700 is in a rw- anonymous segment                                                          |
| 1660676856808 | ==1==                                                                                                           |
| 1660676856809 | ==1== Invalid write of size 8                                                                                   |
| 1660676856809 | ==1==    at 0x9D28AE: co_switch (amd64.c:158)                                                                   |
| 1660676856809 | ==1==  Address 0xe0056f0 is in a rw- anonymous segment                                                          |
| 1660676856809 | ==1==                                                                                                           |
| 1660676856809 | ==1== Invalid read of size 8                                                                                    |
| 1660676856809 | ==1==    at 0x9D28B1: co_switch (amd64.c:158)                                                                   |
| 1660676856809 | ==1==  Address 0xe005700 is in a rw- anonymous segment                                                          |
| 1660676856809 | ==1==                                                                                                           |
| 1660676856810 | ==1== Invalid read of size 8                                                                                    |
| 1660676856810 | ==1==    at 0x9D28C1: co_switch (amd64.c:158)                                                                   |
| 1660676856810 | ==1==  Address 0xe0056f0 is in a rw- anonymous segment                                                          |
| 1660676856810 | ==1==                                                                                                           |
| 1660676856811 | ==1== Invalid write of size 8                                                                                   |
| 1660676856811 | ==1==    at 0x9D16E0: co_swap_function (in /fluent-bit/bin/fluent-bit)                                          |
| 1660676856811 | ==1==  Address 0xe17f610 is 0 bytes inside a block of size 25,088 alloc'd                                       |
| 1660676856811 | ==1==    at 0x4C2D065: malloc (vg_replace_malloc.c:381)                                                         |
| 1660676856811 | ==1==    by 0x9D2800: co_create (amd64.c:142)                                                                   |
| 1660676856811 | ==1==    by 0x9C76CD: mk_http_thread_create (mk_http_thread.c:217)                                              |
| 1660676856811 | ==1==    by 0x9C3B31: mk_http_init (mk_http.c:748)                                                              |
| 1660676856811 | ==1==    by 0x9C29FF: mk_http_request_prepare (mk_http.c:232)                                                   |
| 1660676856811 | ==1==    by 0x9C5773: mk_http_sched_read (mk_http.c:1568)                                                       |
| 1660676856811 | ==1==    by 0x9C1647: mk_sched_event_read (mk_scheduler.c:693)                                                  |
| 1660676856811 | ==1==    by 0x9C9D32: mk_server_worker_loop (mk_server.c:487)                                                   |
| 1660676856811 | ==1==    by 0x9C100F: mk_sched_launch_worker_loop (mk_scheduler.c:416)                                          |
| 1660676856811 | ==1==    by 0x4E4444A: start_thread (in /usr/lib64/libpthread-2.26.so)                                          |
| 1660676856811 | ==1==    by 0x686C56E: clone (in /usr/lib64/libc-2.26.so)                                                       |
-----------------------------------------------------------------------------------------------------------------------------------

hankwallace commented 2 years ago

I just noticed that the example to add valgrind had an old version of aws-for-fluent-bit - version 2.21.0. The log above was from running that version. That ran successfully for nearly two days without exiting/crashing. I'm going to try with version 2.28.0 now.

PettitWesley commented 2 years ago

I just noticed that the example to add valgrind had an old version of aws-for-fluent-bit - version 2.21.0.

@hankwallace Where did you see this? Our Dockerfile.debug is always supposed to be on the same version as the prod release: https://github.com/aws/aws-for-fluent-bit/blob/mainline/Dockerfile.debug#L4

hankwallace commented 2 years ago

@PettitWesley I saw it here: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md

Also, I added the --error-limit=no option after seeing valgrind complain that there were so many errors in the log that it wasn't going to show anymore.

hankwallace commented 2 years ago

@PettitWesley there's another change needed to get it to work with 2.28. Want me to submit a PR for it?

PettitWesley commented 2 years ago

@hankwallace Sure

hankwallace commented 2 years ago

@PettitWesley I just submitted https://github.com/aws/aws-for-fluent-bit/pull/413

hankwallace commented 2 years ago

I have not seen a full crash in the last 2 days, but there are still lots of errors in the logs now that I'm using a build with valgrind. Is it possible that valgrind is preventing the crashes? Would uploading a log file just with the errors be helpful?

PettitWesley commented 2 years ago

@hankwallace Please do, more data is usually helpful

hankwallace commented 2 years ago

@PettitWesley Here are two log files. It's very interesting that I haven't seen a crash/exit since running it with valgrind. The error that usually occurs a little while before a crash/exit is this one:

[error] [tls] error: unexpected EOF

That often signals a retry loop until it reaches the retry limit and then shortly afterwards a crash/exit. There is at least one EOF error in each log file, but the retries succeeded in these examples.

firelens-logs.zip

mjmj commented 2 years ago

We're also running into this issue with a similar setup.

Our ECS Fargate tasks would encounter successfully send some logs to Grafana Cloud's Loki before encountering a TLS issue, retry then hit [error] [tls] error: unexpected EOF before exiting and crashing the cluster. I don't have detailed logs since we terraformed destroyed the changes and logs with it to revert during an outage. But I wanted to chime in and provide our configuration. Not all of our tasks using this configuration experience the issue, but this particular one had about 30 workers fwiw. We tried with and without the healthcheck server running with no difference.

Problem happened with Fluentbit v1.9.3 (amazon/aws-for-fluent-bit:stable) and v1.9.7

2022-08-30T12:46:08.885-07:00        [2022/08/30 19:46:08] [error] [tls] error: unexpected EOF
2022-08-30T12:46:08.885-07:00   [2022/08/30 19:46:08] [error] [output:loki:loki.0] no upstream connections available
2022-08-30T12:45:17.884-07:00 [2022/08/30 19:45:17] [ warn] [engine] failed to flush chunk '1-1661888707.868562226.flb', retry in 9 seconds: task_id=0, input=forward.1 > output=loki.0 (out_id=0)

service.conf

[SERVICE]
    Flush ${FLUSH_INTERVAL}

[OUTPUT]
    Name loki
    Match *
    host ${LOKI_HOST}
    port 443
    http_user ${LOKI_USER}
    http_passwd ${SECRET_LOKI_WRITE_APIKEY}
    labels agent=fluent-bit, service={SERVICE}, region=${LOG_REGION}, environment=${ENV_NAME}
    tls on
    tls.verify on
    remove_keys container_id, ecs_task_arn
    label_keys $container_name, $ecs_task_definition, $source, $ecs_cluster
    line_format key_value

[OUTPUT]
    Name datadog
    Host http-intake.logs.datadoghq.com
    Match *
    TLS on
    compress gzip
    provider ecs
    apikey ${SECRET_DD_API_KEY}
    dd_service ${SERVICE_ID}
    dd_source ${DD_SOURCE}
    dd_message_key ${DD_MESSAGE_KEY}
    dd_tags ${DD_TAGS}

task definition:

   {
    "name": "firelens-fluentbit",
    "image": "${firelens_fluentbit_image}:${firelens_fluentbit_tag}",
    "essential": true,
    "cpu": 10,
    "memoryReservation": 256,
    "firelensConfiguration": {
        "type": "fluentbit",
        "options": {
            "config-file-type": "file",
            "config-file-value": "/service.conf",
            "enable-ecs-log-metadata": "${enable_ecs_log_metadata}"
        }
    },
    "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
            "awslogs-region": "${log_region}",
            "awslogs-group": "${firelens_log_group}",
            "awslogs-stream-prefix": "firelens"
        }
    },
    "environment": ${jsonencode([for env, val in envs[
                "observability"
            ] : {
        name  = tostring(env)
        value = tostring(val)
            }
        ])
    },
    "secrets": ${jsonencode([for env, val in secrets[
                "observability"
            ] : {
        name      = tostring(env)
        valueFrom = tostring(val)
            }
        ])
    },
    "mountPoints": [],
    "portMappings": [],
    "user": "0",
    "volumesFrom": [],
    "healthCheck": {
        "command": [ "CMD-SHELL", "pgrep fluent-bit || exit 1" ],
        "interval": 10,
        "retries": 3,
        "timeout": 5
    }
},
{
    "name": "${service}",
    "image": "${image}:${image_tag}",
    "cpu": 0,
    "essential": true,
    "mountPoints": [],
    "portMappings": [],
    "networkMode": "awsvpc",
    "logConfiguration": {
        "logDriver": "awsfirelens"
    },
    "ulimits": [
        {
            "name": "nofile",
            "softLimit": 65536,
            "hardLimit": 65536
        }
    ],
    "environment": ${jsonencode([for env, val in envs[
                "service"
            ] : {
        name  = tostring(env)
        value = tostring(val)
            }
        ])
    },
    "secrets": ${jsonencode([for env, val in secrets[
                "service"
            ] : {
        name      = tostring(env)
        valueFrom = tostring(val)
            }
        ])
    },
    "volumesFrom": []
}

hankwallace commented 1 year ago

@PettitWesley any update on this? I am still running all containers in debug mode to prevent them from failing.

PettitWesley commented 1 year ago

Please see:

https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#how-do-i-tell-if-fluent-bit-is-losing-logs
Recent known high impact issues: https://github.com/aws/aws-for-fluent-bit/issues/542

hankwallace commented 1 year ago

@PettitWesley we are not currently losing logs, but ONLY because we are running the debug configuration via valgrind. The issues mentioned in the known issues don't appear to include the issue here.

wick02 commented 1 year ago

Hey all,

I'm a fellow logging enthusiast wanted to chime in here on the loki issue I see here and wanted to provide a potential solution to a problem I've had working with wanting to send ECS logs to both loki & cloudwatch with firelens.

TL;DR: I created a firelens docker image built from the aws-for-fluent-bit base image, add in the out_grafana_loki.so plugin you've built provided by the grafana/loki team then let the aws-for-fluent-bit handle log streams. This solution was made to handle sending logs to two different endpoints (one for an internal self served loki backend for log consumption, and the other for 1 year log back with cloudwatch in case loki goes down internally)

It appears that @hankwallace has got the aws-for-fluent-bit to have the extra.conf to use buffer & flush which will use the loki output provided by the fluentbit docs

    Mem_Buf_Limit 2MB

[SERVICE]
    Flush 5
    Grace 30

While this can work, I believe @PettitWesley has pushed other ways to push logs by using plugins that are provided by using the plugin .so files, which the grafana team has provided (by creating the plugin after cloning the loki repo & making the .so file)

I got setup from looking at this comment when using the newrelic plugin and creating my own image from the base aws-for-fluent-bit

              ; curl -L -o /fluent-bit/newrelic.so "https://github.com/newrelic/newrelic-fluent-bit-output/releases/download/v1.7.0/out_newrelic-linux-amd64-1.7.0.so"
              ; chmod +x /fluent-bit/newrelic.so
              ; printf "[PLUGINS]\n  Path /fluent-bit/newrelic.so\n" > /fluent-bit/etc/plugins.conf
              ; echo "${FLUENTBIT_CONFIG}" > /opt/fluentbit.conf

I too was trying to find a solution to work because I was able to get cloudwatch logs working properly with aws-for-fluent-bitbut it wasn't working for loki... I was getting logs lost after the initial startup because just using the fluentbit output because I don't believe that the Loki output from the fluent-bit side will use the aws-for-fluent-bit buffer and flush mechanism. I was seeing logs not showing up in loki at all. (I tested this functionality on loki's cloud solution too, it wasn't a problem with my internal loki backend, I can provide additional context to prove this if necessary.)

Then I tried Loki's firelens solution where they create the dynamic plugin library and place it into an image to be used as a firelens image. When I used their proposed image grafana/fluent-bit-plugin-loki:2.7.4-amd64 I wasn't really able to get the cloudwatch endpoint to work, I wasn't given the flexibility to change around the fluent.conf file... It felt like the image was also probably using their own flush/buffer system. I haven't really researched pursuing this angle so my knowledge is limited

I also share @hankwallace 's concern on not being able to find something within the grafana or fluentbit communities, it seemed like there wasn't a lot of tutorial support on how to diagnose or fix an issue like this. But you left a lot of great nuggets in your comments @PettitWesley and I appreciate all AWS blog posts around firelens.

So I wanted to combine the two solutions.

First create the loki plugin from source.

ssh ECSBOX-AMD64-CHIPSET # to make sure whatever ECS system can use that plugin
git clone https://github.com/grafana/loki
cd loki
make fluent-bit-plugin # makefile in there to create the binary for grafana/loki for firelens
exit
scp EC2BOX-AMD64-CHIPSET:/clonepath/loki/clients/cmd/fluent-bit/out_grafana_loki.so .

then I created my own docker image from the init image and loading in the plugin.

FROM public.ecr.aws/aws-observability/aws-for-fluent-bit:init-latest

ADD out_grafana_loki.so /fluent-bit/
ADD fluent-bit.conf /fluent-bit/alt/fluent-bit.conf

CMD ["/fluent-bit/bin/fluent-bit", "-e", "/fluent-bit/cloudwatch.so", "-e", "/fluent-bit/out_grafana_loki.so", "-c", "/fluent-bit/alt/fluent-bit.conf"]

I created my own alt fluentbit conf to be used: (altered to just use samples, I have the task def configured to create the log groups and other components necessary to make this work)

grafana-lokiconfig options can be found here

[INPUT]
    Name forward
    unix_path /var/run/fluent.sock

[SERVICE]
    Grace 30

[OUTPUT]
    Name                cloudwatch
    Match               *
    region              us-west-2
    log_group_name      /ecs/service
    log_key             log
    log_stream_prefix   ecs-
    auto_create_group   true

[Output]
    Name grafana-loki
    Match *
    Url http://LOKIENDPOINT:3100/loki/api/v1/push
    BatchWait 1s
    TenantID tenantsample
    BatchSize 30720
    # (30KiB)
    Labels {job="fluenttest"}
    LineFormat key_value

This how I'm attempting to resolve having two stream endpoints and I hope this might be useful for documentation for firelens that @PettitWesley could use in the future to showcase on how to have firelens send logs to two different endpoints using their own plugins.

I believe using the out_grafana_loki.so plugin might resolve your TLS connection drops @hankwallace and is closer to what the Grafana/Loki team will provide for support for fluentbit while staying with what @PettitWesley & the aws-for-fluent-bit team will provide in the future for ongoing ECS firelens support instead of trying to use the fluentbit output.

I think fluentbit probably does have fixes but it’s all in 2.X.X versions of fluentbit and while firelens is working, it’s still on fluentbit 1.9.X. There’s a lot more to dive into and figure out but I hope this has helped.

wick02 commented 1 year ago

So my work wasn't completed and I didn't fully grasp the init concepts but I was able to get around it and got it to work so I'm making a forked init image. I want all the ECS metadata from the init process but I also want to pass that info to Loki.

So add in the plugin line to the init image: https://github.com/aws/aws-for-fluent-bit/blob/v2.31.4/Dockerfile.init#L28

 FROM amazon/aws-for-fluent-bit:latest
+ADD out_grafana_loki.so /fluent-bit/

 RUN mkdir -p /init

then tell the init process to include that plugin in the base command:

https://github.com/aws/aws-for-fluent-bit/blob/v2.31.4/init/fluent_bit_init_process.go#L32

        // default Fluent Bit command
-       baseCommand = "exec /fluent-bit/bin/fluent-bit -e /fluent-bit/firehose.so -e /fluent-bit/cloudwatch.so -e /fluent-bit/kinesis.so"
+       baseCommand = "exec /fluent-bit/bin/fluent-bit -e /fluent-bit/firehose.so -e /fluent-bit/cloudwatch.so -e /fluent-bit/kinesis.so -e /fluent-bit/out_grafana_loki.so"

        // global s3 client and flag

Sorry to extend this and if you need to move this to a separate issue or if there is a way to include this in a new feature to add in more plugins, please do so, just wanted to get this information to a fellow AWS user who wants to use loki and to see if this can help resolve their issue from a better maintained angle.

PettitWesley commented 1 year ago

@wick02 Yea I think a different issue is needed for this. Also explain in it why the upstream loki doesn't work for you: https://docs.fluentbit.io/manual/pipeline/outputs/loki

hankwallace commented 1 year ago

Any updates on this issue? We are still running the debug container (using valgrind) to limit the crashes/failures.

hankwallace commented 10 months ago

Any updates? How can I help to move this forward? We are still using the debug container because the non-debug one fails intermittently.

aws / aws-for-fluent-bit