fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Fluent-bit in ECS: Domain name not found #9637

Open neartik opened 5 days ago

neartik commented 5 days ago

Bug Report

Describe the bug

When running the image amazon/aws-for-fluent-bit:latest, or any previous stable version, the task boots but cannot reach the Elastic cluster.

To Reproduce

Follow the tutorial from elastic: https://www.elastic.co/blog/elastic-cloud-with-aws-firelens-accelerate-time-to-insight-with-agentless-data-ingestion

For ECS, the task will look like this:

{
  "family": "firelens-fargate-elastic",
  "taskRoleArn": "**redacted**",
  "executionRoleArn": "**redacted**",
  "networkMode": "awsvpc",
  "cpu": "512",
  "memory": "1024",
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "containerDefinitions": [
    {
      "essential": true,
      "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:2.32.4",
      "name": "log_router",
      "firelensConfiguration": {
        "type": "fluentbit",
        "options": {
          "enable-ecs-log-metadata": "true"
        }
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "firelens-container",
          "awslogs-region": "eu-west-1",
          "awslogs-create-group": "true",
          "awslogs-stream-prefix": "firelens"
        }
      },
      "memoryReservation": 50
    },
    {
      "essential": true,
      "image": "nginx",
      "name": "app",
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "secretOptions": [
          {
            "valueFrom": "**redacted**:CLOUD_ID::",
            "name": "Cloud_ID"
          },
          {
            "valueFrom": "**redacted**:CLOUD_AUTH::",
            "name": "Cloud_Auth"
          }
        ],
        "options": {
          "Name": "es",
          "Port": "9243",
          "Tag_Key tags": "tags",
          "Include_Tag_Key": "true",
          "Index": "elastic_firelens",
          "tls": "On",
          "tls.verify": "Off"
        }
      },
      "memoryReservation": 100
    }
  ]
}

When deploying the task, make sure it is reachable through a public IP that routes to the NGINX container. The logs of the log router will show:

24 November 2024 at 00:06 (UTC)
[2024/11/24 00:06:49] [ warn] [net] getaddrinfo(host='**redacted**.eu-west-1.aws.found.io:443', err=4): Domain name not found
[2024/11/24 00:06:49] [ warn] [engine] failed to flush chunk '1-1732406808.778340835.flb', retry in 7 seconds: task_id=0, input=forward.1 > output=es.1 (out_id=1)

Note that redacted.eu-west-1.aws.found.io:443 is accessible from the browser at the time I get this error. If Cloud_ID is edited to remove the port, the logs change and instead suggest that an invalid argument was provided.
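
The host passed to getaddrinfo in the log above still carries ":443", which suggests the port embedded in the Cloud ID is not being separated from the hostname before DNS resolution. Here is a minimal Python sketch of how I would expect a Cloud ID to be split. It assumes the usual layout of a label followed by a base64 string that decodes to "domain[:port]$es_id$kibana_id". The function name, the sample values, and the fallback port (taken from the "Port" option in the task definition) are made up for illustration; this is not Fluent Bit's actual parsing code.

```python
import base64


def decode_cloud_id(cloud_id: str, default_port: int = 9243) -> dict:
    """Split a Cloud ID into label, bare Elasticsearch hostname, and port.

    Assumed layout: "<label>:<base64 of 'domain[:port]$es_id$kibana_id'>".
    """
    label, _, encoded = cloud_id.partition(":")
    decoded = base64.b64decode(encoded).decode("utf-8")
    parts = decoded.split("$")
    domain, es_id = parts[0], parts[1]

    # Peel an optional ":port" off the domain so that only a bare
    # hostname is ever handed to the DNS resolver.
    host_part, sep, port_text = domain.rpartition(":")
    if sep:
        domain, port = host_part, int(port_text)
    else:
        port = default_port

    return {"label": label, "host": f"{es_id}.{domain}", "port": port}


# Made-up Cloud ID, built here so the example is self-contained.
raw = "eu-west-1.aws.found.io:443$abc123$def456"
sample_cloud_id = "my-deployment:" + base64.b64encode(raw.encode()).decode()

print(decode_cloud_id(sample_cloud_id))
```

With the port split off like this, the resolver would only see the hostname and the port would be used for the connection itself, which is the behavior I expected from the es output.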

Expected behavior

The logs should go to Elastic.

Your Environment

Additional context

The goal is to run this tool as a sidecar in each of our 200 tasks and send all of their logs to Elastic. Running a separate agent instead is not manageable.