aws / aws-for-fluent-bit

The source of the amazon/aws-for-fluent-bit container image
Apache License 2.0
450 stars 133 forks source link

Occasional bad log routing and malformed messages get to s3 output #834

Open fizyk opened 3 months ago

fizyk commented 3 months ago
### Describe the question/issue Hi, We have a system, that under stress experiences two issues with routing messages through a fluent bit. We utilise fluent bit to pass logs to datadog, and two kinds of messages to s3. Problem is that under heavy load some messages get occasionally routed in wrong way - dimensions to events and events to dimensions, mixing the two (10-40k correct events and up to 400 of wrongly routed messages), and in the same files that have the wrongly routed messages there are lines with messages put in the same line instead of single messages per line. ### Configuration #### ECS Task ``` { "taskDefinitionArn": "arn:aws:ecs:::task-definition/-ecs-cluster--:39", "containerDefinitions": [ { "name": "-", "image": ".dkr.ecr..amazonaws.com/:1.6.0", "cpu": 0, "memoryReservation": 696, "portMappings": [ { "containerPort": 8090, "hostPort": 8090, "protocol": "tcp" }, { "containerPort": 443, "hostPort": 443, "protocol": "tcp" } ], "essential": true, "environment": [ { "name": "GOMAXPROCS", "value": "1" }, { "name": "DD_METRICS_SAMPLING_RATE", "value": "1.0" }, { "name": "AWS_MAX_IDLE_CONNECTIONS", "value": "512" }, { "name": "AWS_DISABLE_SSL", "value": "true" }, { "name": "LOG_LEVEL", "value": "info" }, { "name": "DD_METRICS_PORT", "value": "8125" }, { "name": "AWS_COGNITO_REFRESH_INTERVAL", "value": "500s" }, { "name": "DD_SERVICE_RETRY_INTERVAL", "value": "10s" }, { "name": "ENABLE_DNSMASQ", "value": "true" }, { "name": "FLUENT_ADDRESS", "value": "localhost:5170" }, { "name": "AWS_DEFAULT_REGION", "value": "" }, { "name": "DD_SERVICE_RETRY_ATTEMPTS", "value": "30" }, { "name": "LOG_ENVIRONMENT", "value": "" }, { "name": "AWS_COGNITO_SCOPES", "value": "/" }, { "name": "DD_AGENT_HOST", "value": "localhost" }, { "name": "FLUENT_PARALLEL_FLUSH_LIMIT", "value": "16" }, { "name": "FLUENT_FLUSH_INTERVAL", "value": "10s" }, { "name": "DD_METRICS_ENABLED", "value": "true" }, { "name": "FLUENT_BATCH_COUNT", "value": "100" } ], "mountPoints": [], "volumesFrom": [], "secrets": [ ... ], "dependsOn": [ { "containerName": "fluentbit", "condition": "HEALTHY" }, { "containerName": "datadog", "condition": "HEALTHY" }, { "containerName": "dnsmasq", "condition": "HEALTHY" } ], "stopTimeout": 30, "user": "0", "logConfiguration": { "logDriver": "awslogs", "options": { "awslogs-group": "/aws/ecs/-", "awslogs-region": "", "awslogs-stream-prefix": "-" } }, "systemControls": [] }, { "name": "datadog", "image": ".dkr.ecr..amazonaws.com/publecr/datadog/agent:7", "cpu": 128, "memoryReservation": 100, "portMappings": [ { "containerPort": 8125, "hostPort": 8125, "protocol": "tcp" }, { "containerPort": 8126, "hostPort": 8126, "protocol": "tcp" } ], "essential": true, "environment": [ { "name": "DD_APM_NON_LOCAL_TRAFFIC", "value": "true" }, { "name": "DD_TRACE_SAMPLE_RATE", "value": "1.0" }, { "name": "DD_SITE", "value": "datadoghq.com" }, { "name": "DD_DOGSTATSD_TAG_CARDINALITY", "value": "high" }, { "name": "DD_APM_ENABLED", "value": "true" }, { "name": "DD_ENV", "value": "" }, { "name": "DD_CHECKS_TAG_CARDINALITY", "value": "high" }, { "name": "DD_DOGSTATSD_NON_LOCAL_TRAFFIC", "value": "true" }, { "name": "DD_LOG_LEVEL", "value": "CRITICAL" }, { "name": "DD_CLOUD_PROVIDER_METADATA", "value": "aws" } ], "mountPoints": [ { "sourceVolume": "docker_sock", "containerPath": "/var/run/docker.sock" }, { "sourceVolume": "cgroup", "containerPath": "/host/sys/fs/cgroup" }, { "sourceVolume": "proc", "containerPath": "/host/proc" } ], "volumesFrom": [], "secrets": [ ... ], "user": "0", "logConfiguration": { "logDriver": "awslogs", "options": { "awslogs-group": "/aws/ecs/-", "awslogs-region": "", "awslogs-stream-prefix": "-" } }, "healthCheck": { "command": [ "./probe.sh" ], "interval": 30, "timeout": 5, "retries": 3 }, "systemControls": [] }, { "name": "fluentbit", "image": ".dkr.ecr..amazonaws.com/:1.6.0", "cpu": 0, "memoryReservation": 128, "portMappings": [ { "containerPort": 5170, "hostPort": 5170, "protocol": "tcp" } ], "essential": true, "environment": [ { "name": "EVENTS_OUTPUT_RETRY_LIMIT", "value": "1" }, { "name": "AWS_REGION", "value": "" }, { "name": "DIMENSIONS_OUTPUT_RETRY_LIMIT", "value": "1" }, { "name": "FLUENT_BUFFER_MAX_SIZE", "value": "6M" }, { "name": "EVENTS_OUTPUT_WORKERS", "value": "1" }, { "name": "DIMENSIONS_OUTPUT_TOTAL_FILE_SIZE", "value": "25M" }, { "name": "DD_SOURCE", "value": "-" }, { "name": "DD_TAGS", "value": "" }, { "name": "EVENTS_OUTPUT_UPLOAD_TIMEOUT", "value": "10s" }, { "name": "EVENTS_OUTPUT_TOTAL_FILE_SIZE", "value": "25M" }, { "name": "AWS_S3_BUCKET", "value": "" }, { "name": "FLUENT_BIT_S3_EXPORTER", "value": "true" }, { "name": "DD_SERVICE", "value": "-" }, { "name": "SERVICE_HC_PERIOD", "value": "5" }, { "name": "FLUENT_BUFFER_CHUNK_SIZE", "value": "1M" }, { "name": "SERVICE_GRACE", "value": "15" }, { "name": "DIMENSIONS_OUTPUT_UPLOAD_TIMEOUT", "value": "10s" }, { "name": "DIMENSIONS_OUTPUT_WORKERS", "value": "1" }, { "name": "SERVICE_HC_RETRY_FAILURE_COUNT", "value": "5" }, { "name": "SERVICE_LOG_LEVEL", "value": "info" }, { "name": "SERVICE_FLUSH", "value": "10" }, { "name": "SERVICE_HC_ERRORS_COUNT", "value": "5" } ], "mountPoints": [], "volumesFrom": [], "secrets": [ ... ], "user": "0", "logConfiguration": { "logDriver": "awslogs", "options": { "awslogs-group": "/aws/ecs/-", "awslogs-region": "", "awslogs-stream-prefix": "-" } }, "healthCheck": { "command": [ "CMD-SHELL", "curl -f http://127.0.0.1:2020/api/v1/health || exit 1" ], "interval": 10, "timeout": 5, "retries": 3, "startPeriod": 30 }, "systemControls": [], "firelensConfiguration": { "type": "fluentbit", "options": { "config-file-type": "file", "config-file-value": "/fluent-bit/etc/fluent-bit.conf", "enable-ecs-log-metadata": "true" } } }, { "name": "dnsmasq", "image": ".dkr.ecr..amazonaws.com/:0.4.0", "cpu": 128, "memoryReservation": 100, "portMappings": [], "essential": true, "environment": [], "mountPoints": [], "volumesFrom": [], "linuxParameters": { "capabilities": { "add": [ "NET_ADMIN" ], "drop": [] } }, "stopTimeout": 30, "user": "0", "logConfiguration": { "logDriver": "awslogs", "options": { "awslogs-group": "/aws/ecs/-", "awslogs-region": "", "awslogs-stream-prefix": "-" } }, "healthCheck": { "command": [ "/etc/healthcheck.sh" ], "interval": 30, "timeout": 5, "retries": 3 }, "systemControls": [] } ], "family": "-ecs-cluster--", "taskRoleArn": "arn:aws:iam:::role/--ecs-task", "executionRoleArn": "arn:aws:iam:::role/--ecs-task-execution", "networkMode": "awsvpc", "revision": 39, "volumes": [ { "name": "cgroup", "host": { "sourcePath": "/sys/fs/cgroup/" } }, { "name": "proc", "host": { "sourcePath": "/proc/" } }, { "name": "docker_sock", "host": { "sourcePath": "/var/run/docker.sock" } } ], "status": "ACTIVE", "requiresAttributes": [ { "name": "ecs.capability.execution-role-awslogs" }, { "name": "com.amazonaws.ecs.capability.ecr-auth" }, { "name": "ecs.capability.firelens.options.config.file" }, { "name": "com.amazonaws.ecs.capability.docker-remote-api.1.17" }, { "name": "com.amazonaws.ecs.capability.docker-remote-api.1.21" }, { "name": "com.amazonaws.ecs.capability.logging-driver.awsfirelens" }, { "name": "com.amazonaws.ecs.capability.task-iam-role" }, { "name": "ecs.capability.container-health-check" }, { "name": "ecs.capability.execution-role-ecr-pull" }, { "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18" }, { "name": "ecs.capability.task-eni" }, { "name": "com.amazonaws.ecs.capability.docker-remote-api.1.29" }, { "name": "com.amazonaws.ecs.capability.logging-driver.awslogs" }, { "name": "com.amazonaws.ecs.capability.docker-remote-api.1.24" }, { "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19" }, { "name": "ecs.capability.firelens.fluentbit" }, { "name": "ecs.capability.secrets.asm.environment-variables" }, { "name": "ecs.capability.container-ordering" } ], "placementConstraints": [], "compatibilities": [ "EC2" ], "requiresCompatibilities": [ "EC2" ], "cpu": "512", "memory": "1024", "registeredAt": "2024-06-11T06:25:16.832Z", "registeredBy": "arn:aws:iam:::user/", "tags": [ { "key": "-version", "value": "1.6.0" }, { "key": "owner", "value": "" }, { "key": "product", "value": "" }, { "key": "SERVICE-NAME-ID", "value": "-" }, { "key": "environment", "value": "" }, { "key": "consumer", "value": "" }, { "key": "managed-by", "value": "terraform" }, { "key": "service-stage", "value": "-" }, { "key": "Name", "value": "" }, { "key": "nickname", "value": "-" }, { "key": "DEPLOYMENT-TIER", "value": "" }, { "key": "environment", "value": "" } ] } ``` #### Fluentbit configuration ``` [SERVICE] HTTP_Server On HTTP_Listen 0.0.0.0 HTTP_PORT 2020 Flush 10 Grace 15 Daemon off Log_Level info Health_Check On HC_Errors_Count 5 HC_Retry_Failure_Count 5 HC_Period 5 [INPUT] Name forward Listen 0.0.0.0 Port 5170 Buffer_Chunk_Size 1M Buffer_Max_Size 6M [INPUT] Name forward unix_path /var/run/fluent.sock Tag_Prefix service_logs. [OUTPUT] Name s3 Match events.* bucket ${AWS_S3_BUCKET} region ${AWS_REGION} total_file_size 25M upload_timeout 10s use_put_object On compression gzip s3_key_format /reports/$TAG[0]/$TAG[1]/%Y/%m/%d/%H/%M/%S/${ECS_TASK_ID}.json.gz json_date_key false static_file_path On retry_limit 1 workers 1 tls Off tls.verify Off [OUTPUT] Name s3 Match dimensions bucket ${AWS_S3_BUCKET} region ${AWS_REGION} total_file_size 25M upload_timeout 10s use_put_object On compression gzip s3_key_format /reports/dimensions/%Y/%m/%d/%H/%M/%S/${ECS_TASK_ID}.json.gz json_date_key false static_file_path On retry_limit 1 workers 1 tls Off tls.verify Off [OUTPUT] Name datadog Match service_logs.* Host http-intake.logs.datadoghq.com TLS on compress gzip provider ecs apikey ${DD_API_KEY} dd_service ${DD_SERVICE} dd_source ${DD_SOURCE} dd_tags ${DD_TAGS} dd_message_key log ``` ### Fluent Bit Log Output There are only logs about successful file upload ``` { "timestamp": 1718150801649, "message": "[2024/06/12 00:06:41] [ info] [output:s3:s3.0] Successfully uploaded object .json.gz", "ingestionTime": 1718150803721 }, ``` ### Fluent Bit Version Info amazon/aws-for-fluent-bit:2.31.11 Application is written in go, and is using github.com/IBM/fluent-forward-go v0.2.2 client to publish messages to fluent-bit ### Cluster Details ECS on EC2 with Sidecar deployment ### Application Details The application at the time of occurring has been sending 26k of smaller messages and 26k if bigger messages to the two instances of fluent bit within around 30 seconds burst. ### Steps to reproduce issue - ### Related Issues Not really sure, maybe this: https://github.com/aws/aws-for-fluent-bit/issues/376