aws / aws-for-fluent-bit

The source of the amazon/aws-for-fluent-bit container image
Apache License 2.0

fluent-bit init uses `cloudwatch` plugin not specified in the config #835

Open borkod opened 5 months ago

borkod commented 5 months ago
### Describe the question/issue

- Issue 1

Fluent Bit logs show an `AccessDeniedException` because it tries to create a log group that it is not allowed to create and that is not in our configuration:
```
time="2024-06-13T18:48:52Z" level=error msg="AccessDeniedException: User: arn:aws:sts::xxxxxxxxxxxx:assumed-role/fluentbit-task-role/xxxxxxx is not authorized to perform: logs:CreateLogGroup on resource: arn:aws:logs:us-east-1:xxxxxxxxxxxx:log-group:fluent-bit-cloudwatch:log-stream: because no identity-based policy allows the logs:CreateLogGroup action\n\tstatus code: 400, request id: xxxxxxxx"
```

However, our output plugin setting is:
```
[OUTPUT]
    Name cloudwatch_logs
    Match *
    region ca-central-1
    log_group_name testname
    log_stream_name teststream
    auto_create_group false
    Retry_Limit no_limits
```

During Fluent Bit startup we see the following logs:
```
[2024/06/13 19:22:22] [ info] cloudwatch.0
...
time="2024-06-13T19:22:22Z" level=info msg="[cloudwatch 0] plugin parameter auto_create_stream = 'true'"
time="2024-06-13T19:22:22Z" level=info msg="[cloudwatch 0] plugin parameter auto_create_group = 'true'"
...
time="2024-06-13T19:22:22Z" level=info msg="[cloudwatch 0] plugin parameter region = 'us-east-1'"
...
time="2024-06-13T19:22:22Z" level=info msg="[cloudwatch 0] plugin parameter default_log_group_name = 'fluentbit-default'"
time="2024-06-13T19:22:22Z" level=info msg="[cloudwatch 0] plugin parameter log_group_name = 'fluent-bit-cloudwatch'"
```

Our configuration only uses the newer `cloudwatch_logs` plugin. We do not specify or use the `cloudwatch` plugin. Nevertheless, the `cloudwatch` plugin appears to be loaded as well, with a config that specifies the `us-east-1` region and the `fluent-bit-cloudwatch` log group, as shown in the logs. This is what causes the `AccessDeniedException`.

As for the `cloudwatch_logs` plugin that we did specify, logs are written correctly to the specified log group / log stream.
- Issue 2

As shown in the output config above, we set `Retry_Limit` to `no_limits`. However, the logs show:
```
[2024/06/13 19:31:07] [ warn] [engine] chunk '1-1718307049.471694794.flb' cannot be retried: task_id=0, input=syslog.1 > output=cloudwatch.0
[2024/06/13 19:31:07] [debug] [task] task_id=0 reached retry-attempts limit 1/1
```

Earlier startup logs show:
```
[2024/06/13 19:30:50] [debug] [output:cloudwatch_logs:cloudwatch_logs.1] task_id=0 assigned to thread #0
```

It's not completely clear to me whether `task_id=0 reached retry-attempts limit 1/1` refers to the `cloudwatch_logs` plugin. If it does, why is it not respecting our `Retry_Limit no_limits` setting? (We've also tried other values, e.g. `5` instead of `no_limits`.) Or is it related to the preceding warning line, which references `cloudwatch.0`, meaning it also comes from our mysterious `cloudwatch` plugin?

### Configuration

ECS Config:

resource "aws_ecs_service" "fluentbit" {
  name            = "fluentbit"
  task_definition = aws_ecs_task_definition.fluentbit.arn
  cluster = aws_ecs_cluster.fluentbit.id
  launch_type = "FARGATE"
  desired_count = 2
  enable_execute_command = true

  network_configuration {
    assign_public_ip = false

    security_groups = [
      aws_security_group.fluentbit-container-sg.id,
    ]

    subnets = [
      data.aws_ssm_parameter.subnet1.value,
      data.aws_ssm_parameter.subnet2.value,
    ]
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.fluentbit_ecs_syslog_tg.arn
    container_name   = "fluentbit"
    container_port   = "5140"
  }
}

resource "aws_ecs_task_definition" "fluentbit" {
  family = "fluentbit"

  container_definitions = jsonencode([{
    name = "fluentbit"
    essential = true
    #readonlyRootFilesystem = true    can't be enabled because AWS fargate in the s3 init files https://github.com/fluent/fluent-bit/issues/7308
    image = "${data.aws_ssm_parameter.fluent-latest-image.value}"
    entrypoint = ["/bin/sh","-c"]
    command = ["/init/fluent_bit_init_entrypoint.sh"]
    environment = [
      {
        name = "aws_fluent_bit_init_s3_1"
        value = "${aws_s3_bucket.syslog-config.arn}/fluent/syslog-fluent-base.conf"
      },
      {
        name = "aws_fluent_bit_init_s3_2"
        value = "${aws_s3_bucket.syslog-config.arn}/fluent/syslog-fluent-input.conf"
      },
      {
        name = "aws_fluent_bit_init_s3_3"
        value = "${aws_s3_bucket.syslog-config.arn}/fluent/syslog-fluent-parser.conf"
      },
      {
        name = "aws_fluent_bit_init_s3_4"
        value = "${aws_s3_bucket.syslog-config.arn}/fluent/syslog-fluent-output.conf"
      }
    ] 
    portMappings = [{
      containerPort = 5140
      hostPort = 5140
      protocol = "tcp"
    },{
      containerPort = 2020
      hostPort = 2020
      protocol = "tcp"
    }]
    healthcheck = {
      command = ["CMD-SHELL","curl -f http://localhost:2020/api/v1/health || exit 1"] 
      interval = 60
      timeout = 5
      retries = 3
      start_period = 90
    } 
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-region" = "ca-central-1"
        "awslogs-group" = "${aws_cloudwatch_log_group.ecs_fluentbit_service.id}"
        "awslogs-stream-prefix" = "ecs"
      }
    }
  }])

### Fluent Bit Log Output

See above.
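One way to settle which output a retry warning belongs to is to read the instance name from the warning line itself. The following is a minimal sketch, assuming the engine warning keeps the `task_id=..., input=... > output=...` layout seen in the logs above; `parse_route` and `plugin_name` are hypothetical helper names, not part of Fluent Bit.

```python
import re

# Matches the routing part of an engine warning such as:
#   ... cannot be retried: task_id=0, input=syslog.1 > output=cloudwatch.0
# Assumption: the log layout is the one quoted in this issue.
ROUTE_RE = re.compile(r"task_id=(\d+), input=(\S+) > output=(\S+)")


def parse_route(line):
    """Return (task_id, input_instance, output_instance), or None if absent."""
    m = ROUTE_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), m.group(2), m.group(3)


def plugin_name(instance):
    """Strip the numeric instance suffix: 'cloudwatch_logs.1' -> 'cloudwatch_logs'."""
    return instance.rsplit(".", 1)[0]


line = (
    "[2024/06/13 19:31:07] [ warn] [engine] chunk '1-1718307049.471694794.flb' "
    "cannot be retried: task_id=0, input=syslog.1 > output=cloudwatch.0"
)
task_id, src, dst = parse_route(line)
print(task_id, src, plugin_name(dst))
```

For the warning quoted in Issue 2 this prints `0 syslog.1 cloudwatch`: the chunk that cannot be retried was routed to the `cloudwatch.0` instance, not to `cloudwatch_logs.1`.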

### Fluent Bit Version Info

- Container: aws-for-fluent-bit:init-latest
- Fluent-bit version: Fluent Bit v1.9.10

### Cluster Details

### Application Details

### Steps to reproduce issue

### Related Issues

MrHash commented 2 weeks ago

You might need to override the default fluent-bit config as explained here or possibly use this approach https://github.com/aws-samples/amazon-ecs-firelens-examples/blob/mainline/examples/fluent-bit/health-check/task-definition-output-metrics-healthcheck.json#L14
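Before overriding anything, it may help to confirm which outputs the loaded config actually declares. Below is a minimal sketch, assuming classic-mode (`[SECTION]` / `Key Value`) syntax and that you can get hold of the config text, for example from a shell inside the container; `list_outputs` is a hypothetical helper, and it does not follow `@INCLUDE` directives, so each included file would need to be checked separately.

```python
def list_outputs(config_text):
    """Return the Name value of every [OUTPUT] section in a classic-mode config."""
    names = []
    in_output = False
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            # A new section starts; remember whether it is an output section.
            in_output = line.upper() == "[OUTPUT]"
            continue
        if in_output:
            parts = line.split(None, 1)
            if len(parts) == 2 and parts[0].lower() == "name":
                names.append(parts[1].strip())
    return names


# The output section quoted in the issue description.
sample = """
[OUTPUT]
    Name cloudwatch_logs
    Match *
    region ca-central-1
    log_group_name testname
    log_stream_name teststream
    auto_create_group false
    Retry_Limit no_limits
"""
print(list_outputs(sample))
```

If a config file loaded at startup yields `cloudwatch` in this list, that file is the likely source of the extra `cloudwatch.0` instance seen in the logs.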