aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0
11.51k stars 3.85k forks source link

(ecs): default to stable fluentbit image instead of latest #16403

Open ayozemr opened 3 years ago

ayozemr commented 3 years ago

:question: General Issue

The Question

I am affected by a bug in Fluentbit that causes DNS problems when using firelens log driver, almost with datadog. (https://github.com/aws/aws-for-fluent-bit/issues/233). As of now, working solution is to not use latest image and use 2.19.0, which I have tested working.

My problem is that using CDK I don't see how I can set to use a fixed fluentbit version in the "log-router" container, as that container definition is created by cloudformation when we specify the logrouter in CDK/CFN in the app container taskdef, the system adds that log-router sidecar container with the version set to latest automaticly.

Its possible to change that sidecar container image from latest to 2.19.0?

Environment

Other information

  logging: ecs.LogDrivers.firelens({
    options: {
      Name: 'datadog',
      provider: 'ecs',
      TLS: 'on',
      dd_message_key: 'log',
      dd_service: 'test_service,
      dd_source: 'java',
      dd_tags: 'env:test',
    },
    secretOptions: {
      apikey: ecs.Secret.fromSsmParameter(
        REDACTED
      ),
    },
  }),

Thanks for your time!!

peterwoodworth commented 3 years ago

Hey @ayozemr,

To help me fully understand your problem, can you share the relevant cdk synth output you'd like to see changed, and can you share the code which creates your container? Thanks!

ayozemr commented 3 years ago

Sure, there we go. I redacted some possible sensitive values.

The thing is the "log-router" container auto created has an image linked to a param out of my control, that checking ECS taskdefinition json, translates to "image": "906394416424.dkr.ecr.us-east-1.amazonaws.com/aws-for-fluent-bit:latest"

So the idea is if there is any possibility that "log-router" image is any other tag instead of "latest" until they fix the bug, so I can make that "log-router" container uses "906394416424.dkr.ecr.us-east-1.amazonaws.com/aws-for-fluent-bit:2.19.0" instead of latest.

Relevant code:

CDK:

    const taskdefFamily = `${buildConfig.App}-${buildConfig.Environment}-dbmigrations`;
    const migrationTaskDef = new ecs.TaskDefinition(this, 'MigrationsTaskDef', {
      family: taskdefFamily,
      compatibility: ecs.Compatibility.FARGATE,
      cpu: '256',
      memoryMiB: '512',
      networkMode: ecs.NetworkMode.AWS_VPC,
      taskRole: taskdefRole,
      executionRole: taskExecutionRole,
    });

    migrationTaskDef.addContainer('FlywayContainer', {
      image: ecs.RepositoryImage.fromEcrRepository(ecrRepository, 'latest'),
      memoryLimitMiB: 512,
      environment: {
        ENV: buildConfig.Environment,
        FLYWAY_URL: `jdbc:postgresql://${dbcredentials
          .secretValueFromJson('host')
          .toString()}:${dbcredentials
          .secretValueFromJson('port')
          .toString()}/db`,
      },
      logging: ecs.LogDrivers.firelens({
        options: {
          Name: 'datadog',
          provider: 'ecs',
          TLS: 'on',
          dd_message_key: 'log',
          dd_service: `${buildConfig.App}-Migrations`,
          dd_source: 'java',
          dd_tags: `env:${buildConfig.Environment}`,
        },
        secretOptions: {
          apikey: ecs.Secret.fromSsmParameter(
            ssm.StringParameter.fromStringParameterName(
              this,
              'DatadogApiKey',
              '/datadog/APIKEY'
            )
          ),
        },
      }),
    });

// ----------------------------- Synth cloudformation:

  MigrationsTaskDef1F4C395A:
    Type: AWS::ECS::TaskDefinition
    Properties:
      ContainerDefinitions:
        - Environment:
            - Name: ENV
              Value: stage
            - Name: FLYWAY_URL
              Value: REDACTED
          Essential: true
          Image:
            Fn::Join:
              - ""
              - - Fn::Select:
                    - 4
                    - Fn::Split:
                        - ":"
                        - Fn::GetAtt:
                            - MigrationsServiceDECCB9A7
                            - Arn
                - .dkr.ecr.
                - Fn::Select:
                    - 3
                    - Fn::Split:
                        - ":"
                        - Fn::GetAtt:
                            - MigrationsServiceDECCB9A7
                            - Arn
                - "."
                - Ref: AWS::URLSuffix
                - /
                - Ref: MigrationsServiceDECCB9A7
                - :latest
          LogConfiguration:
            LogDriver: awsfirelens
            Options:
              Name: datadog
              provider: ecs
              TLS: "on"
              dd_message_key: log
              dd_service: svc-Migrations
              dd_source: java
              dd_tags: env:stage
            SecretOptions:
              - Name: apikey
                ValueFrom: REDACTED-PARAM
          Memory: 512
          Name: app
        - Essential: true
          FirelensConfiguration:
            Type: fluentbit
          Image:
            Ref: SsmParameterValueawsserviceawsforfluentbitlatestC96584B6F00A464EAD1953AFF4B05118Parameter
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group:
                Ref: MigrationsTaskDeflogrouterLogGroupAFD29B37
              awslogs-stream-prefix: firelens
              awslogs-region: us-east-1
          MemoryReservation: 50
          Name: log-router
      Cpu: "256"
      ExecutionRoleArn:
        Fn::GetAtt:
          - TaskExecRoleA2BBB60C
          - Arn
      Family: REDACTED
      Memory: "512"
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - FARGATE
      Tags:
        - Key: app
          Value: appname
        - Key: environment
          Value: stage
        - Key: stack
          Value: stackname
      TaskRoleArn:
        Fn::GetAtt:
          - TaskRole30FC0FBB
          - Arn
    Metadata:
      aws:cdk:path: stackname/MigrationsTaskDef/Resource
peterwoodworth commented 3 years ago

This is where the image is set: https://github.com/aws/aws-cdk/blob/026cb8f1aa6b915969e22050968b0719c83aa2a4/packages/%40aws-cdk/aws-ecs/lib/base/task-definition.ts#L679-L696

The obtainDefaultFluentBitECRImage function used for the image here will return this image: https://github.com/aws/aws-cdk/blob/026cb8f1aa6b915969e22050968b0719c83aa2a4/packages/%40aws-cdk/aws-ecs/lib/firelens-log-router.ts#L179-L184

fluentBitImageTag will always be latest in this case, because imageTag isn't passed in as an argument to obtainDefaultFluentBitECRImage(...). So there's no direct way to modify this by the user before the CDK creates the CloudFormation output with the latest image.

Unfortunately, the use of escape hatches won't work here because CfnTaskDefinition.containerDefinitions is a lazy value. I'll have to research to see if there's a workaround for this.

skinny85 commented 3 years ago

Hey @ayozemr ,

I think you can use the public addFirelensLogRouter() method of TaskDefinition to supply the FireLens router, using any image you want. Something like:

        migrationTaskDef.addFirelensLogRouter('log-router', {
            image: your-image-here,
            firelensConfig: {
                type: FirelensLogRouterType.FLUENTBIT,
            },
            logging: new AwsLogDriver({ streamPrefix: 'firelens' }),
            memoryReservationMiB: 50,
        });

Thanks, Adam

skinny85 commented 3 years ago

BTW, I think this should morph into a feature request to be able to more easily customize the image for the firelens() method.

ayozemr commented 3 years ago

Thanks @peterwoodworth for your time.

@skinny85 I have tried that but ended having problem when pulling the non latest image 🤷 . Had to use fromRepositoryArn because fromRepositoryName was adding my account id by default, ending in an incorrect image uri.

Its weird but:

Reading official repo seems there is a public image, but looking at its URI I dont know how I can build its ARN to be able to use: https://github.com/aws/aws-for-fluent-bit -> public.ecr.aws/aws-observability/aws-for-fluent-bit:

So your solution works to build using another image, but I had no luck with the pull process in fargate...

Code used:

    const awsFluentRepo = ecr.Repository.fromRepositoryArn(
      this,
      'AwsFluentRepo',
      'arn:aws:ecr:us-east-1:906394416424:/aws-for-fluent-bit'
      // '906394416424.dkr.ecr.us-east-1.amazonaws.com/aws-for-fluent-bit:2.19.0'
    );

    migrationTaskDef.addFirelensLogRouter('log-router', {
      image: ecs.ContainerImage.fromEcrRepository(awsFluentRepo, '2.19.0'),
      firelensConfig: {
        type: ecs.FirelensLogRouterType.FLUENTBIT,
      },
      logging: new ecs.AwsLogDriver({ streamPrefix: 'firelens' }),
      memoryReservationMiB: 50,
    });

Something to override firelens image would be useful, especially in cases like this related to a bug. Maybe also something to use a container image using its URI instead of arn, would let me use that public ecr image

Thanks for your time all!!

skinny85 commented 3 years ago

@ayozemr I think what you're looking for here is the ContainerImage.fromRegistry() method.

So something like:

    migrationTaskDef.addFirelensLogRouter('log-router', {
      image: ecs.ContainerImage.fromRegistry('public.ecr.aws/aws-observability/aws-for-fluent-bit:2.19.0'),
      // ...
ayozemr commented 3 years ago

Oh yes! That did it 👏 Much appreciated!! Thanks for your time!

About changing the issue to feature request, I don't know how to do it. I can close this one and create another referencing this if its ok for you

blimmer commented 3 years ago

I just ran into this problem and opened https://github.com/aws/aws-cdk/pull/16439 to default to the stable image instead of latest.

peterwoodworth commented 3 years ago

Great, I've converted this into a feature request. Changing the image to default to stable seems like a fine solution to me

madeline-k commented 2 years ago

Closing this issue, since the upstream issue with fluentbit has been resolved. And we are no longer planning to implement this change in the default.

github-actions[bot] commented 2 years ago

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

tessro commented 1 year ago

I know this is closed as wontfix, but I figured I'd post anyway since the AWS for Fluent Bit image broke again: https://github.com/aws/aws-for-fluent-bit/issues/491

This caused an outage for us that would have been avoided by using stable.

Is using bleeding-edge Fluent Bit, at increased risk of instability, really worth it? stable only tracks behind by 14 days, is this really a breaking change that needs to be avoided? It would probably help a lot of people like me who had to learn way more than they ever wanted about the internals of Fluent Bit images in order to address this issue today.

blimmer commented 1 year ago

Per the comment visibility warning, cc @madeline-k and @peterwoodworth RE: @paulrosania's comment above.

My PR was closed because changing from latest to stable would technically be a breaking change. However, since it broke again, I think it's worth making the breaking change, even behind a feature flag to make people's lives easier. Having your ECS tasks fail to start up twice in two years because of problems with latest seems unacceptable IMO.

MrArnoldPalmer commented 1 year ago

We can possibly implement this behind a feature flag to avoid the breakage. Should be a fairly simple contribution if someone wants to take a crack at it.