aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

aws-ecs: LogDriver automatically creates log-resource-policies, exhausting unmodifiable resource limit #22307

Closed trks1970 closed 10 months ago

trks1970 commented 2 years ago

Describe the bug

We are running 12 Fargate containers, each of them logging to CloudWatch via LogDriver, created with CDK.

However, the last 2 fail to deploy due to "Resource limit exceeded. (Service: CloudWatchLogs, Status Code: 400," which I understand refers to a limit that is fixed at 10 with no way to change it.

However, it looks like a log-resource-policy is created automatically by LogDriver, and each of these policies counts against that limit.

Expected Behavior

It should be possible to either a) not create a policy or b) reuse an existing policy. Currently, neither is possible.

Current Behavior

A log-resource-policy is created with each LogDriver.awsLogs call.
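
For reference, a minimal sketch of such a call (the stream prefix is illustrative):

import { aws_ecs as ecs } from 'aws-cdk-lib';

// Each container definition that uses LogDrivers.awsLogs ends up with its
// own log-resource-policy, per the behavior described above.
const logging = ecs.LogDrivers.awsLogs({ streamPrefix: 'my-service' });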

Reproduction Steps

Deploy more than 10 containers, each with a LogDriver (or deploy enough to exceed 10 CloudWatch Logs resource policies in the account).

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.40.0 (build 56ba2ab)

Framework Version

aws-cdk-lib:2.27.0

Node.js Version

16.15.1

OS

Win10

Language

Java

Language Version

Kotlin

Other information

No response

pahud commented 1 year ago

@trks1970 This makes sense to me. Before we dive deep into the fix, can you provide reproduction steps with CDK code that can help us quickly reproduce this issue from our end?

github-actions[bot] commented 1 year ago

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

trks1970 commented 1 year ago

Will take some time to provide an example.

mvs5465 commented 1 year ago

Hi @pahud, we are running into this too with some CDK-created CloudWatch log group policies:

Example 1, sending cloudtrail events to eventbridge and logging the events:

import { aws_cloudtrail, aws_events_targets as targets, aws_logs as logs } from 'aws-cdk-lib';

const loginEvent = aws_cloudtrail.Trail.onEvent(this, 'login-event', {
  eventPattern: {
    source: ['aws.signin'],
    detailType: ['AWS Console Sign In via CloudTrail'],
  },
});

loginEvent.addTarget(new targets.CloudWatchLogGroup(new logs.LogGroup(this, 'login-event-logs')));

Example 2, creating an OpenSearch domain:

import { aws_opensearchservice } from 'aws-cdk-lib';

new aws_opensearchservice.Domain(this, 'domain', {
  version: aws_opensearchservice.EngineVersion.OPENSEARCH_1_3, // version is required; value illustrative
  // Specifically enabling logging here
  logging: {
    slowSearchLogEnabled: true,
    appLogEnabled: true,
    slowIndexLogEnabled: true,
  },
});
pahud commented 1 year ago

@mvs5465

Are you having the Resource limit exceeded error just with this? Can you provide a full sample that reproduces this error?

const loginEvent = aws_cloudtrail.Trail.onEvent(this, 'login-event', {
  eventPattern: {
    source: ['aws.signin'],
    detailType: ['AWS Console Sign In via CloudTrail'],
  },
});

loginEvent.addTarget(new targets.CloudWatchLogGroup(new logs.LogGroup(this, 'login-event-logs')));
mvs5465 commented 1 year ago

@pahud We already have 10 cloudwatch log group policies in our account. Running the code you mentioned will create one more, and thus the error.

pahud commented 1 year ago

OK I see.

Looking at this https://github.com/aws/aws-cdk/blob/cea1039e3664fdfa89c6f00cdaeb1a0185a12678/packages/%40aws-cdk/aws-ecs/lib/log-drivers/aws-log-driver.ts#L119

When we create a new AwsLogDriver() for a container definition, grantWrite() is always executed, adding a new resource policy to the log group.

https://github.com/aws/aws-cdk/blob/be97e4e5006990c7340025618b03f5d5eacda253/packages/%40aws-cdk/aws-logs/lib/log-group.ts#L194
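
For illustration, a minimal sketch of the grant in question (the declared names are placeholders, not part of the linked source):

import { aws_iam as iam, aws_logs as logs } from 'aws-cdk-lib';

declare const logGroup: logs.LogGroup;
declare const grantee: iam.IGrantable;

// The log driver's bind results in a call like this; when the grantee's
// IAM policy cannot be modified, the grant can land on the log group's
// resource policy instead, which is what counts against the limit.
logGroup.grantWrite(grantee);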

I don't have an immediate workaround, so I am marking this as p1 for more visibility. Meanwhile, I will try to reproduce this in my account by creating 10 containers with the same log group.

pahud commented 1 year ago

@trks1970

I tried creating 10 containers with the 2 approaches below and didn't hit the error you mentioned.

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';

export class EcsTsStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const logGroup = new cdk.aws_logs.LogGroup(this, 'LG')
    const task = new cdk.aws_ecs.FargateTaskDefinition(this, 'Task', {
      cpu: 512,
      memoryLimitMiB: 4096,
    });

    const task2 = new cdk.aws_ecs.FargateTaskDefinition(this, 'Task2', {
      cpu: 512,
      memoryLimitMiB: 4096,
    });

    // this works
    for (let x = 0; x < 10; x++) {
      task.addContainer(`nginx${x}`, {
        image: cdk.aws_ecs.ContainerImage.fromRegistry('nginx:latest'),
        memoryLimitMiB: 256,
        logging: new cdk.aws_ecs.AwsLogDriver({
          logGroup,
          streamPrefix: 'demo-',
        })
      });
    }

    // this works as well
    for (let x = 0; x < 10; x++) {
      task2.addContainer(`nginx${x}`, {
        image: cdk.aws_ecs.ContainerImage.fromRegistry('nginx:latest'),
        memoryLimitMiB: 256,
        logging: new cdk.aws_ecs.AwsLogDriver({
          logGroup: new cdk.aws_logs.LogGroup(this, `LG${x}`),
          streamPrefix: 'demo-',
        })
      });
    }

  }
}

Can you provide me with the full cdk code that reproduces your error?

pahud commented 1 year ago

OK I can reproduce this as below:

    // create 10 events each having a loggroup as the event target
    for (let x = 0; x < 10; x++) {
      const loginEvent = cdk.aws_cloudtrail.Trail.onEvent(this, `login-event${x}`, {
        eventPattern: {
          source: ['aws.signin'],
          detailType: ['AWS Console Sign In via CloudTrail'],
        },
      });

      loginEvent.addTarget(new cdk.aws_events_targets.CloudWatchLogGroup(
        new cdk.aws_logs.LogGroup(this, `login-event-logs${x}`),
      ));
    }

This will create 10 CloudWatch Logs resource policies, and according to the documentation the hard limit is 10, hence the error.

The policies can be listed with aws logs describe-resource-policies.

I think this is not a bug but a limitation of CloudWatch Logs resource policies. From the CDK's perspective we could probably provide a workaround, such as opting out of resource policy creation, but I am not sure if that is a good idea.
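
Purely as an illustration of what such an opt-out could look like (the property below is hypothetical and does not exist in AwsLogDriver today):

import { aws_ecs as ecs, aws_logs as logs } from 'aws-cdk-lib';

declare const logGroup: logs.LogGroup;

// HYPOTHETICAL: 'createResourcePolicy' is not a real AwsLogDriver prop;
// it only sketches the opt-out idea mentioned above.
const logging = new ecs.AwsLogDriver({
  logGroup,
  streamPrefix: 'demo-',
  // createResourcePolicy: false, // hypothetical opt-out
});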

I am marking this as a p1 feature request for more visibility from the team.

vinayak-kukreja commented 1 year ago

Similar Issue: https://github.com/aws/aws-cdk/issues/20313

marinecode commented 1 year ago

I have come up with a possible workaround. When I moved the LogGroup creation to a separate stack, the ResourcePolicy was not created. In the TaskDefinition stack, I referenced the created LogGroup by ARN via LogGroup.from_log_group_arn().
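
In TypeScript terms, a minimal sketch of this workaround might look as follows (stack and construct names are illustrative; LogGroup.fromLogGroupArn is the TypeScript counterpart of from_log_group_arn):

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';

// Stack that owns the log group; no containers reference it here.
class LogGroupStack extends cdk.Stack {
  public readonly logGroupArn: string;
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);
    this.logGroupArn = new cdk.aws_logs.LogGroup(this, 'AppLogs').logGroupArn;
  }
}

// Stack that imports the log group by ARN, so the LogDriver does not
// create a new resource policy for it in this stack.
class TaskStack extends cdk.Stack {
  constructor(scope: Construct, id: string, logGroupArn: string, props?: cdk.StackProps) {
    super(scope, id, props);
    const logGroup = cdk.aws_logs.LogGroup.fromLogGroupArn(this, 'ImportedLogs', logGroupArn);
    const task = new cdk.aws_ecs.FargateTaskDefinition(this, 'Task');
    task.addContainer('app', {
      image: cdk.aws_ecs.ContainerImage.fromRegistry('nginx:latest'),
      memoryLimitMiB: 256,
      logging: new cdk.aws_ecs.AwsLogDriver({ logGroup, streamPrefix: 'app' }),
    });
  }
}

const app = new cdk.App();
const logStack = new LogGroupStack(app, 'LogGroupStack');
new TaskStack(app, 'TaskStack', logStack.logGroupArn);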

mramirez-fdb commented 11 months ago

Any update on this?

XiaowenMaoA commented 10 months ago

@sakurai-ryo I just saw the pull request on this. When do we expect this change to be available? Is there any local workaround we can use in the meantime? Thanks

sakurai-ryo commented 10 months ago

@XiaowenMaoA The CDK core team is currently on vacation, so the review will be done after the vacation. Please check this other issue for workarounds: https://github.com/aws/aws-cdk/issues/20313#issuecomment-1861126356

If this doesn't work, please send me the code and I may be able to suggest a workaround.

XiaowenMaoA commented 10 months ago

@sakurai-ryo We are not inside an ECS task. Our logic for creating OpenSearch domains is as follows. We have more than 10 TenantNames, and we create an OpenSearch Domain for each TenantName.

// aws-cdk-lib imports added for completeness; project-internal types
// (DeploymentStack, SoftwareType, Utils, TenantNames, TENANT_CONFIG,
// Constants, KmsKey, ClusterConfig, OpenSearchClusterStackProps) are
// assumed to be defined elsewhere in this codebase.
import { App, RemovalPolicy } from 'aws-cdk-lib';
import { IVpc, SecurityGroup } from 'aws-cdk-lib/aws-ec2';
import { Effect, PolicyStatement, ServicePrincipal } from 'aws-cdk-lib/aws-iam';
import { LogGroup, RetentionDays } from 'aws-cdk-lib/aws-logs';
import { Domain, EncryptionAtRestOptions, EngineVersion } from 'aws-cdk-lib/aws-opensearchservice';

export class OpenSearchClusterStack extends DeploymentStack {
  public kmsKey: KmsKey;
  public cloudwatchPolicy: PolicyStatement;

  constructor(parent: App, name: string, props: OpenSearchClusterStackProps) {
    super(parent, name, {
      env: props.env,
      softwareType: SoftwareType.LONG_RUNNING_SERVICE,
      stackName: props.stackName,
      tags: props.tags,
    });

    const vpc = Utils.getVpc(this, this.region);

    const securityGroup = this.createSecurityGroup(vpc, props);

    this.kmsKey = this.createKmsKey(vpc, props);

    this.cloudwatchPolicy = new PolicyStatement({
      actions: ['logs:CreateLogStream', 'logs:PutLogEvents'],
      resources: [
        '*',
      ],
      effect: Effect.ALLOW,
      principals: [new ServicePrincipal('es.amazonaws.com')],
    });

    // create an OS domain for each TenantName
    Object.values(TenantNames).forEach((tenantNameStr) => {
      const tenantName = tenantNameStr as TenantNames;
      const clusterConfig = props.stage.clusterConfigByTenant[tenantName];
      const globaltenantConfig = TENANT_CONFIG[tenantName];
      const domainName = Utils.generateName(globaltenantConfig.clusterName, props.isOnebox)
        .toLowerCase();

      const domainAppLogGroupName = `${Utils.generateName(`${Constants.Resource.PREFIX}-OSDomain-AppLogGroup`, props.isOnebox)}-${domainName}`;
      const domainAppLogGroup = new LogGroup(this, domainAppLogGroupName, {
        logGroupName: domainAppLogGroupName,
        retention: RetentionDays.TEN_YEARS,
        removalPolicy: RemovalPolicy.RETAIN_ON_UPDATE_OR_DELETE,
      });

      const domain = this.createDomain(
        domainName,
        { enabled: true, kmsKey: this.kmsKey.key },
        domainAppLogGroup,
        vpc,
        securityGroup,
        props,
        clusterConfig,
      );
      domain.addAccessPolicies(this.cloudwatchPolicy);
    });
  }

  private createDomain(
    domainName: string,
    encryptionProperties: EncryptionAtRestOptions,
    appLogGroup: LogGroup,
    vpc: IVpc,
    securityGroup:SecurityGroup,
    props: OpenSearchClusterStackProps,
    clusterConfig: ClusterConfig,
  ) {

    return new Domain(this, domainName, {
      domainName,
      version: EngineVersion.OPENSEARCH_1_1,
      vpc,
      zoneAwareness: {
        enabled: true,
        availabilityZoneCount: vpc.privateSubnets.length,
      },
      capacity: {
        masterNodes:  clusterConfig.masterNodeCount,
        masterNodeInstanceType: clusterConfig.masterNodeType,
        dataNodes: clusterConfig.instanceCount,
        dataNodeInstanceType: clusterConfig.instanceType,
      },
      ebs: {
        volumeSize: clusterConfig.instanceVolumeSize,
        volumeType: clusterConfig.ebsVolumeType,
        iops: 18000,
      },
      securityGroups: [securityGroup],
      nodeToNodeEncryption: true,
      encryptionAtRest: encryptionProperties,
      enforceHttps: true,
      logging: {
        appLogEnabled: true,
        appLogGroup,
      },
      removalPolicy: RemovalPolicy.RETAIN_ON_UPDATE_OR_DELETE,
    });
  }
}
sakurai-ryo commented 10 months ago

Thanks @XiaowenMaoA

Looking at this part of the code, it appears that we are currently using a CustomResource to create the ResourcePolicy for the logs. https://github.com/aws/aws-cdk/blob/9c0fecf9a5cc26416e814ae4754729b54c827b9d/packages/aws-cdk-lib/aws-opensearchservice/lib/domain.ts#L1717

Since CustomResource is executed by CloudFormation after cdk deploy, it is difficult to remove the ResourcePolicy in the CDK code, for example by using the tryRemoveChild method. It may be possible to work around this by using a new custom resource and rewriting or deleting the ResourcePolicy, though.
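
As a rough, untested sketch of that idea (the policy name is a placeholder that would have to be discovered first, e.g. via aws logs describe-resource-policies):

import { custom_resources as cr } from 'aws-cdk-lib';
import { Construct } from 'constructs';

declare const scope: Construct;

// Sketch: an SDK-call custom resource that deletes a CloudWatch Logs
// resource policy by name after it has been created during deployment.
// 'example-policy-name' is a placeholder.
new cr.AwsCustomResource(scope, 'DeleteLogResourcePolicy', {
  onCreate: {
    service: 'CloudWatchLogs',
    action: 'deleteResourcePolicy',
    parameters: { policyName: 'example-policy-name' },
    physicalResourceId: cr.PhysicalResourceId.of('DeleteLogResourcePolicy'),
  },
  policy: cr.AwsCustomResourcePolicy.fromSdkCalls({
    resources: cr.AwsCustomResourcePolicy.ANY_RESOURCE,
  }),
});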

The PR I created is an ECS fix, and even if it were merged, it would not solve the problem on the OpenSearch side. Would you be so kind as to create a new issue for us to discuss this matter further? Note that a ResourcePolicy can currently be created in CloudFormation without the custom resource, and that the ResourcePolicy should be reused, as described in this document: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/createdomain-configure-slow-logs.html#:~:text=name%22%0A%20%20%20%20%7D%0A%7D-,Important,-CloudWatch%20Logs%20supports
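
For illustration, a hedged sketch of creating one shared resource policy explicitly with the aws-logs ResourcePolicy construct (the policy name and the log group ARN pattern are assumptions):

import { Stack } from 'aws-cdk-lib';
import { Effect, PolicyStatement, ServicePrincipal } from 'aws-cdk-lib/aws-iam';
import { ResourcePolicy } from 'aws-cdk-lib/aws-logs';

declare const stack: Stack;

// One shared policy covering all OpenSearch log groups under a common
// prefix, instead of one policy per domain. The name and ARN pattern
// are illustrative only.
new ResourcePolicy(stack, 'SharedOpenSearchLogPolicy', {
  resourcePolicyName: 'shared-opensearch-log-policy',
  policyStatements: [
    new PolicyStatement({
      effect: Effect.ALLOW,
      actions: ['logs:CreateLogStream', 'logs:PutLogEvents'],
      principals: [new ServicePrincipal('es.amazonaws.com')],
      resources: ['arn:aws:logs:*:*:log-group:/aws/opensearch/*'],
    }),
  ],
});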

XiaowenMaoA commented 10 months ago

@sakurai-ryo I found a related issue created earlier: https://github.com/aws/aws-cdk/issues/23637 Can we use that one to track the log-resource-policy issue in the OpenSearch case? This is blocking our prod readiness, because we will certainly be scaling up to more than 10 domains.

I cannot find a way to pass a shared ResourcePolicy while keeping logging enabled. Is there any possible workaround?

github-actions[bot] commented 10 months ago

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

sakurai-ryo commented 10 months ago

@XiaowenMaoA Yes, of course. As I mentioned in my previous comment, it may be difficult to avoid this; I will consider a PR to solve it.

stoicad commented 6 months ago

Any update on this? We don't want to add policies to our roles, but we are forced to do so when defining the logs.

sakurai-ryo commented 6 months ago

Hi @stoicad. Do you still have this problem when using aws-log-driver?

stoicad commented 6 months ago

Hi @sakurai-ryo, yes we do. We are not allowed to touch the existing roles via CDK, but when defining this:

public createContainer(taskDefinition: FargateTaskDefinition): void {
    //TODO: fix logging on container
    //Tried with Log Group ARN but still not working: "arn:aws:logs:us-east-1::log-group:/ecs/web-taskdefinition:*"
    //logging: LogDrivers.awsLogs({ streamPrefix: `project-name-${this.ecsContextParams.environment}-web` })

    const container = taskDefinition.addContainer(`project-name-${this.ecsContextParams.environment}-web-container`, {
      image: ContainerImage.fromRegistry(this.envConfig.WEB_ECS_IMAGE),
      //logging: LogDrivers.awsLogs({ streamPrefix: `project-name-${this.ecsContextParams.environment}-web` })
    });

    container.addPortMappings({ containerPort: 80 });
    container.addPortMappings({ containerPort: 443 });
  }

it tries to add permissions to the role attached to the ECS task (as a separate policy), even though logs:* is already available on the role.

sakurai-ryo commented 6 months ago

Thanks @stoicad. Does "the existing roles" refer to the ECS execution role? Could you share the entire code and the errors to help us pinpoint the problem?