aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

aws_applicationautoscaling: Error: Only direct metrics are supported for Target Tracking. Use Step Scaling or supply a Metric object. #20659

Open mostafafarzaneh opened 2 years ago

mostafafarzaneh commented 2 years ago

Describe the bug

I would like to use a MathExpression as a custom metric in a TargetTrackingScalingPolicy, but I get this error:

Only direct metrics are supported for Target Tracking. Use Step Scaling or supply a Metric object.

Checking the code here, it only looks at metricStat, not mathExpression.
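For readers who cannot follow the link, the check amounts to something like the sketch below. This is a simplified, self-contained model and not the CDK source: `MetricConfigLike`, `MetricLike` and `renderCustomMetricSketch` are hypothetical names. Only the idea that `toMetricConfig()` yields either a `metricStat` or a `mathExpression`, and the error text, are taken from CDK.

```typescript
// Simplified stand-ins for the CloudWatch metric config; hypothetical types, not CDK's.
interface MetricConfigLike {
  metricStat?: { namespace: string; metricName: string; statistic: string };
  mathExpression?: { expression: string };
}

interface MetricLike {
  toMetricConfig(): MetricConfigLike;
}

// Models the behaviour described above: only `metricStat` is consulted,
// so any metric that is a math expression is rejected.
function renderCustomMetricSketch(metric: MetricLike) {
  const stat = metric.toMetricConfig().metricStat;
  if (!stat) {
    throw new Error(
      'Only direct metrics are supported for Target Tracking. Use Step Scaling or supply a Metric object.'
    );
  }
  return { namespace: stat.namespace, metricName: stat.metricName, statistic: stat.statistic };
}

// A direct metric passes the check.
const direct: MetricLike = {
  toMetricConfig: () => ({
    metricStat: { namespace: 'myMetrics/custom', metricName: 'ActiveConnections', statistic: 'Average' },
  }),
};

// A math expression has no `metricStat`, so it is rejected.
const expression: MetricLike = {
  toMetricConfig: () => ({ mathExpression: { expression: 'm1 / m2' } }),
};

const rendered = renderCustomMetricSketch(direct);

let expressionError = '';
try {
  renderCustomMetricSketch(expression);
} catch (e) {
  expressionError = (e as Error).message;
}
```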

Expected Behavior

It should be possible to define a math expression for Target Tracking.

Current Behavior

Only direct metrics are allowed

Reproduction Steps

Create Target Tracking using math expression

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.27.0

Framework Version

No response

Node.js Version

16.15.0

OS

Debian 10

Language

Python

Language Version

No response

Other information

No response

mostafafarzaneh commented 2 years ago

I also tried to create a metric this way:

   custom_metric = cloudwatch.MathExpression(
      expression='SELECT AVG(ActiveConnections) FROM "myMetrics/custom"',
      period=Duration.minutes(1),
   )

and use it in a StepScalingPolicy. CDK complains:

Alarm contains invalid expressions. (Service: AmazonCloudWatch; Status Code: 400; Error Code: ValidationError; Request ID: 3c245f6f-9d5e-492e-b2e1-e0fa83422594; Proxy: null)

peterwoodworth commented 2 years ago

These properties are directly passed to the ScalingPolicy CloudFormation resource in this property.

Our Metric class supports these properties, while our MathExpression class doesn't. I think we would need additional functionality from CloudFormation for this to be implemented.

shw1n commented 1 year ago

Also encountering this issue

matthias-pichler-warrify commented 1 year ago

> Our Metric class supports these properties, while our MathExpression class doesn't. I think we would need additional functionality from cloudformation for this to be implemented

It seems that CloudFormation's AWS::AutoScaling::ScalingPolicy is indeed lacking some configuration parameters. The CustomizedMetricSpecification type in the AutoScaling API has a Metrics member where expressions can be specified, as shown in the docs. CloudFormation's AWS::AutoScaling::ScalingPolicy CustomizedMetricSpecification, on the other hand, does NOT have the Metrics property.
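To make the API-side shape concrete, here is a rough sketch of a `Metrics` list and two constraints as I read them from the API reference: each entry carries either `Expression` or `MetricStat` (not both), and exactly one entry returns data. The field names match the API usage elsewhere in this thread; the type name, the function name, and the cluster, service and queue names are placeholders.

```typescript
// Field names follow the API usage in this thread; type and function names are hypothetical.
interface MetricDataQuery {
  Id: string;
  Label?: string;
  Expression?: string;
  MetricStat?: {
    Stat: string;
    Metric: {
      Namespace: string;
      MetricName: string;
      Dimensions?: { Name: string; Value: string }[];
    };
  };
  ReturnData?: boolean; // treated as true when omitted
}

// Returns a list of human-readable problems; an empty list means the queries look consistent.
function validateMetricQueries(queries: MetricDataQuery[]): string[] {
  const problems: string[] = [];
  for (const q of queries) {
    const hasExpression = q.Expression !== undefined;
    const hasStat = q.MetricStat !== undefined;
    if (hasExpression === hasStat) {
      problems.push(`${q.Id}: specify exactly one of Expression or MetricStat`);
    }
  }
  const returning = queries.filter((q) => q.ReturnData !== false);
  if (returning.length !== 1) {
    problems.push(`expected exactly one query with ReturnData true, found ${returning.length}`);
  }
  return problems;
}

// Placeholder resources: 'my-queue', 'my-cluster' and 'my-service' are made up.
const backlogPerTask: MetricDataQuery[] = [
  {
    Id: 'm1',
    ReturnData: false,
    MetricStat: {
      Stat: 'Sum',
      Metric: {
        Namespace: 'AWS/SQS',
        MetricName: 'ApproximateNumberOfMessagesVisible',
        Dimensions: [{ Name: 'QueueName', Value: 'my-queue' }],
      },
    },
  },
  {
    Id: 'm2',
    ReturnData: false,
    MetricStat: {
      Stat: 'Average',
      Metric: {
        Namespace: 'ECS/ContainerInsights',
        MetricName: 'RunningTaskCount',
        Dimensions: [
          { Name: 'ClusterName', Value: 'my-cluster' },
          { Name: 'ServiceName', Value: 'my-service' },
        ],
      },
    },
  },
  { Id: 'e1', Label: 'Backlog per task', Expression: 'm1 / m2', ReturnData: true },
];

const problems = validateMetricQueries(backlogPerTask);
```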

hectorsouthern commented 1 year ago

It looks like AWS announced support for this recently: https://www.amazonaws.cn/en/new/2023/application-auto-scaling-supports-metric-math-for-target-tracking-policies/

and the documentation is now available: https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-target-tracking-metric-math.html

It would be great to see this feature in CDK too

SamStephens commented 1 year ago

> It looks like AWS announced support for this recently: https://www.amazonaws.cn/en/new/2023/application-auto-scaling-supports-metric-math-for-target-tracking-policies/
>
> and the documentation is now available: https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-target-tracking-metric-math.html
>
> It would be great to see this feature in CDK too

As per the second page you link to, "This feature is not yet available in AWS CloudFormation." So CDK either has to wait for CloudFormation support, or provide this via a custom resource.

zubairzahoor commented 12 months ago

Any updates on this? Or any workarounds via CDK as of now?

alexbaileyuk commented 11 months ago

@zubairzahoor

I came across this today whilst following the and lost a few hours on it. As a temporary workaround, I've added a custom resource using AwsCustomResource. Not ideal, but it could be an option for you in the meantime.

import { AwsCustomResource, AwsCustomResourcePolicy, PhysicalResourceId } from 'aws-cdk-lib/custom-resources';
import { Construct } from 'constructs';
import { Metric } from 'aws-cdk-lib/aws-cloudwatch';
import { IQueue } from 'aws-cdk-lib/aws-sqs';
import { Effect, PolicyStatement } from 'aws-cdk-lib/aws-iam';
import { buildVersion } from '../../../utils/build-version';

interface EcsSqsMathExpressionAutoScalingPolicyProps {
  targetValue: number;
  resourceId: string; // format service/{cluster_name}/{service_name}
  queue: IQueue;
  taskCountMetric: Metric;
}

export class EcsSqsMathExpressionAutoScalingPolicy extends Construct {
  constructor(scope: Construct, id: string, props: EcsSqsMathExpressionAutoScalingPolicyProps) {
    super(scope, id);

    new AwsCustomResource(this, 'scaling-put-autoscaling-policy', {
      onUpdate: {
        physicalResourceId: PhysicalResourceId.of(`sqs-backlog-scaling-policy/${props.resourceId}`),
        service: 'ApplicationAutoScaling',
        action: 'putScalingPolicy',
        parameters: {
          PolicyName: `sqs-backlog-scaling-policy-${props.resourceId}-${buildVersion}`,
          PolicyType: 'TargetTrackingScaling',
          ResourceId: props.resourceId,
          ScalableDimension: 'ecs:service:DesiredCount',
          ServiceNamespace: 'ecs',
          TargetTrackingScalingPolicyConfiguration: {
            TargetValue: props.targetValue,
            CustomizedMetricSpecification: {
              Metrics: [
                {
                  Id: 'm1',
                  Label: 'Approx. # of Messages Visible',
                  ReturnData: false,
                  MetricStat: {
                    Stat: 'Sum',
                    Metric: {
                      MetricName: props.queue.metricApproximateNumberOfMessagesVisible().metricName,
                      Namespace: props.queue.metricApproximateNumberOfMessagesVisible().namespace,
                      Dimensions: [
                        {
                          Name: 'QueueName',
                          Value: props.queue.queueName
                        }
                      ]
                    }
                  }
                },
                {
                  Id: 'm2',
                  Label: 'Running Instances Count',
                  ReturnData: false,
                  MetricStat: {
                    Stat: 'Average',
                    Metric: {
                      MetricName: props.taskCountMetric.metricName,
                      Namespace: props.taskCountMetric.namespace,
                      Dimensions: Object.entries(props.taskCountMetric.dimensions || {}).map(([key, value]) => ({
                        Name: key,
                        Value: value
                      }))
                    }
                  }
                },
                {
                  Label: 'Backlog per Instance',
                  Id: 'e1',
                  Expression: 'm1 / m2',
                  ReturnData: true
                }
              ]
            }
          }
        }
      },
      policy: AwsCustomResourcePolicy.fromStatements([
        new PolicyStatement({
          effect: Effect.ALLOW,
          actions: ['application-autoscaling:*', 'ecs:DescribeServices', 'ecs:UpdateService'],
          resources: ['*']
        })
      ])
    });
  }
}

You can use it like this:

    this.scaling = this.fargateService.autoScaleTaskCount({
      minCapacity: 0,
      maxCapacity: 100
    });

    const customScalingPolicy = new EcsSqsMathExpressionAutoScalingPolicy(this, 'scaling-policy', {
      targetValue: props.acceptableLatency.toSeconds() / props.averageMessageProcessingTime.toSeconds(),
      resourceId: `service/${props.cluster.clusterName}/${this.fargateService.serviceName}`,
      queue: queue,
      taskCountMetric: desiredCountMetric
    });

    customScalingPolicy.node.addDependency(this.scaling);

It may need some adaptations to meet your needs but it should give you a good starting point.
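As a worked example of the `targetValue` arithmetic used above (acceptable latency divided by average processing time per message), here is a small sketch with made-up numbers. The function name and all figures are hypothetical.

```typescript
// Target backlog per task = acceptable latency / average processing time per message.
function targetBacklogPerTask(acceptableLatencySeconds: number, averageProcessingSeconds: number): number {
  if (averageProcessingSeconds <= 0) {
    throw new Error('averageProcessingSeconds must be positive');
  }
  return acceptableLatencySeconds / averageProcessingSeconds;
}

// Hypothetical numbers: a message may wait up to 300 s and takes about 10 s to process,
// so each task can carry a backlog of 30 messages.
const targetValue = targetBacklogPerTask(300, 10);

// With 1500 visible messages (m1) and 10 running tasks (m2), the expression m1 / m2
// evaluates to 150, well above the target, so the policy would scale out.
const currentBacklogPerTask = 1500 / 10;

// Roughly the task count at which the metric would fall back to the target.
const tasksNeeded = Math.ceil(1500 / targetValue);
```

Note that `m1 / m2` is undefined when no tasks are running, which is the scale-to-zero edge case discussed further down the thread.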

I should mention I've not fully tested this yet so if you notice anything weird then please share :)

zubairzahoor commented 11 months ago

@alexbaileyuk Thank you! I tried this for my use case (with AmazonMQ/ECS) and it seems to work. What are the minimum permissions needed for the execution role of the Lambda here?

alexbaileyuk commented 11 months ago

@zubairzahoor Due to difficulties with this method, I ended up writing a totally different function which pre-calculates the backlog per instance by polling ECS and SQS and publishing the result as a custom metric. Something like this:

import { DescribeServicesCommand, ECSClient, paginateListServices } from '@aws-sdk/client-ecs';
import { CloudWatchClient, MetricDatum, PutMetricDataCommand } from '@aws-sdk/client-cloudwatch';
import { SQSClient, GetQueueAttributesCommand } from '@aws-sdk/client-sqs';

const ecsClient = new ECSClient({
  region: 'eu-west-1'
});

const cloudwatchClient = new CloudWatchClient({
  region: 'eu-west-1'
});

const sqsClient = new SQSClient({
  region: 'eu-west-1'
});

export const putInstanceBacklogMetrics = async (clusterName: string) => {
  const consumerServices = await listServices(clusterName);

  const backlogMetrics = await Promise.all(consumerServices.map((serviceArn) => calculateBacklogForConsumerService(clusterName, serviceArn)));

  const metrics = backlogMetrics.map((backlog) => {
    console.log(`Service ${backlog.serviceName} has desired count ${backlog.desiredCount} and queue length ${backlog.queueLength}`);

    let instanceBacklog = null;

    if (backlog.desiredCount === 0 && backlog.queueLength > 0) {
      // If there are no instances running we have to pretend the backlog is the acceptable backlog per instance + 1
      // so that we scale up to one instance. This allows us to scale down to zero instances when there is no backlog.
      // This will cause some jitter in the instance backlog metric, but it allows us to scale to zero. In test environments
      // it'll be fine, in production we'll have enough traffic that the jitter will be negligible and instances will usually be
      // scaled up to at least one.
      instanceBacklog = backlog.queueLength > backlog.acceptableBacklogPerInstance ? backlog.queueLength : backlog.acceptableBacklogPerInstance + 1;
    } else if (backlog.queueLength === 0) {
      instanceBacklog = 0;
    } else if (backlog.desiredCount > 0) {
      instanceBacklog = backlog.queueLength / backlog.desiredCount;
    } else {
      instanceBacklog = 0;
    }

    return {
      MetricName: 'ConsumerInstanceBacklog',
      Dimensions: [
        {
          Name: 'ClusterName',
          Value: clusterName
        },
        {
          Name: 'ServiceName',
          Value: backlog.serviceName
        }
      ],
      Value: instanceBacklog
    };
  });

  if (metrics.length === 0) {
    console.log('No consumer services found');
    return;
  }

  await putConsumerInstanceBacklogMetric(metrics);
};

const listServices = async (clusterName: string) => {
  const paginator = paginateListServices({ client: ecsClient }, { cluster: clusterName });

  const serviceArns: string[] = [];

  for await (const page of paginator) {
    for (const serviceArn of page.serviceArns ?? []) {
      if (await isConsumerService(clusterName, serviceArn)) {
        serviceArns.push(serviceArn);
      }
    }
  }

  console.log(`Found ${serviceArns.length} consumer services in cluster ${clusterName}`);

  return serviceArns;
};

const isConsumerService = async (clusterName: string, serviceArn: string) => {
  const serviceDetails = await ecsClient.send(
    new DescribeServicesCommand({
      cluster: clusterName,
      services: [serviceArn],
      include: ['TAGS']
    })
  );

  return (
    serviceDetails.services?.[0].tags?.find((tag) => tag.key === 'QueueUrl') !== undefined &&
    serviceDetails.services?.[0].tags?.find((tag) => tag.key === 'AcceptableBacklogPerInstance') !== undefined
  );
};

const calculateBacklogForConsumerService = async (clusterName: string, serviceArn: string) => {
  const serviceDetails = await ecsClient.send(
    new DescribeServicesCommand({
      cluster: clusterName,
      services: [serviceArn],
      include: ['TAGS']
    })
  );

  const desiredCount = serviceDetails.services?.[0].desiredCount || 0;
  // The tag holds the queue URL (not the queue name), which is what getQueueLength expects.
  const queueUrl = serviceDetails.services?.[0].tags?.find((tag) => tag.key === 'QueueUrl')?.value || '';

  const queueLength = await getQueueLength(queueUrl);

  const acceptableBacklogPerInstance = parseInt(
    serviceDetails.services?.[0].tags?.find((tag) => tag.key === 'AcceptableBacklogPerInstance')?.value || '0',
    10
  );

  return {
    serviceName: serviceDetails.services?.[0].serviceName || 'UNKNOWN',
    desiredCount: desiredCount,
    queueLength: queueLength,
    acceptableBacklogPerInstance: acceptableBacklogPerInstance
  };
};

const getQueueLength = async (queueUrl: string) => {
  const queueDetails = await sqsClient.send(
    new GetQueueAttributesCommand({
      QueueUrl: queueUrl,
      AttributeNames: ['ApproximateNumberOfMessages']
    })
  );

  return parseInt(queueDetails.Attributes?.ApproximateNumberOfMessages || '0');
};

const putConsumerInstanceBacklogMetric = async (metrics: MetricDatum[]) => {
  await cloudwatchClient.send(
    new PutMetricDataCommand({
      Namespace: 'ECS/CustomServiceMetrics',
      MetricData: metrics
    })
  );
};

It also relies on some tags on the ECS services. It's a bit messy and not well refined at the moment since I'm still testing and working on edge cases like the scale to zero ones. It loops through all services in a cluster and based on their tags and metrics, defines a new metric called ConsumerInstanceBacklog to do target tracking against.
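The branch logic in `putInstanceBacklogMetrics` can be pulled out into a pure function, which makes the scale-to-zero edge cases easy to check by hand. This is only a sketch: `BacklogInput` and `instanceBacklogFor` are hypothetical names, and it assumes `desiredCount` is never negative.

```typescript
interface BacklogInput {
  desiredCount: number;
  queueLength: number;
  acceptableBacklogPerInstance: number;
}

// Same decisions as the function above, for non-negative desiredCount.
function instanceBacklogFor(input: BacklogInput): number {
  const { desiredCount, queueLength, acceptableBacklogPerInstance } = input;
  if (queueLength === 0) {
    // Nothing to process: report zero so the service can scale in (possibly to zero).
    return 0;
  }
  if (desiredCount > 0) {
    return queueLength / desiredCount;
  }
  // No tasks running but messages are waiting: report a value above the
  // acceptable backlog so that target tracking scales out to at least one task.
  return queueLength > acceptableBacklogPerInstance ? queueLength : acceptableBacklogPerInstance + 1;
}

const emptyQueue = instanceBacklogFor({ desiredCount: 0, queueLength: 0, acceptableBacklogPerInstance: 10 });
const smallBacklogNoTasks = instanceBacklogFor({ desiredCount: 0, queueLength: 3, acceptableBacklogPerInstance: 10 });
const largeBacklogNoTasks = instanceBacklogFor({ desiredCount: 0, queueLength: 25, acceptableBacklogPerInstance: 10 });
const steadyState = instanceBacklogFor({ desiredCount: 4, queueLength: 20, acceptableBacklogPerInstance: 10 });
```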

I'd advise doing something similar. The main issues came on stack updates. You can't create the scaling policy without defining a name and when defining a name I ended up with tons of issues trying to update/replace/rollback etc. I'd recommend not using the above method for those reasons.
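One possible contributor to those update problems, offered as an untested guess: the custom resource above defines only `onUpdate`, and its policy name includes `buildVersion`, so each new version would put a new policy while nothing removes the previous one. The sketch below shows a stable name shared by a put call and a matching delete call, which could be passed as `onUpdate` and `onDelete`. It uses plain objects so it runs without aws-cdk-lib; `SdkCallSketch`, `PolicyIdentity` and the helper names are hypothetical, and the cluster and service names are placeholders.

```typescript
// Local stand-in for the shape of an AwsCustomResource SDK call; not imported from aws-cdk-lib.
interface SdkCallSketch {
  service: string;
  action: string;
  parameters: Record<string, unknown>;
}

interface PolicyIdentity {
  policyName: string; // kept stable across deployments (no build version suffix)
  resourceId: string; // format service/{cluster_name}/{service_name}
}

// The four fields that identify a scaling policy for both put and delete.
const identityParameters = (id: PolicyIdentity) => ({
  PolicyName: id.policyName,
  ResourceId: id.resourceId,
  ScalableDimension: 'ecs:service:DesiredCount',
  ServiceNamespace: 'ecs',
});

// putScalingPolicy creates the policy or updates it in place when the name already exists.
function putPolicyCall(id: PolicyIdentity, configuration: Record<string, unknown>): SdkCallSketch {
  return {
    service: 'ApplicationAutoScaling',
    action: 'putScalingPolicy',
    parameters: {
      ...identityParameters(id),
      PolicyType: 'TargetTrackingScaling',
      TargetTrackingScalingPolicyConfiguration: configuration,
    },
  };
}

// Matching delete call so the policy is removed when the custom resource is deleted.
function deletePolicyCall(id: PolicyIdentity): SdkCallSketch {
  return {
    service: 'ApplicationAutoScaling',
    action: 'deleteScalingPolicy',
    parameters: identityParameters(id),
  };
}

// Placeholder names for illustration only.
const policyIdentity: PolicyIdentity = {
  policyName: 'sqs-backlog-scaling-policy',
  resourceId: 'service/my-cluster/my-service',
};

const putCall = putPolicyCall(policyIdentity, { TargetValue: 30 });
const deleteCall = deletePolicyCall(policyIdentity);
```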

zubairzahoor commented 11 months ago

@alexbaileyuk I am more comfortable using the above; it works well for me. Were there any issues you encountered with scaling in using the custom resource?

alexbaileyuk commented 11 months ago

@zubairzahoor we're going to production later in the week with a more refined version of the code. We've not found any major issues so far.

bmeudre commented 6 months ago

Very interesting discussion. I managed to fix it with CDK-only syntax. I hope it helps 😉

const resourceId = `endpoint/${this.endpointName}/variant/${variant.name}`;

// To define min/max values
const target = new ScalableTarget(this, 'ScalableTarget', {
  serviceNamespace: ServiceNamespace.SAGEMAKER,
  minCapacity: variant.autoScale.minCapacity,
  maxCapacity: variant.autoScale.maxCapacity,
  scalableDimension: 'sagemaker:variant:DesiredInstanceCount',
  resourceId,
});

// We need the endpoint before creating the autoscaling policy
target.node.addDependency(endpoint);

const scalingPolicy = new CfnScalingPolicy(this, 'ScalingPolicy', {
  policyName: resourceId,
  scalingTargetId: target.scalableTargetId,
  policyType: 'TargetTrackingScaling',
  targetTrackingScalingPolicyConfiguration: {
    targetValue: variant.autoScale.targetProcessingTime,
  },
});

// CDK doesn't support math expression in target tracking, adding it in cloudformation manually
scalingPolicy.addPropertyOverride(
  'TargetTrackingScalingPolicyConfiguration.CustomizedMetricSpecification',
  {
    Metrics: [
      {
        Id: 'm1',
        ReturnData: false,
        MetricStat: {
          Stat: 'Average',
          Metric: {
            MetricName: 'TotalProcessingTime',
            Namespace: 'AWS/SageMaker',
            Dimensions: [
              {
                Name: 'EndpointName',
                Value: this.endpointName,
              },
              {
                Name: 'VariantName',
                Value: variant.name,
              },
            ],
          },
        },
      },
      {
        Id: 'm2',
        ReturnData: true,
        Expression: 'FILL(m1, 0)',
      },
    ],
  }
);

YIHONG-JIN commented 3 weeks ago

Now that CloudFormation officially supports target tracking scaling on metric math, we can use the L1 construct. This solution may require a newer aws-cdk-lib version. For example, to scale an ECS service with aws_applicationautoscaling:

// Register the ECS Fargate Service as a scalable target for Application AutoScaling
const serviceScalableTarget = new aws_applicationautoscaling.ScalableTarget(this,
    "serviceScalableTarget",
    {
        serviceNamespace: aws_applicationautoscaling.ServiceNamespace.ECS,
        scalableDimension: "ecs:service:DesiredCount",
        resourceId: `service/${clusterName}/${serviceName}`,
        minCapacity: ecsMinCapacity,
        maxCapacity: ecsMaxCapacity,
    }
)

// Documentation: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-autoscaling-scalingpolicy-targettrackingmetricdataquery.html
const mathExpressionSpecification: CfnScalingPolicy.CustomizedMetricSpecificationProperty = {
    metrics: [
        {
            expression: "approximateNumberOfMessagesVisible / desiredTaskCount",
            id: "sqsBacklogPerECSTask",
            label: "SQSBacklogPerECSTask",
            returnData: true,
        },
        {
            id: "desiredTaskCount",
            label: "DesiredTaskCount",
            metricStat: {
                metric: {
                    namespace: CONTAINER_INSIGHTS_NAMESPACE,
                    metricName: "DesiredTaskCount",
                    dimensions: [{
                        name: "ClusterName",
                        value: clusterName
                    }, {
                        name: "ServiceName",
                        value: serviceName
                    }],
                },
                stat: "Average",
            },
            returnData: false,
        },
        {
            id: "approximateNumberOfMessagesVisible",
            label: "ApproximateNumberOfMessagesVisible",
            metricStat: {
                metric: {
                    namespace: SQS_NAMESPACE,
                    metricName: "ApproximateNumberOfMessagesVisible",
                    dimensions: [{
                        name: "QueueName",
                        value: sqsQueueName
                    }],
                },
                stat: "Average",
            },
            returnData: false,
        }
    ],
};

// Documentation: https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_applicationautoscaling.CfnScalingPolicy.TargetTrackingScalingPolicyConfigurationProperty.html
const serviceScalingPolicy = new aws_applicationautoscaling.CfnScalingPolicy(this,
    "serviceScalingPolicy",
    {
        policyName: "serviceScalingPolicy",
        policyType: "TargetTrackingScaling",
        scalingTargetId: serviceScalableTarget.scalableTargetId,
        targetTrackingScalingPolicyConfiguration: {
            targetValue: targetValueForSQSBacklogPerECSTask,
            scaleInCooldown: scaleInCooldownForTargetTrackingScaling,
            scaleOutCooldown: scaleOutCooldownForTargetTrackingScaling,
            customizedMetricSpecification: mathExpressionSpecification,
        }
    }
)

This solution is equivalent to the one described in "Create a target tracking scaling policy for Application Auto Scaling using metric math".

YIHONG-JIN commented 3 weeks ago

@pahud Looking forward to CDK L2 or L3 support for this feature. Is there a plan for it?