cdklabs / cdk-monitoring-constructs

Easy-to-use CDK constructs for monitoring your AWS infrastructure
https://constructs.dev/packages/cdk-monitoring-constructs
Apache License 2.0

Better estimate for SQS time to drain metrics #390

Open r0b0ji opened 1 year ago

r0b0ji commented 1 year ago

Version

v5.2.3

Steps and/or minimal code example to reproduce

This is not actually a bug, but a better and simpler computation exists. Currently, the SQS time-to-drain metric is calculated as shown in [1], which is indirect. A better estimate can be computed using the RATE function [2].

  1. https://github.com/cdklabs/cdk-monitoring-constructs/blob/81f0c6ba0211bca586c9b994ec7aa037b2cd6e3c/lib/monitoring/aws-sqs/SqsQueueMetricFactory.ts#L82-L92
  2. https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html

Expected behavior

Instead of measuring the consumption rate directly, the current computation estimates it from several other metrics, which is less accurate.

Actual behavior

A better, more direct method can be used.

Other details

Sample CloudWatch graph source for this:

{
    "metrics": [
        [ { "expression": "m1/ABS(RATE(m1))", "label": "TimeToDrain (sec)", "id": "e1", "region": "us-east-1" } ],
        [ "AWS/SQS", "ApproximateNumberOfMessagesVisible", "QueueName", "some-test-queue", { "id": "m1", "visible": false, "region": "us-east-1" } ]
    ],
    "view": "timeSeries",
    "stacked": false,
    "region": "us-east-1",
    "stat": "Average",
    "period": 300
}

r0b0ji commented 1 year ago

Also, the original formula needs to take the absolute value of the difference, so that a negative rate does not skew the average and other statistics of the time-to-drain metric. Time to drain can't be negative; if there are no messages, it is 0. The current formula, however, emits negative datapoints (the graph's minimum is capped at 0, but the datapoints themselves are still negative), which lowers the average.
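
To make the two points above concrete, here is a minimal Python sketch of what `m1/ABS(RATE(m1))` computes and of how negative datapoints drag the average down. It is only an illustration: the sample values are made up, `rate` and `time_to_drain` are hypothetical helpers, and the signed variant is a stand-in for any formula with a signed denominator, not the library's current implementation. Returning `None` when the rate is zero is also an assumption.

```python
from statistics import mean

PERIOD_SECONDS = 300  # same period as in the graph source above


def rate(values, period=PERIOD_SECONDS):
    """Per-second change between consecutive datapoints, in the spirit of RATE()."""
    return [(curr - prev) / period for prev, curr in zip(values, values[1:])]


def time_to_drain(values, period=PERIOD_SECONDS, use_abs=True):
    """Compute m1 / ABS(RATE(m1)) for every datapoint after the first.

    Assumption: a zero rate produces no datapoint (None) rather than an error.
    """
    result = []
    for visible, r in zip(values[1:], rate(values, period)):
        if r == 0:
            result.append(None)
            continue
        denominator = abs(r) if use_abs else r
        result.append(visible / denominator)
    return result


# Made-up ApproximateNumberOfMessagesVisible samples, one per 5-minute period.
visible = [1200, 900, 600, 900, 300]

with_abs = time_to_drain(visible)                  # all datapoints >= 0
without_abs = time_to_drain(visible, use_abs=False)  # sign follows the rate

print("with ABS:   ", with_abs, "avg =", mean(with_abs))
print("without ABS:", without_abs, "avg =", mean(without_abs))
```

With these samples, the ABS variant stays non-negative, while the signed variant mixes positive and negative datapoints and its average ends up below zero even though the graph would only show the part above the 0 axis.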