cdklabs / cdk-monitoring-constructs

Easy-to-use CDK constructs for monitoring your AWS infrastructure
https://constructs.dev/packages/cdk-monitoring-constructs
Apache License 2.0

[step-functions] Step Functions failed execution rate alarm inaccurate #170

Open gordumb opened 2 years ago

gordumb commented 2 years ago

Version

1.8.2

Steps and/or minimal code example to reproduce

Using the addFailedExecutionRateAlarm option as shown below:

addFailedExecutionRateAlarm: {
  Warning: {
    maxErrorRate: 0.50,
    period: Duration.hours(12),
  },
},

Expected behavior

I expect the failure rate to be computed relative to the total number of executions, similar to the following CloudWatch graph definition:

{
    "metrics": [
        [ { "expression": "failures / (failures + successes + aborts + timeouts)", "label": "Failure Rate", "id": "failure_rate" } ],
        [ "AWS/States", "ExecutionsFailed", "StateMachineArn", "arn:aws:states:us-west-2:012345678901:stateMachine:MyStateMachine", { "id": "failures", "visible": false } ],
        [ ".", "ExecutionsSucceeded", ".", ".", { "id": "successes", "visible": false } ],
        [ ".", "ExecutionsAborted", ".", ".", { "id": "aborts", "visible": false } ],
        [ ".", "ExecutionsTimedOut", ".", ".", { "id": "timeouts", "visible": false } ]
    ],
    "view": "timeSeries",
    "stacked": false,
    "period": 43200,
    "annotations": {
        "horizontal": [
            {
                "label": "Failed (avg) > 0.5 for 1 datapoints within 12 hours",
                "value": 0.5
            }
        ]
    },
    "stat": "Sum"
}
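For reference, the arithmetic behind the `failure_rate` expression above can be sketched in plain TypeScript. This is only an illustration: the `ExecutionCounts` type and `failureRate` function are hypothetical names, not part of cdk-monitoring-constructs, and returning 0 when there were no executions is an assumption of the sketch.

```typescript
// Hypothetical helper type: execution counts summed over one alarm period (e.g. 12 hours).
interface ExecutionCounts {
  failures: number;
  successes: number;
  aborts: number;
  timeouts: number;
}

// Mirrors the expression "failures / (failures + successes + aborts + timeouts)".
// Assumption: a period with no finished executions is reported as a rate of 0.
function failureRate(c: ExecutionCounts): number {
  const total = c.failures + c.successes + c.aborts + c.timeouts;
  return total === 0 ? 0 : c.failures / total;
}

// Example: 3 failures out of 10 finished executions.
const exampleRate = failureRate({ failures: 3, successes: 5, aborts: 1, timeouts: 1 });

// Compare against the Warning threshold from the issue (maxErrorRate: 0.50).
const breachesWarning = exampleRate > 0.5;
```

With these sample counts the rate stays below the 0.5 threshold, so the Warning alarm would not fire.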

Actual behavior

The alarm currently uses only the ExecutionsFailed metric with the Average statistic, even though ExecutionsFailed is a count metric. The resulting curve has a shape similar to the expected graph, but its amplitude does not represent the true failure rate.

{
    "metrics": [
        [ "AWS/States", "ExecutionsFailed", "StateMachineArn", "arn:aws:states:us-west-2:012345678901:stateMachine:MyStepFunction", { "id": "m1", "label": "Failed (avg)", "stat": "Average", "visible": true } ]
    ],
    "view": "timeSeries",
    "stacked": false,
    "period": 43200,
    "annotations": {
        "horizontal": [
            {
                "label": "Failed (avg) > 0.5 for 1 datapoints within 12 hours",
                "value": 0.5
            }
        ]
    }
}

Other details

No response

voho commented 2 years ago

I see. Would it be fair to introduce a new RATE_COMPUTATION_METHOD for this scenario, which would be something like "RELATIVE_RATIO"? It could be applicable to other places too.
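A rough sketch of how such an option might differ from the current behavior, in plain TypeScript. Everything here is hypothetical: `RateMethod`, `PeriodStats`, and `failedRate` are illustrative names, not the library's API, and the zero-division fallback is an assumption. The only external fact relied on is that CloudWatch's Average statistic is Sum divided by SampleCount.

```typescript
// Hypothetical computation methods; "RELATIVE_RATIO" is the proposed addition.
type RateMethod = "AVERAGE" | "RELATIVE_RATIO";

// Hypothetical per-period statistics for one state machine.
interface PeriodStats {
  failedSum: number;         // Sum of ExecutionsFailed over the period
  failedSampleCount: number; // number of ExecutionsFailed samples in the period
  finishedSum: number;       // failures + successes + aborts + timeouts
}

function failedRate(method: RateMethod, s: PeriodStats): number {
  if (method === "AVERAGE") {
    // Average = Sum / SampleCount; the total number of executions plays no role.
    return s.failedSampleCount === 0 ? 0 : s.failedSum / s.failedSampleCount;
  }
  // RELATIVE_RATIO: failures as a share of all finished executions.
  return s.finishedSum === 0 ? 0 : s.failedSum / s.finishedSum;
}

// Sample input: 3 failures recorded as 3 samples, 10 finished executions in total.
const stats: PeriodStats = { failedSum: 3, failedSampleCount: 3, finishedSum: 10 };
const averageValue = failedRate("AVERAGE", stats);
const ratioValue = failedRate("RELATIVE_RATIO", stats);
```

For this sample input the two methods give different values, because only the ratio depends on how many executions finished overall.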