cdklabs / cdk-monitoring-constructs

Easy-to-use CDK constructs for monitoring your AWS infrastructure
https://constructs.dev/packages/cdk-monitoring-constructs
Apache License 2.0
443 stars 56 forks source link

minMetricSamplesToAlarm uses incorrect evaluation period #452

Closed miloszwatroba closed 7 months ago

miloszwatroba commented 7 months ago

Version

6.2.2

Steps and/or minimal code example to reproduce

Reproducible with unit tests included in the package: https://github.com/cdklabs/cdk-monitoring-constructs/blob/84609a4d81ad5367ee14a442411c82fa4d368407/test/common/alarm/AlarmFactory.test.ts#L275

Expected behavior

NoSamples child alarm should use the same datapointsToAlarm and evaluationPeriods as the alarm that we are attaching minMetricSamplesToAlarm to.

Example:

  factory.addAlarm(metric, {
    ...props,
    alarmNameSuffix: "none",
    comparisonOperator: ComparisonOperator.LESS_THAN_THRESHOLD,
    minMetricSamplesToAlarm: 500,
    datapointsToAlarm: 5,
    evaluationPeriods: 5,
  });

should create a NoSamples alarm with datapointsToAlarm: 5 and evaluationPeriods: 5.

Actual behavior

NoSamples child alarm uses datapointsToAlarm and evaluationPeriods different than the main alarm - both are set 1 instead.

Other details

The fact that the datapointsToAlarm and evaluationPeriods for NoSamples child alarm are using different values (i.e. 1), causes incorrect Composite Alarm behaviour.

Imagine monitor where you require 5 breaching datapoints to go into alarm with the minMetricSamplesToAlarm configured to an arbitrary value. Consider the following:

NoSamplesAlarm:         [X, X, X, X, 0, 0, 0, 0, 0]
MainAlarm:              [X, X, X, X, X, 0, 0, 0, 0]
---
CompositeWithSamples:   [0, 0, 0, 0, X, 0, 0, 0, 0]

where 0 is a non-breaching data point, X is a breaching data point

Even though there was only a single datapoint that satisfied both conditions, the monitor went into alarm despite the fact that the MainAlarm had datapointsToAlarm configured to 5 datapoints. NoSamplesAlarm generated by cdk-monitoring-constructs is based on datapointsToAlarm: 1, so if that single datapoint coincides with MainAlarm reaching its own threshold, the overall CompositeWithSamples will go into ALARM anyway. This is doesn't seem like a desirable behaviour - for the CompositeWithSamples to go into ALARM, all breaching datapoints should have sufficient number of samples.

If this is not a bug, is there any rationale behind the NoSamples using a different datapointsToAlarm and evaluationPeriods than the MainAlarm?

akiesler commented 7 months ago

So excited to see this happen. This will be such an improvement for our monitoring