Open akiesler opened 1 year ago
For anyone else facing this same issue I was able to hack together this solution.
import { Alarm, CfnAlarm, MathExpression } from 'aws-cdk-lib/aws-cloudwatch';
import { MonitoringFacade } from 'cdk-monitoring-constructs';
// Hack the underlying CFN types to be mutable so we can modify them.
type MetricDataQueryProperty = {
-readonly [K in keyof CfnAlarm.MetricDataQueryProperty]: CfnAlarm.MetricDataQueryProperty[K];
};
type MetricStatProperty = {
-readonly [K in keyof CfnAlarm.MetricStatProperty]: CfnAlarm.MetricStatProperty[K];
};
export function fixMetricMathMinSamples(monitoringFacade: MonitoringFacade) {
// Find all minimum samples metric math alarms
const alarms = monitoringFacade.node.children
.filter((c) => c instanceof Alarm)
.map((alarm) => alarm as Alarm)
.filter((alarm) => alarm.node.id.includes('NoSamples') && alarm.metric instanceof MathExpression);
alarms.forEach((alarm) => {
const metrics = (alarm.node.defaultChild! as CfnAlarm).metrics! as MetricDataQueryProperty[];
// Convert Metric Math to use the MAX Sample Count
metrics.filter((m) => m.expression).forEach((m) => (m.expression = 'MAX(METRICS())'));
// Convert Metrics to using SampleCount
metrics.filter((m) => m.metricStat).forEach((m) => ((m.metricStat as MetricStatProperty).stat = 'SampleCount'));
});
}
Would love to see the logic fixed in the library itself.
This is where minMetricSamplesToAlarm
is currently handled: https://github.com/cdklabs/cdk-monitoring-constructs/blob/main/lib/common/alarm/AlarmFactory.ts#L575-L607
We create an additional alarm based on a sample count variation of the provided metric that's then used in a composite alarm.
I'm not entirely sure how to best solve for this. Potentially the most straightforward naive solution would be to create a noSamplesAlarm
equivalent for every underlying metric (e.g., Object.values(adjustedMetric.usingMetrics)
), then use all of those in the composite alarm. There's a risk that we'd hit the limit for a composite alarm though.
On the other hand, DATAPOINT_COUNT
is not recommend for use in alarms as per the docs:
We recommend that you do not use this function in CloudWatch alarms. Whenever an alarm evaluates whether to change state, CloudWatch attempts to retrieve a higher number of data points than the number specified as Evaluation Periods. This function acts differently when extra data is requested.
Considering you provided the above hacky solution, did you have any opinion @akiesler ?
I think the devil is in the details. Do we know how the function behaves differently and went what extra data is requested?
This function acts differently when extra data is requested.
My own experience has been that using the DATAPOINT_COUNT
hasn't resulted in any significant issues but I wouldn't want to say it will work for all users.
I think that it would be a viable stop-gap to ensure customers are getting at least some protection while we investigate the long-term solution.
Version
5.3.4
Steps and/or minimal code example to reproduce
Expected behavior
I would expect the corresponding "NoSamples" alarm to either check each metrics has the necessary number of samples or wrap the entire expression in
DATAPOINT_COUNT
to check for the correct sample count.Actual behavior
The same metric alarm is created because there is no statistic prop to update.
Other details
No response