DataDog / serverless-plugin-datadog

Serverless plugin to automagically instrument your Lambda functions with Datadog
Apache License 2.0

Not able to create high_error_rate monitor, 400 bad request #545

Closed: mariohoyos92 closed this issue 3 days ago

mariohoyos92 commented 1 week ago

Expected Behavior

When using the high_error_rate recommended monitor, I would expect this to successfully create a monitor.

Actual Behavior

When using the high_error_rate recommended monitor, monitor creation fails with 400 Bad Request: "This could be due to incorrect syntax or a missing required tag" for high_error_rate.

I believe this happens because the recommended monitors fetched from Datadog don't include anything keyed as high_error_rate, so the isRecommendedMonitor function returns false and we never set a query in the request body when creating the monitor.

I also think this line https://github.com/DataDog/serverless-plugin-datadog/blob/2d1e5f66bd950f5d0f4d50f2759de46ee4873e62/src/monitor-api-requests.ts#L214 doesn't actually do anything: after inspecting the payload we get back from Datadog, the IDs are prefixed with serverless- (hyphen), not serverless_ (underscore).
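
To make the mismatch concrete, here is a minimal TypeScript sketch (hypothetical code, not the plugin's actual implementation; only the isRecommendedMonitor name comes from the plugin) of why the lookup fails when the fetched IDs use the serverless- prefix:

    // Hypothetical sketch of the lookup mismatch described above.
    // The recommended-monitor IDs returned by the Datadog API are prefixed
    // with "serverless-" (hyphen), and none of them is keyed "high_error_rate".
    const fetchedRecommendedMonitorIds = [
      "serverless-lambda_function_invocations_are_failing",
      "serverless-lambda_function_is_timing_out",
    ];

    function isRecommendedMonitor(monitorId: string): boolean {
      // Stripping a "serverless_" (underscore) prefix is a no-op on these IDs,
      // so a short ID like "high_error_rate" never matches and the plugin
      // falls through without setting a query on the create-monitor request.
      return fetchedRecommendedMonitorIds.some(
        (id) => id.replace("serverless_", "") === monitorId,
      );
    }

    console.log(isRecommendedMonitor("high_error_rate")); // false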

Steps to Reproduce the Problem

  1. Attempt to create a fresh high_error_rate recommended monitor
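
For reference, a minimal serverless.yml sketch that should reproduce this, based on the plugin's documented monitors option (the key placeholders are illustrative; adjust to your setup):

    custom:
      datadog:
        apiKey: <YOUR_DATADOG_API_KEY>
        appKey: <YOUR_DATADOG_APP_KEY>
        monitors:
          - high_error_rate: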

Specifications

Stacktrace

Here is what I logged when calling the getRecommendedMonitors API:

recommendedMonitors: {
    'serverless-[enhanced_metrics]_lambda_function_cold_start_rate_is_high': {
      name: 'High Cold Start Rate on $functionName in $regionName for $awsAccount',
      threshold: 0.2,
      message: 'More than 20% of the function's invocations were cold starts in the selected time range. This monitor uses an enhanced metric - to receive enhanced metrics, instrument your Lambda function with the [Datadog Lambda Extension](https://docs.datadoghq.com/serverless/libraries_integrations/extension/). Datadog's [enhanced metrics](https://docs.datadoghq.com/serverless/enhanced_lambda_metrics) and [distributed tracing](https://docs.datadoghq.com/serverless/distributed_tracing) can help you understand the impact of cold starts on your applications today. {{#is_alert}} \n' +
        ' Resolution: \n' +
        ' * Cold starts occur when your serverless applications receive sudden increases in traffic, and can occur when the function was previously inactive or when it was receiving a relatively constant number of requests. \n' +
        ' * Users may perceive cold starts as slow response times or lag. To get ahead of cold starts, consider enabling [provisioned concurrency](https://www.datadoghq.com/blog/monitor-aws-lambda-provisioned-concurrency/) on your impacted Lambda functions. Note that this could affect your AWS bill. {{/is_alert}}',
      type: 'query alert',
      query: [Function: query],
      templateVariables: [Array]
    },
    'serverless-[enhanced_metrics]_lambda_function_cost_is_increasing': {
      name: 'Increased Cost on $functionName in $regionName for $awsAccount',
      threshold: 20,
      message: 'Estimated cost of invocations have increased more than 20%. This monitor uses an enhanced metric - to receive enhanced metrics, instrument your Lambda function with the [Datadog Lambda Extension](https://docs.datadoghq.com/serverless/libraries_integrations/extension/). {{#is_alert}} To investigate further: \n' +
        ' * Look into other function metrics to understand what might be causing the increase in bill. \n' +
        ' * Turn on [cloud cost monitoring](https://docs.datadoghq.com/cloud_cost_management/) to get rich observability into your infrastructure. {{/is_alert}}',
      type: 'query alert',
      query: [Function: query],
      templateVariables: [Array]
    },
    'serverless-[enhanced_metrics]_lambda_function_is_running_out_of_memory': {
      name: 'Out of Memory on $functionName in $regionName for $awsAccount',
      threshold: 0,
      message: 'At least one invocation in the selected time range ran out of memory. This monitor uses an enhanced metric - to receive enhanced metrics, instrument your Lambda function with the [Datadog Lambda Extension](https://docs.datadoghq.com/serverless/libraries_integrations/extension/). {{#is_alert}} Resolution: Lambda functions that use more than their allocated memory can be terminated by the Lambda runtime. To users, this may look like failed requests to your application. Consider increasing the amount of memory your Lambda function is allowed to use. If the function runtime is Node or Python, explore [Profiling](https://docs.datadoghq.com/serverless/aws_lambda/profiling/) to identify parts of your application using excessive amounts of memory. {{/is_alert}}',
      type: 'query alert',
      query: [Function: query],
      templateVariables: [Array]
    },
    'serverless-lambda_function_invocations_are_failing': {
      name: 'High Error Rate on $functionName in $regionName for $awsAccount',
      threshold: 0.1,
      message: 'More than 10% of the function's invocations were errors in the selected time range.  {{#is_alert}} \n' +
        ' Resolution: \n' +
        ' * Look for failures in traces with errors. Examine the function's top errors.\n' +
        ' * Go through recent function errors logs. \n' +
        ' * Check for recent code or configuration changes. {{/is_alert}}',
      type: 'query alert',
      query: [Function: query],
      templateVariables: [Array]
    },
    "serverless-lambda_function's_iterator_age_is_increasing": {
      name: 'High Iterator Age on $functionName in $regionName for $awsAccount',
      threshold: 86400000,
      message: 'The function's iterator was older than 24 hours. Iterator age measures the age of the last record for each batch of records processed from a stream. When this value increases, it means your function cannot process data fast enough. {{#is_alert}} \n' +
        ' Resolution: \n' +
        '  * If [distributed tracing](https://docs.datadoghq.com/serverless/distributed_tracing) is enabled, see what is causing latency in your function traces. \n' +
        ' * Visit function logs to see what is happening in your function \n' +
        ' * You can also consider increasing the shard count and batch size of the stream your function reads from. {{/is_alert}}',
      type: 'query alert',
      query: [Function: query],
      templateVariables: [Array]
    },
    'serverless-lambda_function_invocations_are_throttling': {
      name: 'High Throttles on $functionName in $regionName for $awsAccount',
      threshold: 0.2,
      message: 'More than 10% of invocations in the selected time range were throttled. Throttling occurs when your serverless Lambda applications receive high levels of traffic without adequate [concurrency](https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html). {{#is_alert}} \n' +
        ' Resolution: \n' +
        ' * Check your Lambda metrics and confirm if `aws.lambda.concurrent_executions.maximum` is approaching your AWS account concurrency level. If so, consider configuring reserved concurrency, or request a service quota increase from AWS. \n' +
        ' * Note that this may affect your AWS bill. {{/is_alert}}',
      type: 'query alert',
      query: [Function: query],
      templateVariables: [Array]
    },
    'serverless-some_of_your_step_functions_failed_upon_attempted_execution': {
      name: 'Some of your executions in state machine {{statemachinename.name}} failed upon attempted execution',
      threshold: 20,
      message: '{{#is_alert}}\n' +
        'Some of your executions in state machine {{statemachinename.name}} failed when they attempted to execute. To investigate this issue, navigate to [the Step Functions page of the AWS Serverless integration](/functions?search=statemachinename%3A{{statemachinename.name}}cloud=aws&entity_view=step_functions).\n' +
        '{{/is_alert}}',
      type: 'query alert',
      query: [Function: query],
      templateVariables: undefined
    },
    'serverless-many_executions_in_your_step_function_are_throttling': {
      name: 'Many executions in your state machine {{statemachinename.name}} are throttling',
      threshold: 20,
      message: '{{#is_alert}}\n' +
        'A high percentage of your executions in state machine {{statemachinename.name}} are throttling. To investigate this issue, navigate to [the Step Functions page of the AWS Serverless integration](/functions?search=statemachinename%3A{{statemachinename.name}}&cloud=aws&entity_view=step_functions).\n' +
        '{{/is_alert}}',
      type: 'query alert',
      query: [Function: query],
      templateVariables: undefined
    },
    'serverless-your_step_functions_are_timing_out_often': {
      name: 'Executions in state machine {{statemachinename.name}} are timing out often',
      threshold: 20,
      message: '{{#is_alert}}\n' +
        'Your state machine {{statemachinename.name}} has executions that are timing out frequently. To investigate this issue, navigate to [the Step Functions page of the AWS Serverless integration](/functions?search=statemachinename%3A{{statemachinename.name}}&cloud=aws&entity_view=lambda_functions) to see which are timing out.\n' +
        '{{/is_alert}}',
      type: 'query alert',
      query: [Function: query],
      templateVariables: undefined
    },
    'serverless-lambda_function_is_timing_out': {
      name: 'Timeout on $functionName in $regionName for $awsAccount',
      threshold: 1,
      message: 'At least one invocation in the evaluated time range timed out. This occurs when your function runs for longer than the configured timeout or the global Lambda timeout. \n' +
        ' {{#is_alert}} Resolution: \n' +
        ' * View slow traces for this function to help you pinpoint slow requests to APIs and other microservices.\n' +
        ' * You can also consider increasing the timeout of your function. Note that this could affect your AWS bill. {{/is_alert}}',
      type: 'query alert',
      query: [Function: query],
      templateVariables: [Array]
    }
  }
lym953 commented 1 week ago

Thanks for reporting! I'll try to take a look within one week.

lym953 commented 1 week ago

I reproduced the issue and identified the cause. I'm trying to figure out how to fix it.

lym953 commented 3 days ago

A fix has been released, which should resolve this issue. Closing it. Feel free to reopen if you have any questions.