DataDog / datadog-lambda-python

The Datadog AWS Lambda Layer for Python
https://docs.datadoghq.com/integrations/amazon_lambda/#installing-and-using-the-datadog-layer
Apache License 2.0

When a line of code raises outside of the handler function, then datadog is not able to detect the error. #210

Closed nalepae closed 6 months ago

nalepae commented 2 years ago

Expected Behavior

When a line of code raises outside of the handler function, then datadog should be able to detect the error.

Actual Behavior

When a line of code raises outside of the handler function, Datadog is not able to detect the error. ==> If a monitor (attached to a Slack alert) is set up to fire when an uncaught exception is raised in this lambda, the monitor is not triggered and no Slack message is sent.

Steps to Reproduce the Problem

Define the following Lambda function:

import json

# The following line will raise on purpose
0/0

def lambda_handler(event, context):
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }

Set

Run the lambda. ==> Even though the line 0/0 raises, no trace is visible in the Invocation Serverless part of Datadog.

Note that we see invocations in the top-left chart (3 blue vertical bars), but there is nothing in the center panel ("No traced invocation in the time window"), so there is no way to see the traces, the Python stack trace, etc.

If we move the 0/0 in the handler, like below:

import json

def lambda_handler(event, context):
    # The following line will raise on purpose
    0/0
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }

then Datadog behaves correctly (visible Traces, Monitor, Slack Message ...)

Specifications

Additional information

I understand that we specify the handler to Datadog, and thus it cannot be aware of things running outside of the handler, but as indicated in the AWS best practices, there are benefits to running some code outside of the handler. If this code fails, it is very important that the development team is notified.

Take advantage of execution environment reuse to improve the performance of your function. Initialize SDK clients and database connections outside of the function handler, and cache static assets locally in the /tmp directory. Subsequent invocations processed by the same instance of your function can reuse these resources. This saves cost by reducing function run time.

astuyve commented 2 years ago

Hi @nalepae - thanks for this ticket. I'm sorry for the delay in responding to you.

The Datadog library works by wrapping your handler function, so if you've got a syntax error, import error, or divide by zero error outside of your handler function, I can imagine there could be scenarios where we can't catch the failure.
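To illustrate why wrapping only covers the handler call, here is a minimal sketch. It is not the Datadog implementation: `traced` and `captured_errors` are hypothetical stand-ins for a wrapper and for whatever it would report.

```python
import functools

captured_errors = []  # stand-in for what a monitoring library would report


def traced(handler):
    """Hypothetical wrapper: it only sees exceptions raised while handler runs."""
    @functools.wraps(handler)
    def wrapper(event, context):
        try:
            return handler(event, context)
        except Exception as exc:
            captured_errors.append(repr(exc))
            raise
    return wrapper


@traced
def failing_handler(event, context):
    return 0 / 0  # raised inside the wrapped call, so the wrapper records it


def load_module_with_top_level_error():
    """Simulate importing a module whose top-level code raises."""
    source = "0/0\ndef lambda_handler(event, context):\n    return 'ok'\n"
    # The exception is raised while the module body runs, before any
    # handler exists, so the wrapper is never involved
    exec(source, {})
```

Calling `failing_handler` adds an entry to `captured_errors`, while `load_module_with_top_level_error` raises without the wrapper ever running.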

I recently attempted this:

import json
def throw():
  0/0

def hello(event, context):

    body = {
        "message": "Go Serverless v1.0! Your function executed successfully!",
        "input": event
    }

    response = {
        "statusCode": 200,
        "body": json.dumps(body),
        "headers": {"content-type": "application/json"}
    }

    return response

And I see logs, metrics, and traces, including the trace with the error.

As per your note, I think most users will likely create methods outside of their handler functions and call them from the handler in order to memoize a connection or cache data, and these calls would be traced and captured by Datadog in the event of a failure.

However, when I removed the throw method and instead just divided by zero, the function crashed entirely with "An unknown application error occurred". This causes our runtime to crash, which is why it's not reported in Datadog.

This looks like it's a bug which could be on our end or something AWS can fix. I'll update you with more information soon.

Thanks again!

astuyve commented 2 years ago

Closing as there has been no reply for over 30 days.

nalepae commented 2 years ago

Closing as there has been no reply for over 30 days.

Yes, but the issue is still here!

astuyve commented 2 years ago

Hi! Thanks for the reply, I'm sorry about that. I'm returning from a few weeks of vacation and mis-remembered my own reply. I think there are a few options here, I'll explore how we can solve this either in the library itself or in the extension.

Thanks!

MatejBalantic commented 9 months ago

I'd like to +1 this issue. In our case, we run database migrations outside of the handler because we want them to happen only on cold starts (first-time lambda starts) rather than on all subsequent warm executions. As we use provisioned concurrency, we've got quite a number of lambdas running all the time and gain a lot of benefit from this setup.

However, if our database migration crashes (it does; that's why I am here :)), the errors don't show up in DataDog.

nalepae commented 9 months ago

Yes, my use case was the same: a lot of work to do outside the handler, because I want it to happen only on cold starts. And if something crashes when this "out of handler code" is executed, then Datadog is blind to the event.

astuyve commented 9 months ago

Hi folks, this should still be flagged in log-based error tracking. Is that not showing up?

Is the ask here for this to create an APM span upon failure? Where else would you expect to see init failures flagged?

Thank you!
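For anyone relying on logs in the meantime, here is a rough sketch of emitting one JSON log line when init code fails and then re-raising. The helper names `log_init_failure` and `run_init` and the field names in the record are assumptions for illustration; this thread does not confirm which log format error tracking picks up.

```python
import json
import sys
import traceback


def log_init_failure(exc):
    """Write one JSON line describing an init failure to stdout and return it."""
    record = {
        "level": "ERROR",
        "message": "Lambda init failed",
        # Field names below are assumptions, not a confirmed schema
        "error.kind": type(exc).__name__,
        "error.message": str(exc),
        "error.stack": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }
    line = json.dumps(record)
    print(line, file=sys.stdout, flush=True)
    return line


def run_init(init_fn):
    """Run cold-start code; log any failure, then let it propagate unchanged."""
    try:
        return init_fn()
    except Exception as exc:
        log_init_failure(exc)
        raise
```

At module level this would be used as `RESOURCES = run_init(some_setup_function)`, where `some_setup_function` is whatever cold-start work the function needs.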

MatejBalantic commented 9 months ago

Exactly, it should be shown in APM as a trace/span, as with any other error. This is where we always start our investigation. It is also what drives our error tracking and monitoring, such as alerts for an exceeded error ratio. Right now it flies under the radar.

duncanista commented 6 months ago

Closing due to #475, and latest release of this package including it.