DataDog / datadog-lambda-js

The Datadog AWS Lambda Library for Node
Apache License 2.0

Cold start is unacceptably slow #384

Closed shyouhei closed 1 year ago

shyouhei commented 1 year ago

Expected Behavior

This is a trace I get (via using AWS X-Ray) without having the Datadog-Node18-x layer:

Actual Behavior

This is the trace I actually get (note the flame graph is in seconds, not in milliseconds):

Steps to Reproduce the Problem

I have uploaded a repo to reproduce: https://github.com/shyouhei/datadog-agent-cold-start-issue

astuyve commented 1 year ago

Hi @shyouhei - thanks for reaching out! I appreciate you including a repo! I have several questions which I hope will help narrow in on the possible contributing factors.

The cold start of your function went from 215ms to 910ms, but the actual function duration in the latter case was 3 full seconds. Given that the handler code you provided is relatively empty, I presume you're raising an issue about the significantly increased function duration, is that correct? Not specifically the initialization duration?

Your reproduction case doesn't include the configured memory, and I can't tell from your screenshot how much memory was consumed. If your function is configured with 128 MB and also uses the Datadog Lambda Extension, I'd suggest increasing the configured memory to at least 256 MB, which may solve the issue.

Additionally, to rule out possible interactions with X-Ray, can you disable X-Ray and reproduce this issue using Datadog tracing?

As a further troubleshooting step, it would be helpful to separate this library from the agent. The Terraform template you've provided applies both the Datadog Lambda Extension (which is the Datadog Agent) and this library, datadog-lambda-js. As a test, we could remove the Datadog Lambda Extension and use the Datadog Lambda Forwarder instead. This would help eliminate any post-runtime duration incurred by transmitting telemetry data from ap-northeast-1 back to us-east-1. Could you try that and see if it resolves the issue?

If this does reduce latency, I'd suggest testing this using our new datacenter in Japan. That should help reduce geographically-induced latency.

Finally, to confirm what I can't see from your screenshot: how much memory is this function configured with?

Thanks!

astuyve commented 1 year ago

Hi @shyouhei!

I've attempted to reproduce this with Datadog tracing instead of X-Ray, and was not able to reproduce the issue. My function uses the handler code you provided, runs on the nodejs16.x runtime in ap-northeast-1, and sends telemetry data back to Datadog's US1 datacenter.

The duration of the function was 2.83 ms, and it took roughly 100 ms after the Lambda function execution finished to flush telemetry data back to Datadog:

image

The cold start was around 800ms:

image

I think the next course of action would be to try what I've done here (using Datadog tracing instead of X-Ray) and see whether this resolves the issue.

I did this using the Serverless Framework, as I don't have Terraform set up, but the outcome should be identical. You can reproduce with this template, deploying with DD_API_KEY=<yourkey> serverless deploy:

service: ap-northeast-1
frameworkVersion: '3'

provider:
  name: aws
  runtime: nodejs16.x
  region: ap-northeast-1

custom:
  datadog:
    apiKey: ${env:DD_API_KEY}

functions:
  hello:
    handler: handler.hello
    events:
      - httpApi:
          method: get
          path: /hello

plugins:
  - serverless-plugin-datadog
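
The template points at `handler.hello`, so it needs a `handler.js` next to it. The handler from the reproduction repo isn't shown in this thread, so the following is only a hypothetical stand-in that is similarly empty:

```javascript
// handler.js
// Hypothetical stand-in for the near-empty handler in the reproduction repo.
const hello = async (event) => ({
  statusCode: 200,
  body: JSON.stringify({ message: 'hello' }),
});

// Matches `handler: handler.hello` in the template above.
exports.hello = hello;
```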

Thank you!

astuyve commented 1 year ago

Hi @shyouhei!

Just wanted to check and see if you've had a chance to test the changes I suggested earlier. Any luck?

Thanks again!

shyouhei commented 1 year ago

Hello @astuyve. Thank you very much for your super quick response, and sorry for being slow.

I have since contacted AWS support about this. There is no concrete answer yet, but the problem could be on their side. I will also try your suggestion, and I'll let you know about any updates when they are available. Thank you very much!

astuyve commented 1 year ago

Hi @shyouhei!

Just wondering if you've been able to test the changes I suggested earlier - have you had any success?

Thanks!

shyouhei commented 1 year ago

Sorry, I was talking with AWS. No progress on my side.

Let me tentatively close this issue. I will reopen it when I have more info. Sorry!

astuyve commented 1 year ago

No need to apologize at all, @shyouhei.

If you ever have any questions or concerns, please do not hesitate to reach out!

Thanks!