DataDog / datadog-lambda-extension

Datadog Lambda Extension

Are +1 second cold starts on NodeJS to be expected? #188

Closed · eliasbrange closed 10 months ago

eliasbrange commented 10 months ago

The documentation states that a short increase in overhead is expected, but there are no pointers as to the expected size of this increase.

We've been seeing at least 1-second increases in cold starts across the board using both NodeJS 18 and 20, which feels a bit much IMO.

I've used the https://github.com/astuyve/cold-start-benchmarker tool to measure the total round-trip time for both an instrumented and a non-instrumented Lambda function.
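Conceptually, the measurement is just timing a single Invoke call against a freshly created sandbox. A minimal sketch of that idea with the AWS SDK v3 (not the benchmarker itself; the function name below is a placeholder) looks roughly like this:

// Time one Invoke round trip. To measure a cold start, force a fresh sandbox
// first, e.g. by updating the function configuration between runs.
const { LambdaClient, InvokeCommand } = require("@aws-sdk/client-lambda");

const client = new LambdaClient({ region: "eu-north-1" });

async function timeInvoke(functionName) {
  const start = Date.now();
  await client.send(new InvokeCommand({ FunctionName: functionName }));
  return Date.now() - start;
}

timeInvoke("coldstart-with-dd-dev-function1") // placeholder function name
  .then((ms) => console.log(`round trip: ${ms} ms`))
  .catch(console.error);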

For testing, I've used the following minimal samples:

Handler function:

module.exports.handler = async (event) => {
  console.log(event);
  return {
    statusCode: 200,
    body: JSON.stringify(
      {
        message: "Hello there!",
      },
      null,
      2
    ),
  };
};

Serverless.yml (non-instrumented):

service: coldstart-without-dd
frameworkVersion: "3"

provider:
  name: aws
  region: eu-north-1
  runtime: nodejs20.x

functions:
  function1:
    handler: index.handler

Serverless.yml (instrumented):

service: coldstart-with-dd
frameworkVersion: "3"

plugins:
  - serverless-plugin-datadog

provider:
  name: aws
  region: eu-north-1
  runtime: nodejs20.x

custom:
  datadog:
    # api key set using DATADOG_API_KEY environment variable
    site: datadoghq.eu
    env: dev
    service: coldstart-with-dd
    version: 1.0.0

functions:
  function1:
    handler: index.handler

I'm using the latest serverless-plugin-datadog as of this writing, which uses the following layers:

arn:aws:lambda:eu-north-1:464622532012:layer:Datadog-Node20-x:104
arn:aws:lambda:eu-north-1:464622532012:layer:Datadog-Extension:52

Results

Just adding the extension through the Serverless plugin increases our cold start RTT from 250-300 ms to 1200-1400 ms. See the following graph of average RTT in ms. This data was gathered using https://github.com/astuyve/cold-start-benchmarker.

[Graph: average cold start RTT in ms, instrumented vs. non-instrumented]

Logs

Debug logs:

```
2023-12-21T10:23:04.396+01:00 INIT_START Runtime Version: nodejs:20.v13 Runtime Version ARN: arn:aws:lambda:eu-north-1::runtime:0593533e3a90a967b5685780855b7a89f7ba80b7d910603abf59369c6c4c7331
2023-12-21T10:23:04.712+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | Datadog extension version : 52|Datadog environment variables: DD_API_KEY=***|DD_CAPTURE_LAMBDA_PAYLOAD=false|DD_ENV=dev|DD_LAMBDA_HANDLER=index.handler|DD_LOGS_INJECTION=false|DD_LOG_LEVEL=debug|DD_MERGE_XRAY_TRACES=false|DD_REMOTE_CONFIGURATION_ENABLED=false|DD_SERVERLESS_LOGS_ENABLED=true|DD_SERVICE=coldstart-with-dd|DD_SITE=datadoghq.eu|DD_TRACE_ENABLED=true|DD_VERSION=1.0.0|
2023-12-21T10:23:04.716+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | Starting daemon to receive messages from runtime...
2023-12-21T10:23:04.725+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | Unable to restore the state from file
2023-12-21T10:23:04.733+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | No config file detected, using environment variable based configuration only
2023-12-21T10:23:04.733+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | 'use_proxy_for_cloud_metadata' is enabled: adding cloud provider URL to the no_proxy list
2023-12-21T10:23:04.733+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | FIPS mode is disabled
2023-12-21T10:23:04.733+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | INFO | 0 Features detected from environment:
2023-12-21T10:23:04.734+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | Using a SyncForwarder with a 5s timeout
2023-12-21T10:23:04.739+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | INFO | Retry queue storage on disk is disabled
2023-12-21T10:23:04.739+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | INFO | Creating TimeSampler #0
2023-12-21T10:23:04.739+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | 'telemetry.dogstatsd.aggregator_channel_latency_buckets' is empty, falling back to default values
2023-12-21T10:23:04.739+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | Forwarder started
2023-12-21T10:23:04.739+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | Demultiplexer started
2023-12-21T10:23:04.739+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | 'telemetry.dogstatsd.listeners_latency_buckets' is empty, falling back to default values
2023-12-21T10:23:04.739+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | 'telemetry.dogstatsd.listeners_channel_latency_buckets' is empty, falling back to default values
2023-12-21T10:23:04.742+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | dogstatsd-udp: 127.0.0.1:8125 successfully initialized
2023-12-21T10:23:04.742+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | DogStatsD will run 2 workers
2023-12-21T10:23:04.744+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | INFO | dogstatsd-udp: starting to listen on 127.0.0.1:8125
2023-12-21T10:23:04.750+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | appsec: security monitoring is not enabled: DD_SERVERLESS_APPSEC_ENABLED is not set to true
2023-12-21T10:23:04.750+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | Enabling telemetry collection HTTP route
2023-12-21T10:23:04.750+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | Subscribing to Telemetry for types: [platform function extension]
2023-12-21T10:23:04.750+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | No config file detected, using environment variable based configuration only
2023-12-21T10:23:04.750+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | 'use_proxy_for_cloud_metadata' is enabled: adding cloud provider URL to the no_proxy list
2023-12-21T10:23:04.751+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | otlp endpoint disabled
2023-12-21T10:23:04.751+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | FIPS mode is disabled
2023-12-21T10:23:04.753+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | INFO | 0 Features detected from environment:
2023-12-21T10:23:04.754+01:00 TELEMETRY Name: datadog-agent State: Subscribed Types: [Platform, Function, Extension]
2023-12-21T10:23:04.756+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | INFO | Loaded configuration: /var/task/datadog.yaml
2023-12-21T10:23:04.756+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | Setting DefaultEnv to "dev" (from 'env' config option)
2023-12-21T10:23:04.758+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | INFO | Starting logs-agent...
2023-12-21T10:23:04.758+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | INFO | logs-agent started
2023-12-21T10:23:04.758+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | Adding AWS Logs Log Source
2023-12-21T10:23:04.760+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | Starting ColdStartSpanCreator
2023-12-21T10:23:04.760+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | Setting ColdStartSpanCreator on Daemon
2023-12-21T10:23:04.760+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | finding the lambda runtime took 606.381µs. found runtime: nodejs20.x
2023-12-21T10:23:04.771+01:00 2023-12-21 09:23:04 UTC | DD_EXTENSION | DEBUG | serverless agent ready in 54.868115ms
2023-12-21T10:23:05.293+01:00 EXTENSION Name: datadog-agent State: Ready Events: [INVOKE, SHUTDOWN]
2023-12-21T10:23:05.294+01:00 START RequestId: 47d7f666-dda2-42eb-890f-8e43c71800e5 Version: $LATEST
2023-12-21T10:23:05.295+01:00 2023-12-21 09:23:05 UTC | DD_EXTENSION | DEBUG | Received invocation event...
```

Something that stands out is the time between the serverless agent ready log and EXTENSION Name: datadog-agent... log. What is happening during those 522 milliseconds?

astuyve commented 10 months ago

Hi @eliasbrange,

Thanks so much for reaching out!

The short answer is that it depends 😄

What you can do today

Your current configuration enables automatic APM instrumentation, logs, and metrics – all with the Datadog extension and a zip-based Lambda function.

That's going to be on the higher end of the cold start spectrum because you're effectively replacing all of CloudWatch with a combination of the tracer and the extension. Users often choose this route because they can disable CloudWatch logs and save a substantial amount of money.

You can make many performance tradeoffs, including manually instrumenting your code, using the Datadog Lambda Forwarder (which replaces the extension entirely), or even using X-Ray (which we can merge).
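As an illustration of the manual route, here's a sketch under the assumption that you add datadog-lambda-js as a regular dependency and wrap the handler yourself instead of relying on the layer's automatic handler redirection (adjust to your setup):

// Manual instrumentation sketch: wrap the handler with datadog-lambda-js
// yourself rather than letting DD_LAMBDA_HANDLER redirect to it.
const { datadog } = require("datadog-lambda-js");

module.exports.handler = datadog(async (event) => {
  return {
    statusCode: 200,
    body: JSON.stringify({ message: "Hello there!" }),
  };
});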

So what's happening during those 522 milliseconds?

> Something that stands out is the time between the serverless agent ready log and EXTENSION Name: datadog-agent... log. What is happening during those 522 milliseconds?

The Lambda Extensions API is designed to allow many things (which we rely on), but also allows folks to modify the runtime environment of the function itself (which we do not rely on).

This feature means that all extensions present must fully initialize before the function runtime and code even begin to be copied into the Lambda sandbox.

[Diagram: the Lambda init lifecycle, with extension init completing before runtime and function init]

Answer:

My understanding is that 522ms is Lambda copying the node runtime + your code into the sandbox, and performing static initialization.

To further illustrate this point, I added some debug points to the extension and created a demonstration application. I wrapped the code which calls /next in two log statements:

    log.Debugf("[extension] - calling /next")
    if response, err = client.Do(request); err != nil {
        return fmt.Errorf("WaitForNextInvocation: while GET next route: %v", err)
    }
    log.Debugf("[extension] - got response from /next")

And then outside of my handler function, I logged my function init:

'use strict';
console.log('[handler] initializing function code')

module.exports.hello = async (event) => {
  return {
    statusCode: 200,
    body: JSON.stringify(
      {
        message: 'Go Serverless v1.0! Your function executed successfully!',
        input: event,
      },
      null,
      2
    ),
  };
};

When I run the code, we see that the longest gap is between calling /next and the initialization of the handler code:

[Screenshot: logs showing the gap between the /next call and the handler initialization]

Finally, we receive a response from /next, which begins the invocation.

We're constantly working on improving the performance of our tracing libraries and extension code, and that includes working with our partners at AWS as well.

I hope I've helped! AJ

eliasbrange commented 10 months ago

Thanks for the response.

To gather more data, I added two more functions: one with the AppConfig extension and one with the Lambda Insights extension.

I haven't checked the code of those two (are they public?), but I assume they use the same registration API.

Since my four functions use the same runtime and handler code, I'd expect all three variants with extensions to behave similarly and have the ~500 ms delay. But I am seeing the following instead:

[Graph: average cold start RTT in ms for the four functions]

Those other extensions add around 100ms to the round-trip time, much less than the Datadog extension.

Could there be something more to it than "copying the node runtime + your code into the sandbox, and performing static initialization"?

astuyve commented 10 months ago

Hey folks! We narrowed this down both in this issue and on Slack with a bit of code sharing. The extra duration includes the time to load and instrument the handler function itself, which will vary compared with other Lambda extensions.
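If you want to eyeball the tracer-load part of that yourself, a rough probe (my own sketch, not something we ship) is to time the require at the top of the handler module, assuming the Datadog Node layer's modules are resolvable via NODE_PATH as usual:

// Rough probe: measure how much of the init phase is spent loading dd-trace.
const started = process.hrtime.bigint();
require("dd-trace");
const elapsedMs = Number(process.hrtime.bigint() - started) / 1e6;
console.log(`[probe] require("dd-trace") took ${elapsedMs.toFixed(1)} ms during init`);

module.exports.handler = async () => ({ statusCode: 200, body: "ok" });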

Feel free to reach out with any questions.

Thanks!

astuyve commented 3 months ago

Hey @eliasbrange, you should try version 62 and set DD_EXTENSION_VERSION to next. Let me know what you think!