dbader / node-datadog-metrics

Buffered metrics reporting via the DataDog HTTP API
https://dbader.org/blog/monitoring-your-nodejs-app-with-datadog
MIT License

Connection error on auto-flush within serverless lambda #116

Closed alyanm closed 6 months ago

alyanm commented 6 months ago

I'm getting errors when calling datadog-metrics from within the AWS Lambda serverless environment. These errors seem very similar to the ones reported for dd-trace here.

This is the error I'm seeing:

2024-05-10T17:57:22.183Z    420bef8f-6b1c-445e-ab9e-49cafef3e9b7    ERROR   {"message":"[dd.trace_id=375211601928216273 dd.span_id=3466466051169990597] There was an error auto-flushing metrics: $1 $2","$1":{"message":"request to https://api.us3.datadoghq.com/api/v1/series failed, reason: socket hang up","type":"system","errno":"ECONNRESET","code":"ECONNRESET","name":"FetchError","stack":"FetchError: request to https://api.us3.datadoghq.com/api/v1/series failed, reason: socket hang up
    at ClientRequest.<anonymous> (/var/task/node_modules/cross-fetch/node_modules/node-fetch/lib/index.js:1501:11)
    at ClientRequest.emit (node:events:518:28)
    at req.emit (/opt/nodejs/node_modules/dd-trace/packages/datadog-instrumentations/src/http/client.js:107:25)
    at TLSSocket.socketOnEnd (node:_http_client:524:9)
    at TLSSocket.emit (node:events:530:35)
    at TLSSocket.emit (/opt/nodejs/node_modules/dd-trace/packages/datadog-instrumentations/src/net.js:69:25)
    at endReadableNT (node:internal/streams/readable:1696:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)"},"$2":{"o11y":{},"tags":[]},"message_id":"f40208d2-70ed-4623-8b00-088932780c6b"}

I've tried turning off auto-flush by setting the flush interval to 0. I also tried calling flush() manually every time I emit metrics.
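
Roughly what I tried looked like this (a sketch only; flushIntervalSeconds and the callback-style flush() are the names from this library's README, and the metric name is just an example):

    const metrics = require('datadog-metrics');

    // Disable the auto-flush timer entirely.
    metrics.init({
      apiKey: process.env.DATADOG_API_KEY,
      flushIntervalSeconds: 0
    });

    // ...emit metrics as usual...
    metrics.increment('webhooks.handled', 1);

    // Then flush explicitly instead of relying on auto-flush.
    metrics.flush(
      () => console.log('Metrics flushed'),
      (error) => console.error('Error flushing metrics:', error)
    );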

I continue to get the same errors.

I plan to work around the npm package and call the Datadog metrics API endpoint directly; any help with that would be appreciated!
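
Something along these lines is what I have in mind (just a sketch based on Datadog's public v1 series API docs; it assumes Node 18+ for the global fetch and a DATADOG_API_KEY environment variable):

    // Sketch: POST a single gauge point straight to the Datadog v1 series API.
    async function submitMetric(name, value, tags = []) {
      const response = await fetch('https://api.us3.datadoghq.com/api/v1/series', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'DD-API-KEY': process.env.DATADOG_API_KEY
        },
        body: JSON.stringify({
          series: [{
            metric: name,
            type: 'gauge',
            points: [[Math.floor(Date.now() / 1000), value]],
            tags
          }]
        })
      });
      if (!response.ok) {
        throw new Error(`Datadog API responded with ${response.status}`);
      }
    }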

Mr0grog commented 6 months ago

Are these errors you’re printing and formatting yourself using the onError option, or are you seeing them somewhere else?
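
(To be clear, by the onError option I mean the hook you can pass to init, roughly like this sketch; the exact name and behavior are described in the README:)

    metrics.init({
      apiKey: process.env.DATADOG_API_KEY,
      // Called when a flush fails, instead of the default error logging.
      onError(error) {
        console.error('There was an error auto-flushing metrics:', error);
      }
    });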

These do appear similar to the errors in that dd-trace issue, and dd-trace is at least involved here (although that may only be because it is instrumenting all the HTTP calls, and it could have nothing to do with the error itself). Are you intentionally using dd-trace as well as this library? (This package doesn’t use dd-trace and none of its dependencies should either; just trying to understand how that’s getting involved.)

If you want to work around the actual HTTP stuff here (which calls down through DataDog’s official implementation), you can provide a custom reporter, as documented in the “initialization” section of the README under the reporter option, where it says “Send metrics to a totally different service instead of Datadog”. That said, if there are still errors, you will probably see similar traces, since dd-trace is in there instrumenting all the low-level HTTP machinery in Node.js, and pretty much any HTTP library you use will be calling that same stuff under the hood.
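
A custom reporter is basically just an object with a report method that receives the buffered series; a rough sketch (check the README for the exact signature on the version you're using, since it has changed between releases) might look like:

    const metrics = require('datadog-metrics');

    // Sketch of a reporter that just logs the series instead of sending it to
    // Datadog. Swap the body for whatever HTTP client or service you want.
    class LoggingReporter {
      report(series, onSuccess, onError) {
        try {
          console.log('Would send series:', JSON.stringify(series));
          if (onSuccess) onSuccess();
          return Promise.resolve();
        } catch (error) {
          if (onError) onError(error);
          return Promise.reject(error);
        }
      }
    }

    metrics.init({ reporter: new LoggingReporter() });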

alyanm commented 6 months ago

@Mr0grog --

  1. Yes, these are logged via the onError option.
  2. We use dd-trace on the client side, however these errors are coming from the server side which does not have dd-trace installed.
  3. Interesting, I will look into that option, thanks.

Mr0grog commented 6 months ago

We use dd-trace on the client side, however these errors are coming from the server side which does not have dd-trace installed.

Hmm, is the stack trace in your issue description being generated client side, but showing an error that's coming from the server, then? (So: you're making a request to your server, and your server responds with an error because it failed to log metrics, and then the client is logging the error message from the server along with a stack from the client for where the error was received?)

Otherwise you've probably got some configuration or build issues, since dd-trace is showing up in your stack trace, which implies it is being used on the server where the error is happening.


Either way, are you able to put together some kind of simplified case I or one of the other maintainers here can test in Lambda ourselves?

alyanm commented 6 months ago

I tried calling the API directly to see if that would work around the issue, but I'm getting the same error on some of the calls. Not all calls throw the error; only some of my webhook handler lambdas seem to hit it.

I don't think dd-trace is in this stack:

2024-05-13T17:02:47.801Z    5e93a35a-8552-47fb-aff5-f23aebd39853    ERROR   {"message":"[dd.trace_id=3595417659619954508 dd.span_id=5007414011107405919] Error submitting datadog metrics:  $1 $2","$1":{"message":"request to https://api.us3.datadoghq.com/api/v1/series failed, reason: socket hang up","type":"system","errno":"ECONNRESET","code":"ECONNRESET","name":"FetchError","stack":"FetchError: request to https://api.us3.datadoghq.com/api/v1/series failed, reason: socket hang up
    at ClientRequest.<anonymous> (/var/task/node_modules/cross-fetch/node_modules/node-fetch/lib/index.js:1501:11)
    at ClientRequest.emit (node:events:518:28)
    at req.emit (/opt/nodejs/node_modules/dd-trace/packages/datadog-instrumentations/src/http/client.js:107:25)
    at TLSSocket.socketOnEnd (node:_http_client:524:9)
    at TLSSocket.emit (node:events:530:35)
    at TLSSocket.emit (/opt/nodejs/node_modules/dd-trace/packages/datadog-instrumentations/src/net.js:69:25)
    at endReadableNT (node:internal/streams/readable:1696:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)"},"$2":{"o11y":{},"tags":[]},"message_id":"1c6e6ea9-03ff-433b-84a3-7434066a1c8c"}

I'll see if I can develop a simplified case which fails consistently, though this may be a bit tricky.

Mr0grog commented 6 months ago

I don't think dd-trace is in this stack:

2024-05-13T17:02:47.801Z  5e93a35a-8552-47fb-aff5-f23aebd39853    ERROR   {"message":"[dd.trace_id=3595417659619954508 dd.span_id=5007414011107405919] Error submitting datadog metrics:  $1 $2","$1":{"message":"request to https://api.us3.datadoghq.com/api/v1/series failed, reason: socket hang up","type":"system","errno":"ECONNRESET","code":"ECONNRESET","name":"FetchError","stack":"FetchError: request to https://api.us3.datadoghq.com/api/v1/series failed, reason: socket hang up
    at ClientRequest.<anonymous> (/var/task/node_modules/cross-fetch/node_modules/node-fetch/lib/index.js:1501:11)
    at ClientRequest.emit (node:events:518:28)
    at req.emit (/opt/nodejs/node_modules/dd-trace/packages/datadog-instrumentations/src/http/client.js:107:25)
    at TLSSocket.socketOnEnd (node:_http_client:524:9)
    at TLSSocket.emit (node:events:530:35)
    at TLSSocket.emit (/opt/nodejs/node_modules/dd-trace/packages/datadog-instrumentations/src/net.js:69:25)
    at endReadableNT (node:internal/streams/readable:1696:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)"},"$2":{"o11y":{},"tags":[]},"message_id":"1c6e6ea9-03ff-433b-84a3-7434066a1c8c"}

dd-trace is in the 3rd and 6th frames from the top of the stack (wrapping emit on the request and on the socket).

I'll see if I can develop a simplified case which fails consistently

This would be super helpful.

If it’s intermittent, my first guess at causes would be one of two things:

  1. (Easier to address) Lambda might just be shutting down in the middle of your request, and you need to make sure to shut down gracefully by making it wait for the metrics flush to finish. I think you need to have at least one extension registered for your lambda, and then you can do something like:

    // You should get a SIGTERM signal in your code when lambda starts shutting down.
    process.on('SIGTERM', () => {
      console.info('SIGTERM received');
    
      // Actually flush and wait for the result:
      metrics.flush(
        () => { console.log('Metrics successfully flushed!'); process.exit(0); },
        (error) => { console.error(`Error flushing metrics: ${error}`); process.exit(1); }
      );
    });

    More info in this official example: https://github.com/aws-samples/graceful-shutdown-with-aws-lambda

  2. Firewall or routing/ingress/egress rules or security groups that are applying to some of the locations your lambdas are running but not others (e.g. applying to only some Availability Zones or Regions). That’s a problem I’ve seen in plenty of people’s AWS setups, although I have more experience in the particulars of how that affects things like ECS/EKS, Beanstalk, and EC2 than Lambda.