DataDog / dd-trace-js

JavaScript APM Tracer
https://docs.datadoghq.com/tracing/

"Error: socket hang up" in Lambda #3670

Open Max101 opened 12 months ago

Max101 commented 12 months ago

Hi, I can see this issue has popped up a few times in the past, but it seems like it's been resolved, so I am opening a new issue.

We are experiencing many "Error: socket hang up" errors in traces, but not in logs. Our Lambda finishes successfully and there are no errors in the logs; however, the issue is very visible in APM. We have thousands of similar errors across most of our services.

[Screenshot: "Error: socket hang up" errors shown in APM]

We analyzed our code and really cannot find an issue. Additionally, if this were an issue in our code, it would break, wouldn't it?

We are on Lambda using the NodeJS nodejs16.x runtime. The installed library version is dd-trace@4.4.0, and the installed DD construct is "datadog-cdk-constructs-v2": "1.7.4".

We are using SST v2 (Serverless Stack) to deploy our Lambda code.

Our DD config looks like this:

const dd = new Datadog(stack, `${stack.stackName}-datadog`, {
    nodeLayerVersion: 91, // Releases: https://github.com/DataDog/datadog-lambda-js/releases
    addLayers: true,
    extensionLayerVersion: 43, // Releases: https://github.com/DataDog/datadog-lambda-extension/releases
    captureLambdaPayload: true,
    enableColdStartTracing: true,
    apiKey: process.env.DATADOG_API_KEY,
    site: 'datadoghq.com',
    enableDatadogTracing: true,
    enableDatadogLogs: true,
    injectLogContext: true,
    env: process.env.NODE_ENV,
    service: stack.stackName,
    version: getDeploymentId(),
  } satisfies DatadogPropsV2);
Harmonickey commented 10 months ago

We're seeing this too, but we're executing in a normal NodeJS context outside of Lambda, so perhaps it's more widespread. In our case, however, it also throws an exception to the caller, which crashes the executing context.

We are using NodeJS 20.9.0; the installed library version is dd-trace@4.19.0.

Error: socket hang up
    at connResetException (node:internal/errors:721:14)
    at TLSSocket.socketOnEnd (node:_http_client:519:23) 
    at TLSSocket.emit (node:events:526:35) 
    at TLSSocket.emit (node:domain:488:12) 
    at TLSSocket.emit (/app/node_modules/dd-trace/packages/datadog-instrumentations/src/net.js:61:25) 
    at endReadableNT (node:internal/streams/readable:1408:12) 
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)
tlhunter commented 9 months ago

@Harmonickey do you know what led to the exception? For example, an outgoing request?

viict commented 8 months ago

We are also suffering from this issue, although we are not hitting it only on Lambdas. Most of the time it is related to DynamoDB calls.

We are using dd-trace v5.1.0.

Error: socket hang up
    at connResetException (node:internal/errors:720:14)
    at TLSSocket.socketOnEnd (node:_http_client:525:23)
    at TLSSocket.emit (node:events:529:35)
    at TLSSocket.emit (/usr/src/api/node_modules/dd-trace/packages/datadog-instrumentations/src/net.js:69:25)
    at endReadableNT (node:internal/streams/readable:1400:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)
luxion-lennart commented 8 months ago

We are also seeing this issue; we see it from a Lambda doing POST calls to DynamoDB.

Error: socket hang up
    at connResetException (node:internal/errors:720:14)
    at TLSSocket.socketCloseListener (node:_http_client:474:25)
    at TLSSocket.emit (node:events:529:35)
    at TLSSocket.emit (/opt/nodejs/node_modules/dd-trace/packages/datadog-instrumentations/src/net.js:61:25)
    at node:net:350:12
    at TCP.done (node:_tls_wrap:614:7)
    at TCP.callbackTrampoline (node:internal/async_hooks:130:17)

CDK construct: v1.8.0; extension version: 48

carmargut commented 8 months ago

Same issue here. Not from a Lambda, just when doing DynamoDB calls. Using dd-trace 5.2.0.

Error: socket hang up
    at connResetException (node:internal/errors:720:14)
    at TLSSocket.socketOnEnd (node:_http_client:525:23)
    at TLSSocket.emit (node:events:529:35)
    at TLSSocket.emit (/app/node_modules/dd-trace/packages/datadog-instrumentations/src/net.js:61:25)
    at endReadableNT (node:internal/streams/readable:1400:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)
IgnacioFerreras commented 8 months ago

Hi, we are suffering from it too, using 3.33.0.

Error: socket hang up
    at connResetException (node:internal/errors:705:14)
    at TLSSocket.socketCloseListener (node:_http_client:467:25)
    at TLSSocket.emit (node:events:525:35)
    at TLSSocket.emit (/var/task/node_modules/dd-trace/packages/datadog-instrumentations/src/net.js:61:25)
    at node:net:301:12
    at TCP.done (node:_tls_wrap:588:7)
    at TCP.callbackTrampoline (node:internal/async_hooks:130:17)

astuyve commented 8 months ago

Are these all from the AWS SDK, and all doing DynamoDB calls? Which version of the aws-sdk is everyone using?

viict commented 8 months ago

@astuyve here we have 2 different versions:

    "@aws-sdk/client-dynamodb": "^3.387.0",
    "@aws-sdk/lib-dynamodb": "^3.387.0",
    "@aws-sdk/smithy-client": "^3.374.0",

and

   "@aws-sdk/client-dynamodb": "=3.40.0",
    "@aws-sdk/lib-dynamodb": "=3.40.0",
    "@aws-sdk/smithy-client": "=3.40.0",
luxion-lennart commented 8 months ago

@astuyve We use version 3.362.0, which is provided by the Lambda nodejs runtime.

carmargut commented 8 months ago

@astuyve Here you have:

"@aws-sdk/client-dynamodb": "3.474.0",
"@aws-sdk/util-dynamodb": "3.474.0",
astuyve commented 8 months ago

So far everyone is using the v3 SDK; has anyone reproduced this with v2?

viict commented 7 months ago

@astuyve can we do something for v3 in the meantime, while no one with v2 answers here? 🙏🏻

astuyve commented 7 months ago

Hi @viict - I'm not sure there's something specific we can do right now. I was hoping someone could replicate this with AWS SDK v2, or demonstrate definitively that dd-trace is causing the issue.

Instead, it seems that dd-trace is recording that the TCP connection was closed by the server without a response. I noticed other users reporting the same issue. The aws-sdk author also closed this issue as something that can happen.

I could certainly be wrong here, but I'm still not sure what exactly we'd change in this project at this time.

Does anyone have a minimally reproducible example? Does removing dd-trace solve this definitively? Does this impact application code, or is it successful on retries?

Thanks!
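
A hypothetical starting point for such a minimal repro (an illustrative sketch, not code from this thread): a local server that closes connections without responding produces the same "socket hang up" on the client, and toggling the dd-trace require shows whether the tracer changes the behaviour or only records it.

// Sketch: a server that closes connections without responding yields
// "socket hang up" on the client. To compare behaviour with the tracer
// loaded, uncomment the next line (dd-trace must be required before 'http').
// require('dd-trace').init();
const http = require('http');

const server = http.createServer((req) => {
  // Destroy the connection before sending any response.
  req.socket.destroy();
});

server.listen(0, () => {
  const { port } = server.address();
  http.get({ host: '127.0.0.1', port }, () => {})
    .on('error', (err) => {
      console.error(err.message); // "socket hang up"
      server.close();
    });
});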

viict commented 7 months ago

@astuyve Oh, I understand that, of course. I'll see what I can do and will share here if I'm able to answer any of these questions.

Harmonickey commented 7 months ago

@Harmonickey do you know what led to the exception? For example, an outgoing request?

It was an outgoing request from the dd-trace library to DataDog sending an 'info' message.

Here is my initial configuration in case that helps.

const { createLogger, format, transports } = require('winston');

const httpTransportOptions = {
    host: 'http-intake.logs.datadoghq.com',
    path: `/v1/input/${environment.datadog.apiKey}?ddsource=nodejs&service=${service}`
        + `&env=${environment.name}&envType=${isWorkerEnv ? 'work' : 'web'}`,
    ssl: true,
};

const logger = createLogger({
    level: 'info',
    exitOnError: false,
    format: format.json(),
    transports: [
        new transports.Http(httpTransportOptions),
    ],
});

Then, at runtime, calling logger.info('some string message') is when it threw the exception. The message is a static string, and it does not always throw.

Because I haven't seen this error in a while, I suspect it was due to the Datadog intake servers being overloaded, so the connection wasn't answered quickly enough and threw the socket hang up error. Perhaps Datadog has fixed it since then and improved their response times.
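
For the crash path described here, a small sketch of one possible mitigation, assuming the winston logger configured above (the handler below is an assumption, not something reported in this thread): winston loggers emit an 'error' event when a transport fails, so handling it keeps a failed intake request from surfacing as an uncaught exception.

// Sketch (assumes the winston logger defined above): handle transport
// failures so an intake "socket hang up" does not crash the process.
logger.on('error', (err) => {
  // Downgrade a failed log shipment to a local console warning.
  console.warn('Datadog HTTP transport error:', err.message);
});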

atif-saddique-deel commented 5 months ago

@tlhunter any updates here? We are getting a lot of socket hang up errors recently; we are using version 4.34.0 of dd-trace.

[HPM] ECONNRESET: Error: socket hang up
    at connResetException (node:internal/errors:720:14)
    at Socket.socketCloseListener (node:_http_client:474:25)
    at Socket.emit (node:events:529:35)
    at Socket.emit (node:domain:552:15)
    at Socket.emit (/usr/src/app/node_modules/@letsdeel/init/node_modules/dd-trace/packages/datadog-instrumentations/src/net.js:69:25)
    at TCP.<anonymous> (node:net:350:12)
    at TCP.callbackTrampoline (node:internal/async_hooks:128:17) {
  code: 'ECONNRESET'
}
antamb commented 5 months ago

We are also experiencing this issue using: "dd-trace": "^5.6.0"

Error: socket hang up
    at Socket.socketOnEnd (node:_http_client:524:23)
    at Socket.emit (node:events:530:35)
    at Socket.emit (/opt/nodejs/node_modules/dd-trace/packages/datadog-instrumentations/src/net.js:69:25)
    at endReadableNT (node:internal/streams/readable:1696:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)
atif-saddique-deel commented 4 months ago

We are having the same issue with the latest version of dd-trace, v4.36.0.

dvictory commented 4 months ago

Did you switch from Node 18 to Node 20? In Node 19 they changed the keep-alive default (https://nodejs.org/en/blog/announcements/v19-release-announce#https11-keepalive-by-default), leading to a number of issues, some outlined here: https://github.com/nodejs/node/issues/47130

We see this around calls to AWS services (SNS, SQS, etc.), and they all self-heal with the SDK retry logic. What is unclear to me is whether this is an error from dd-trace, or whether dd-trace is just logging the issue from the AWS call. (A keep-alive experiment sketch follows the screenshot below.)

Error: socket hang up
    at TLSSocket.socketOnEnd (node:_http_client:524:23)
    at TLSSocket.emit (node:events:530:35)
    at TLSSocket.emit (/opt/nodejs/node_modules/dd-trace/packages/datadog-instrumentations/src/net.js:69:25)
    at endReadableNT (node:internal/streams/readable:1696:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)

Here is the info tab from this same raw error:

[Screenshot (2024-05-19): info tab for the same raw error]
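
One way to test the keep-alive theory above, assuming the AWS SDK v3 clients mentioned in this thread (the client, table, and key names below are placeholders): pin the client's request handler to an explicit https.Agent and toggle keepAlive to see whether the Node 19+ default correlates with the errors. This is an experiment sketch, not a confirmed fix.

// Sketch: explicitly control keep-alive on an AWS SDK v3 client.
// On older SDK versions the handler lives in '@aws-sdk/node-http-handler'
// rather than '@smithy/node-http-handler'.
const https = require('https');
const { NodeHttpHandler } = require('@smithy/node-http-handler');
const { DynamoDBClient, GetItemCommand } = require('@aws-sdk/client-dynamodb');

const client = new DynamoDBClient({
  requestHandler: new NodeHttpHandler({
    // Flip keepAlive between true and false to compare; Node >= 19
    // defaults the global agents to keepAlive: true.
    httpsAgent: new https.Agent({ keepAlive: false }),
  }),
});

// Placeholder call; the SDK's retry logic still applies as usual.
client.send(new GetItemCommand({
  TableName: 'example-table',
  Key: { pk: { S: 'example' } },
})).catch((err) => console.error(err));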
saibotsivad commented 4 months ago

@astuyve we are experiencing the same problem, but it is not related to an AWS SDK issue; I've been able to track it down to a timeout on an API call.

We are using axios for requests, so the package.json file has:

{
  "dependencies": {
    // ...
    "axios": "^1.6.7",
    // ...
    "datadog-lambda-js": "^7.96.0",
    "dd-trace": "^4.26.0",
    // ...
    "serverless": "^3.38.0",
    // ...
  },
  "devDependencies": {
    // ...
    "serverless-plugin-datadog": "^5.56.0",
    // ...
  },
  // ...
}

We deploy with serverless which has:

# ...
frameworkVersion: '3'
plugins:
  - serverless-plugin-datadog
provider:
  name: aws
  architecture: arm64
  runtime: nodejs16.x
custom:
  version: '1'
  datadog:
    addExtension: true
    apiKey: ${env:DD_API_KEY, ''}
    service: public-charging-api
    env: ${opt:stage}
    version: ${env:DD_VERSION, ''}
    enableDDTracing: true
# ...

We have some API call that uses axios in a pretty normal way, like this:

const response: AxiosResponse = await axios.request({
  method: 'GET',
  url,
  headers: { authorization },
  timeout: 20000,
});

(That's wrapped in a try/catch, so we know exactly what we are logging in any case.)

Functionally, we have a Lambda that makes ~50 HTTP requests in a very short amount of time, and sometimes a dozen of them take too long to resolve, so in that Lambda execution we time out those requests.

For every request that is aborted by axios due to timeout, we are getting this "Error: socket hang up" log.
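
A hypothetical sketch of the application-side view, consistent with the report above (the function, URL handling, and logging below are placeholders, not code from this thread): axios reports the aborted request as ECONNABORTED, while the torn-down socket underneath is what can show up in the trace as "socket hang up".

// Sketch: an axios call that hits its timeout surfaces ECONNABORTED to the
// application, even though the underlying socket is torn down.
const axios = require('axios');

async function fetchWithTimeout(url, authorization) {
  try {
    const response = await axios.request({
      method: 'GET',
      url,
      headers: { authorization },
      timeout: 20000, // same timeout as the call above
    });
    return response.data;
  } catch (err) {
    if (err.code === 'ECONNABORTED') {
      // The request was aborted by axios after 20s; treat it as a timeout.
      console.warn(`Request to ${url} timed out`);
      return null;
    }
    throw err;
  }
}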

[Screenshots: the error and its stack frames as shown in Datadog]

The "third party frames" makes me suspect that it's the DataDog layer adding these.

astuyve commented 4 months ago

Thanks Tobias!! that's a great clue, @tlhunter any thoughts here?

rockymadden commented 1 month ago

I can confirm @saibotsivad's observations as well.

chris-sidestep commented 2 days ago

We are getting this same issue with EventBridge calls on Node 18 Lambdas. The Lambdas execute with no issues, but dd-trace throws up the same 'socket hang up' error in our traces.

gregoryorton-ws commented 2 days ago

We're getting this error running in EKS with dd-trace v5.12.0, and it's causing our health checks to fail because requests take more than 3 seconds to finish. The root cause is a delay before this socket hang up.

[Screenshot (2024-09-27): trace showing the delay before the socket hang up]