DataDog / datadog-lambda-js

The Datadog AWS Lambda Library for Node
Apache License 2.0

`bind EMFILE 0.0.0.0` errors with Node 20 #452

Closed: kimmoahokas closed this issue 8 months ago

kimmoahokas commented 9 months ago

Expected Behavior

Function works the same as with Node 18

Actual Behavior

Function error rate is about 1-2%, with the following error in the logs (newlines added for easier reading):

2023-12-12T13:22:28.894Z
4f6bb64f-d04b-48ca-a6c3-a55abc6e36c8
ERROR
[dd.trace_id=364466785825974076 dd.span_id=364466785825974076]
Uncaught Exception 
{
  "errorType": "Error",
  "errorMessage": "bind EMFILE 0.0.0.0",
  "code": "EMFILE",
  "errno": -24,
  "syscall": "bind",
  "address": "0.0.0.0",
  "stack": [
    "Error: bind EMFILE 0.0.0.0",
    "    at node:dgram:363:20",
    "    at AsyncResource.runInAsyncScope (node:async_hooks:206:9)",
    "    at bound (node:async_hooks:238:16)",
    "    at /opt/nodejs/node_modules/dd-trace/packages/datadog-instrumentations/src/dns.js:79:12",
    "    at AsyncResource.runInAsyncScope (node:async_hooks:206:9)",
    "    at bound (node:async_hooks:238:16)",
    "    at process.processTicksAndRejections (node:internal/process/task_queues:83:21)"
  ]
}

Steps to Reproduce the Problem

I don't have isolated reproduction steps for this, but so far we have seen it with every lambda function we have tried to update to Node 20. It seems to happen at least with lambdas that use AWS SDK v3 or that make HTTPS requests with got v13. We are currently trying to implement a minimal reproduction case; the functions we see failing are roughly the shape of the sketch below.
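
A sketch of the kind of handler we are testing (placeholder URL, not one of our actual functions), deployed on the nodejs20.x runtime with the Datadog layers attached. A single outgoing HTTPS request is enough to go through dns.lookup, which is what the dd-trace dns instrumentation in the stack trace wraps:

```js
// handler.js -- minimal repro candidate; the URL is a placeholder.
const https = require("https");

exports.handler = async () => {
  // One outgoing HTTPS request per invocation; the name resolution it
  // triggers passes through dd-trace's dns instrumentation.
  const body = await new Promise((resolve, reject) => {
    https
      .get("https://example.com", (res) => {
        let data = "";
        res.on("data", (chunk) => (data += chunk));
        res.on("end", () => resolve(data));
      })
      .on("error", reject);
  });
  return { statusCode: 200, bodyLength: body.length };
};
```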

jokalli2 commented 9 months ago

We can see the issue (increasing fd_use in the Lambda Insights metrics) even with a trivial lambda that makes no outgoing HTTP requests itself; the sketch below shows how we watch the count.
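
Without Lambda Insights, the same number can be watched from inside the handler. On Linux (which includes the Lambda runtime) each entry in /proc/self/fd is one open descriptor of the current process, so a minimal sketch like this logs the live FD count per invocation:

```js
const fs = require("fs");

// Each entry in /proc/self/fd is one open file descriptor of this process,
// so the directory's length is the current FD count.
function openFdCount() {
  return fs.readdirSync("/proc/self/fd").length;
}

// Called at the end of every invocation, this shows whether the count
// climbs toward the 1024 per-process limit across warm invocations.
console.log("open fds:", openFdCount());
```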

jokalli2 commented 9 months ago

We have now verified that after removing the Datadog Lambda layers from an affected lambda, the fd_use metric drops significantly (from hitting the service limit of 1024 to about 20).

Is there anything we can still do to investigate the issue? Can we disable some parts of the instrumentation to narrow down the problem, for example along the lines of the sketch below?
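
For instance, dd-trace-js exposes per-integration toggles through tracer.use() when the tracer is initialized by hand. A sketch of what we have in mind, assuming manual initialization rather than the layer's automatic setup (worth verifying against the dd-trace-js docs):

```js
// Sketch assuming the tracer is initialized manually instead of by the
// Lambda layer. tracer.use() toggles individual integrations; "dns" is
// the instrumentation that appears in the EMFILE stack trace above.
const tracer = require("dd-trace").init();
tracer.use("dns", { enabled: false });

module.exports = tracer;
```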

astuyve commented 9 months ago

Hey folks, thanks for this report. I have managed to reproduce it sporadically, but it'd be great if we could narrow it down to a specific layer version.

I've only managed to reproduce this in a Kinesis writer service using the v3 SDK (about 0.5% of requests).

Does this occur on Node 18? With aws-sdk v2? I'm asking about the SDK because a cursory Google search shows this error occurring historically in that project.

Thanks, stay tuned!

kimmoahokas commented 9 months ago

For us it does not happen on Node 18 at all. On Node 20 it happens about 0.5% of the time, even without the AWS SDK.

jokalli2 commented 9 months ago

Thanks for the response! And by the way, very nice talk at re:Invent. We are not using aws-sdk v2 any more, so we haven't tested that version. But this happens for all our lambdas that we have updated to Node 20; most are using aws-sdk v3.

We also verified that it happens on both x86 and ARM architectures. So far we haven't hit the 1024 fd limit with Node 18 in any repo.

astuyve commented 9 months ago

We've identified the issue and are working on a fix, although I haven't the foggiest idea why it doesn't fail in earlier versions of Node.
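
For anyone curious about the failure mode itself: the error in the stack trace is what Node raises when a UDP socket is bound after the process has run out of file descriptors. A standalone sketch that reproduces the same message (nothing Datadog-specific, just the exhaustion pattern):

```js
const dgram = require("dgram");

// Bind UDP sockets without ever closing them. Once the process hits its
// FD limit (1024 in the Lambda environment), bind() fails with the same
// "Error: bind EMFILE 0.0.0.0" reported in this issue.
const sockets = [];
function leakOne() {
  const socket = dgram.createSocket("udp4");
  socket.on("error", (err) => {
    console.error(err.message); // "bind EMFILE 0.0.0.0"
    process.exit(1);
  });
  socket.bind(0, () => {
    sockets.push(socket); // never closed: one leaked FD per iteration
    setImmediate(leakOne);
  });
}
leakOne();
```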

I expect it to ship early next week, and I'll close this ticket when the release is successful.

Thank you everyone for the patience!

astuyve commented 8 months ago

We've merged the fix and moved this along to our internal test service. I intend to let it stress-test for several hours before releasing, and will update this ticket when that occurs.

Thanks!

astuyve commented 8 months ago

Hey folks, this should be fixed in v103. We still need to tie up a few loose ends, but the major cause of FD leaks should be fixed there.

Thanks!

jokalli2 commented 8 months ago

Thanks! We can confirm that fd usage is now stable.

astuyve commented 8 months ago

Loose ends tied up here: https://github.com/DataDog/datadog-lambda-js/pull/456