Closed kimmoahokas closed 8 months ago
We can see the issue (increasing fd_use in Lambda Insights metrics) even with a trivial lambda that makes no outgoing HTTP requests itself.
We have now verified that after removing the Datadog lambda layers from the affected lambda, the fd_use metric drops significantly (from hitting the service limit of 1024 to about 20).
Is there anything we can still do to investigate the issue? Can we disable some parts of the instrumentation to narrow down the problem?
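One way to narrow this down from inside the function is to log the process's own descriptor count on every invocation. The following is a minimal sketch, not something taken from the Datadog layer: `countOpenFds` and `withFdLogging` are hypothetical helper names, and it assumes a Linux runtime where `/proc/self/fd` is readable. It only counts descriptors held by the Node.js process itself, so the numbers may not match the fd_use metric exactly.

```javascript
const fs = require('node:fs');

// Hypothetical helper: count entries in the per-process fd directory.
// Assumes Linux; returns null when the directory cannot be read.
// Reading the directory briefly opens one descriptor itself, so the
// result can be one higher than the steady-state count.
function countOpenFds(fdDir = '/proc/self/fd') {
  try {
    return fs.readdirSync(fdDir).length;
  } catch (err) {
    return null;
  }
}

// Hypothetical wrapper: runs the handler, then logs the fd count as a
// JSON line, whether the handler resolved or threw.
function withFdLogging(handler, log = console.log) {
  return async (event, context) => {
    try {
      return await handler(event, context);
    } finally {
      log(JSON.stringify({ metric: 'open_fds', value: countOpenFds() }));
    }
  };
}

// Usage sketch: wrap a trivial handler. In a real function this would be
// assigned to exports.handler.
const handler = withFdLogging(async () => ({ statusCode: 200 }));
```

Comparing the logged values with and without the layers attached, on a warm container, should show whether the count grows per invocation.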
Hey folks, thanks for this report. I have managed to reproduce it sporadically, but it'd be great if we could narrow down any specific layer version.
I only managed to reproduce this in a Kinesis writer service, using the v3 SDK (about 0.5% of requests).
Does this occur for node18? For the aws-sdk v2? I'm asking about the SDK because a cursory google search shows this occurring historically in that project.
Thanks, stay tuned!
For us it does not happen on Node 18 at all. On Node 20 it happens about 0.5% of the time, even without the AWS SDK.
Thanks for the response! And by the way, very nice talk at re:Invent. We are not using aws-sdk v2 any more, so we haven't tested that version. But this happens for all our lambdas that we have updated to Node 20; most are using aws-sdk v3.
We also verified that it happens on both x86 and ARM architectures. So far we haven't hit the 1024 fd limit with Node 18 in any repo.
We've identified the issue and are working on a fix, although I haven't the foggiest idea why it doesn't fail in earlier versions of Node.
I expect it to ship early next week, and will close this ticket when the release is successful.
Thank you everyone for the patience!
We've merged the fix and moved this along to our internal test service. I intend to let it stress-test for several hours before releasing, and will update this ticket when that occurs.
Thanks!
Hey folks, this should be fixed in v103. We still need to tie up a few loose ends, but the major cause of FD leaks should be fixed here.
Thanks!
Thanks! We can verify that fd usage is now stable.
Loose ends tied up here: https://github.com/DataDog/datadog-lambda-js/pull/456
Expected Behavior
Function works the same as with Node 18
Actual Behavior
Function error rate is about 1-2%, with the following error in the logs (new lines added for easier reading)
Steps to Reproduce the Problem
I don't have isolated reproduction steps for this, but so far we have seen it with all the lambda functions we have tried to update to Node 20. It seems to happen at least with lambdas that use AWS SDK v3 or that make HTTPS requests with got v13. We are currently trying to implement a minimal reproduction case.
Specifications
arn:aws:lambda:eu-west-1:464622532012:layer:Datadog-Node20-x:102
arn:aws:lambda:eu-west-1:464622532012:layer:Datadog-Extension-ARM:51
Stacktrace