Closed kurigashvili-godaddy closed 2 years ago
Thanks for reporting @kurigashvili-godaddy, that's definitely not the intended behavior.
Usually the fastest path to resolution on these sorts of issues is an isolated bit of code that reproduces the problem. Do you happen to have an isolated reproduction of this problem (GitHub repository, Dockerfile)? If not, could you share what these services are doing? i.e. what web framework you're using, what libraries you're using to query databases/datastores, any third party promise libraries you might be using, etc.
Basically any information that would help us reproduce the problem so we can diagnose what's going on.
I will try to create an isolated reproduction of this problem. The simplest service that had the problem: a CRUD service based on Express 4.17.1 talking to DynamoDB (aws-sdk 2.983.0).
@kurigashvili-godaddy Thanks. Some other questions that might help us isolate the problematic changes:
I tried to reproduce the issue locally but was not able to :(
- This issue only happened in prod, so it is a little tricky to be testing in prod.
Definitely understood.
I haven't reproduced with a trivial Express-using app (node v12.22.6, elastic-apm-node@3.24.0) loaded up with `hey -c 50 -q 10 -z 180m http://localhost:3000/`. Obviously this isn't representative:

```js
const apm = require('elastic-apm-node').start({
  serviceName: 'trentm-leakplay'
})
const express = require('express')

const app = express()
const port = 3000

function doAThing (opts, cb) {
  setTimeout(cb, Math.random() * 10)
}

app.get('/', (req, res) => {
  const s = apm.startSpan('thing')
  doAThing({}, function () {
    s.end()
    res.send('pong')
  })
})

app.listen(port, () => {
  console.log(`listening at http://localhost:${port}`)
})
```
@trentm we have the same issue with this version on particular services, most of the services are doing well, I will provide the report a bit later
@vkrivtsov-stuzo I would love to see details on your experiences here. Anything to isolate common dependencies or exact APM agent versions that could be causing the leak.
Hello, we also have some strange memory issues recently under load. Do you guys have any update on that? We are also using DynamoDB in many of our services.
@kurigashvili-godaddy Maybe one of the differences between your local test and your production is that you have more errors logged in production?
@trentm From a heap dump in production I see a lot of memory being retained by `AsyncHooksRunContextManager` (the biggest Map is also related to that).
@Tirke I'd love to get some details from you to help track this down, or for me to be able to repro. If you are able to share heap dumps for the versions below, I would very much appreciate it, but I understand if you cannot. So far we have: `AsyncHooksRunContextManager` (introduced along with other significant changes, so is a likely leak cause candidate) and `@elastic/elasticsearch@8` (… so unlikely to be a leak cause).

@Tirke Another question: with `asyncHooks: false` (which will take `AsyncHooksRunContextManager` out of the picture), does that change the memory leak behaviour?

@trentm
As we have prod secrets in our dumps, I'm not sure sending them is a possibility but I'll see what we can do.
Here are two separate dump extracts from the same service, with a one-hour interval between the two. The first dump is 30 minutes after the service boots up, so the last one is 1h30 after first service boot. Unfortunately, after 1h30 this service crashes with OOM errors.
First screen 30 minutes after boot:
Second one 1h after the first:
I think that `AsyncHooksRunContextManager`, `Transaction` and `RunContext` are all coming from the lib and taking more and more retained size in just one hour.
Everything is running in ECS containers with 2 vCPU and 4 GB memory, in Node 16 Alpine containers. Node is running with `--max-old-space-size=3584`.
Our services' dependencies:

```json
"@aws-sdk/client-dynamodb": "^3.37.0",
"@aws-sdk/client-s3": "^3.37.0",
"@aws-sdk/client-secrets-manager": "^3.37.0",
"@aws-sdk/credential-provider-node": "^3.38.0",
"@aws-sdk/s3-request-presigner": "^3.37.0",
"@elastic/elasticsearch": "^7.15.0",
"@godaddy/terminus": "^4.9.0",
"@nestjs/axios": "^0.0.3",
"@nestjs/common": "^8.1.1",
"@nestjs/core": "^8.1.1",
"@nestjs/jwt": "^8.0.0",
"@nestjs/passport": "^8.0.1",
"@nestjs/platform-fastify": "^8.1.1",
"@nestjs/schedule": "^1.0.2",
"@nestjs/swagger": "^5.1.1",
"@nestjs/terminus": "^8.0.1",
"@prisma/client": "3.2.1",
"@types/bcrypt": "^5.0.0",
"algoliasearch": "^4.10.5",
"axios": "^0.23.0",
"bcrypt": "^5.0.1",
"chalk": "^4.1.2",
"class-transformer": "0.4.0",
"class-validator": "^0.13.1",
"cls-rtracer": "^2.6.0",
"crypto-random-string": "^3.3.1",
"dotenv": "^10.0.0",
"dotenv-flow": "^3.2.0",
"elastic-apm-node": "^3.26.0",
"email-deep-validator": "^3.3.0",
"fastify-cookie": "^5.3.1",
"fastify-passport": "^0.4.3",
"fastify-secure-session": "^3.0.0",
"fastify-swagger": "^4.12.4",
"fs-extra": "^10.0.0",
"jest-mock-extended": "^2.0.4",
"launchdarkly-node-server-sdk": "^6.2.0",
"lodash": "^4.17.21",
"logfmt": "^1.3.2",
"luxon": "^2.0.2",
"nest-router": "^1.0.9",
"nest-winston": "^1.6.1",
"newrelic": "^8.5.0",
"npm-config": "^0.2.0",
"passport": "^0.5.0",
"passport-jwt": "^4.0.0",
"passport-oauth2": "^1.6.1",
"pretty-bytes": "^5.6.0",
"qs": "^6.10.1",
"reflect-metadata": "^0.1.13",
"rimraf": "^3.0.2",
"rxjs": "^7.4.0",
"sqreen": "^1.64.2",
"uuid": "^8.3.2",
"winston": "^3.3.3"
```
We will deploy with `asyncHooks: false` soon, will keep you updated.
Hello, deployed with `asyncHooks: false` and `elastic-apm-node` 3.25.0 this morning.
The red line is the rolling update of our services. Very stable average memory usage since the change. The two usual suspects (the red and Earls Green lines) are no longer rapidly increasing their memory usage.
@Tirke @trentm I am facing the same problem when I upgraded the mongoose version to v6
This is the log:

```
<symbol>::"[object Object]
    at Transaction.createSpan (/home/app/node_modules/elastic-apm-node/lib/instrumentation/transaction.js:146:10)
    at Transaction.startSpan (/home/app/node_modules/elastic-apm-node/lib/instrumentation/transaction.js:120:21)
    at Instrumentation.startSpan (/home/app/node_modules/elastic-apm-node/lib/instrumentation/index.js:420:26)
    at Agent.startSpan (/home/app/node_modules/elastic-apm-node/lib/agent.js:172:42)
    at MongoClient.onStart (/home/app/node_modules/elastic-apm-node/lib/instrumentation/modules/mongodb.js:73:24)
    at MongoClient.emit (events.js:314:20)
    at MongoClient.EventEmitter.emit (domain.js:483:12)
    at Topology.<anonymous> (/home/app/node_modules/mongoose/node_modules/mongodb/lib/operations/connect.js:61:53)
    at Topology.emit (events.js:314:20)
    at Topology.EventEmitter.emit (domain.js:483:12)
    at Server.<anonymous> (/home/app/node_modules/mongoose/node_modules/mongodb/lib/sdam/topology.js:558:42)
    at Server.emit (events.js:314:20)
    at Server.Eve"
```
Can you share the config? I don't know where to add this `asyncHooks: false` config.
Also, is it a global setting for Node.js?
@karankumarslicepay The `asyncHooks: false` setting is referring to this Elastic Node.js APM agent config var. To be clear, this is a workaround for this issue. Setting `asyncHooks: false` changes the APM agent's internal tracking of async tasks in Node.js to a limited technique that cannot track parent-child relationships between spans in some situations.
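For reference, a minimal sketch of where the option goes (the service name here is hypothetical). It is per-agent config passed to `.start()`, not a global Node.js setting:

```javascript
// Hypothetical app entry point; asyncHooks goes in the same options
// object as any other elastic-apm-node agent config.
const apm = require('elastic-apm-node').start({
  serviceName: 'my-service', // hypothetical name
  asyncHooks: false          // the workaround discussed in this thread
})
```

The same option can typically also be set through the agent's environment-variable config (each start option has an `ELASTIC_APM_*` counterpart), which avoids a code change when toggling the workaround.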
I am facing the same problem when I upgraded the mongoose version to v6
I think there could be something else going on for your case, or at least some issues specific to your case. Would you be willing to open a separate issue so we can discuss there? Thanks.
Some notes and requests for help for those following this (I see a number of people ❤️'d @Tirke's helpful comment showing that the `asyncHooks: false` workaround helps):
Internal details that you can skip if not interested: Version 3.24.0 of the Node.js APM agent introduced a significant change in how the agent tracks async tasks. Part of the change involves the agent having a "run context" object for each async task -- the things that are put on the Node.js event loop. Those run context objects have a reference to the transaction and spans that a given async task is part of (e.g. the run context for an HTTP request handler has a reference to the APM transaction that collects the trace data for that HTTP request/response).
When a transaction ends (e.g. the HTTP response has been sent), the transaction object stays in memory until all the associated async tasks complete. Two things can happen in an application here that will both result in increased memory usage:
For example, `doSomeTask` in the following lives for 10s after the HTTP response has ended:

```js
const apm = require('elastic-apm-node').start()
const express = require('express')

const app = express()
const port = 3000

// Some async task that takes 10s to complete in the background.
async function doSomeTask () {
  return new Promise((resolve) => {
    setTimeout(resolve, 10000)
  })
}

app.get('/ping', (req, res) => {
  doSomeTask()
  res.send('pong')
})

app.listen(port, () => {
  console.log(`listening at http://localhost:${port}`)
})
```
If the application is under constant load, then this results in higher memory usage. The APM agent exacerbates a situation like this because the APM transaction objects can be large (for large requests) compared to a small Promise.
You can watch the number of live Promise async resources with a small `async_hooks` counter:

```js
const promises = new Set()
require('async_hooks').createHook({
  init: (asyncId, type) => {
    if (type === 'PROMISE') {
      promises.add(asyncId)
    }
  },
  destroy: (asyncId) => {
    promises.delete(asyncId)
  }
}).enable()

setInterval(() => {
  console.log(`Promises in memory: ${promises.size}`)
}, 5000)
```
@trentm Happy holidays! I recently added APM to our Node.js application and instantly had memory troubles. After downgrading to v3.23.0 and setting `asyncHooks: false`, the issues are no longer present. I don't know if asyncHooks has any effect on versions prior to 3.24.0?
@vemundeldegard Hi! If you are able to try with v3.23.0 and `asyncHooks: true` (the default value), then that would be helpful/useful information.
If my understanding of the issue is correct, then there should only be increased memory pressure with v3.24.0 or later of the agent and with `asyncHooks: true`.
All,
elastic-apm-node@3.27.0 was just released, which includes a fix (https://github.com/elastic/apm-agent-nodejs/pull/2528) for potential high-memory use in versions 3.24.0 - 3.26.0 of the agent. I hope this resolves the issues that all or most of you were seeing, but I cannot be sure.
If/when you are able to try v3.27.0 -- and without the `asyncHooks: false` workaround -- I would love to hear whether or not this fixes the high memory usage you were seeing. If not, then if you are able, I'd like to work with you to try to understand what you are seeing.
The quick summary of my understanding of the issue, if you are interested, is:
Hi @trentm, I'm working with @Tirke and we've just released an update on our stack with the latest patch of APM (3.27.0), without using the asyncHooks workaround, and so far so good.
Thanks a lot for your support.
All, I'm going to close this now. My hope is that this resolves the issue for everyone that posted about it. I'd still very much welcome confirmation (like from @KnutFr above, thanks!) one way or the other. If it doesn't resolve the memory-usage issue for you, please feel free to open a new issue so we can dig into it (or re-open this one).
Thanks @trentm, just recently we upgraded to 3.27.0 in one of our applications and no longer see the memory issues! Thanks again. ❤️
Describe the bug
We have upgraded all our team's services (10) to 3.24.0 and all of them started running out of memory.
All of them are running in K8s. It would take between 30 minutes and 3 hours for a pod to run out of memory and restart.
We downgraded to 3.20.0 and the issue disappeared.
We were not seeing the issue in a staging environment, so I believe the issue can only be noticed under higher loads.
To Reproduce
Steps to reproduce the behavior: run an HTTP service under high load and observe that memory slowly increases until eventually the node process dies with a "JavaScript heap out of memory" error.
Expected behavior
Does not run out of memory, like version 3.20.0.
Environment (please complete the following information)
How are you starting the agent? (please tick one of the boxes)
- [ ] Calling `agent.start()` directly (e.g. `require('elastic-apm-node').start(...)`)
- [ ] Requiring `elastic-apm-node/start` from within the source code
- [ ] Starting node with `-r elastic-apm-node/start`
Additional context
Agent config options:

```
replace this line with your agent config options
```

`package.json` dependencies:

```
replace this line with your dependencies section from package.json
```