elastic / apm-agent-nodejs

https://www.elastic.co/guide/en/apm/agent/nodejs/current/index.html
BSD 2-Clause "Simplified" License

Memory leak in version > 3.20.0 #2489

Closed kurigashvili-godaddy closed 2 years ago

kurigashvili-godaddy commented 2 years ago

Describe the bug

We upgraded all ten of our team's services to 3.24.0, and all of them started running out of memory.

All of them are running in K8s. It would take between 30 minutes and 3 hours for a pod to run out of memory and restart.

We downgraded to 3.20.0 and the issue disappeared.

We were not seeing the issue in our staging environment, so I believe it only becomes noticeable under higher load.

To Reproduce

Steps to reproduce the behavior: run an HTTP service under high load and observe that memory slowly increases until the Node.js process eventually dies with a "JavaScript heap out of memory" error.

Expected behavior

Do not run out of memory, as with version 3.20.0.

Environment (please complete the following information)

How are you starting the agent? (please tick one of the boxes)

Additional context

astorm commented 2 years ago

Thanks for reporting @kurigashvili-godaddy, that's definitely not the intended behavior.

Usually the fastest path to resolution on these sorts of issues is an isolated bit of code that reproduces the problem. Do you happen to have an isolated reproduction of this problem (GitHub repository, Dockerfile)? If not, could you share what these services are doing? i.e. what web framework you're using, what libraries you're using to query databases/datastores, any third party promise libraries you might be using, etc.

Basically any information that would help us reproduce the problem so we can diagnose what's going on.

kurigashvili-godaddy commented 2 years ago

I will try to create an isolated reproduction of the problem. The simplest service that had the problem is a CRUD service based on Express 4.17.1 talking to DynamoDB (aws-sdk 2.983.0).
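
For reference, a minimal skeleton of that kind of service (Express plus the aws-sdk v2 DocumentClient) might look like the sketch below; the service name, region, table, and key are placeholders, and this is not the reporter's actual code.

// Require and start the agent before any other module is loaded.
const apm = require('elastic-apm-node').start({
  serviceName: 'repro-crud' // placeholder
})
const express = require('express')
const AWS = require('aws-sdk')

const app = express()
// Placeholder region; credentials come from the usual AWS environment/instance config.
const docClient = new AWS.DynamoDB.DocumentClient({ region: 'us-east-1' })

app.get('/items/:id', async (req, res) => {
  try {
    // Placeholder table and key; the point is only that each request does a DynamoDB round trip.
    const result = await docClient.get({ TableName: 'items', Key: { id: req.params.id } }).promise()
    res.json(result.Item || {})
  } catch (err) {
    res.status(500).json({ error: err.message })
  }
})

app.listen(3000, () => console.log('listening on :3000'))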

trentm commented 2 years ago

@kurigashvili-godaddy Thanks. Some other questions that might help us isolate the problematic changes:

kurigashvili-godaddy commented 2 years ago

I tried to reproduce the issue locally but was not able to :(

trentm commented 2 years ago
  • This issue only happened in prod, so it is a little tricky to test in prod.

Definitely understood.

I haven't reproduced with a trivial Express-using app (node v12.22.6, elastic-apm-node@3.24.0) loaded up with hey -c 50 -q 10 -z 180m http://localhost:3000/. Obviously this isn't representative.

const apm = require('elastic-apm-node').start({
    serviceName: 'trentm-leakplay'
})

const express = require('express')
const app = express()
const port = 3000

function doAThing(opts, cb) {
  // Simulate a small amount of async work (completes within ~10ms).
  setTimeout(cb, Math.random() * 10)
}

app.get('/', (req, res) => {
  const s = apm.startSpan('thing')
  doAThing({}, function () {
    s.end()
    res.send('pong')
  })
})

app.listen(port, () => {
  console.log(`listening at http://localhost:${port}`)
})

vkrivtsov-stuzo commented 2 years ago

@trentm we have the same issue with this version on particular services; most of the services are doing well. I will provide a report a bit later.

trentm commented 2 years ago

@vkrivtsov-stuzo I would love to see details on your experiences here. Anything that helps isolate common dependencies or the exact APM agent versions that could be causing the leak would be useful.

Tirke commented 2 years ago

Hello, we have also been having some strange memory issues under load recently. Do you have any update on this? We are also using DynamoDB in many of our services.

@kurigashvili-godaddy maybe one of the differences between your local test and your production is that you have more errors logged in production?

@trentm From a heap dump taken in production I see a lot of memory being retained by AsyncHooksRunContextManager (the biggest Map is also related to that): [screenshot]
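
(For anyone wanting to capture a comparable heap snapshot from a running Node.js service, one option is the built-in v8.writeHeapSnapshot(); the sketch below triggers it on SIGUSR2, which is an arbitrary choice of signal. The resulting .heapsnapshot file can be opened in the Memory tab of Chrome DevTools.)

// Sketch: write a heap snapshot when the process receives SIGUSR2.
// Note: writeHeapSnapshot() is synchronous and can temporarily require
// roughly as much memory as the heap itself, so use it carefully in prod.
const v8 = require('v8')

process.on('SIGUSR2', () => {
  const file = v8.writeHeapSnapshot() // writes a .heapsnapshot file into the working directory
  console.log(`heap snapshot written to ${file}`)
})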

trentm commented 2 years ago

@Tirke I'd love to get some details from you to help track this down, or for me to be able to repro.

trentm commented 2 years ago

@Tirke Another question:

Tirke commented 2 years ago

@trentm

As we have prod secrets in our dumps, I'm not sure sending them is a possibility but I'll see what we can do.

Here are two separate dump extracts from the same service, with a one-hour interval between them. The first dump was taken 30 minutes after the service booted, so the second is from 1h30 after boot. Unfortunately, after 1h30 this service crashes with OOM errors.

First screenshot, 30 minutes after boot: [heap dump screenshot]

Second one, 1h after the first: [heap dump screenshot]

I think that AsyncHooksRunContextManager, Transaction and RunContext all come from the lib, and their retained size keeps growing over just one hour.

Everything is running in ECS containers with 2 vCPU and 4 GB of memory, on Node 16 Alpine images. Node is started with --max-old-space-size=3584.

Our services dependencies:

"@aws-sdk/client-dynamodb": "^3.37.0",
"@aws-sdk/client-s3": "^3.37.0",
"@aws-sdk/client-secrets-manager": "^3.37.0",
"@aws-sdk/credential-provider-node": "^3.38.0",
"@aws-sdk/s3-request-presigner": "^3.37.0",
"@elastic/elasticsearch": "^7.15.0",
"@godaddy/terminus": "^4.9.0",
"@nestjs/axios": "^0.0.3",
"@nestjs/common": "^8.1.1",
"@nestjs/core": "^8.1.1",
"@nestjs/jwt": "^8.0.0",
"@nestjs/passport": "^8.0.1",
"@nestjs/platform-fastify": "^8.1.1",
"@nestjs/schedule": "^1.0.2",
"@nestjs/swagger": "^5.1.1",
"@nestjs/terminus": "^8.0.1",
"@prisma/client": "3.2.1",
"@types/bcrypt": "^5.0.0",
"algoliasearch": "^4.10.5",
"axios": "^0.23.0",
"bcrypt": "^5.0.1",
"chalk": "^4.1.2",
"class-transformer": "0.4.0",
"class-validator": "^0.13.1",
"cls-rtracer": "^2.6.0",
"crypto-random-string": "^3.3.1",
"dotenv": "^10.0.0",
"dotenv-flow": "^3.2.0",
"elastic-apm-node": "^3.26.0",
"email-deep-validator": "^3.3.0",
"fastify-cookie": "^5.3.1",
"fastify-passport": "^0.4.3",
"fastify-secure-session": "^3.0.0",
"fastify-swagger": "^4.12.4",
"fs-extra": "^10.0.0",
"jest-mock-extended": "^2.0.4",
"launchdarkly-node-server-sdk": "^6.2.0",
"lodash": "^4.17.21",
"logfmt": "^1.3.2",
"luxon": "^2.0.2",
"nest-router": "^1.0.9",
"nest-winston": "^1.6.1",
"newrelic": "^8.5.0",
"npm-config": "^0.2.0",        
"passport": "^0.5.0",
"passport-jwt": "^4.0.0",
"passport-oauth2": "^1.6.1",
"pretty-bytes": "^5.6.0",
"qs": "^6.10.1",
"reflect-metadata": "^0.1.13",
"rimraf": "^3.0.2",
"rxjs": "^7.4.0",
"sqreen": "^1.64.2",
"uuid": "^8.3.2",
"winston": "^3.3.3"

We will deploy with asyncHooks: false soon, will keep you updated.

Tirke commented 2 years ago

Hello, we deployed with asyncHooks: false and elastic-apm-node 3.25.0 this morning.

[screenshot: memory usage graph, 2021-12-16]

The red line is the rolling update of our services. Average memory usage has been very stable since the change. The two usual suspects (the red and Earls Green lines) are no longer rapidly increasing their memory usage.

karankumarslicepay commented 2 years ago

@Tirke @trentm I am facing the same problem when I upgraded the mongoose version to v6

This is the log (truncated):

[object Object]
    at Transaction.createSpan (/home/app/node_modules/elastic-apm-node/lib/instrumentation/transaction.js:146:10)
    at Transaction.startSpan (/home/app/node_modules/elastic-apm-node/lib/instrumentation/transaction.js:120:21)
    at Instrumentation.startSpan (/home/app/node_modules/elastic-apm-node/lib/instrumentation/index.js:420:26)
    at Agent.startSpan (/home/app/node_modules/elastic-apm-node/lib/agent.js:172:42)
    at MongoClient.onStart (/home/app/node_modules/elastic-apm-node/lib/instrumentation/modules/mongodb.js:73:24)
    at MongoClient.emit (events.js:314:20)
    at MongoClient.EventEmitter.emit (domain.js:483:12)
    at Topology.<anonymous> (/home/app/node_modules/mongoose/node_modules/mongodb/lib/operations/connect.js:61:53)
    at Topology.emit (events.js:314:20)
    at Topology.EventEmitter.emit (domain.js:483:12)
    at Server.<anonymous> (/home/app/node_modules/mongoose/node_modules/mongodb/lib/sdam/topology.js:558:42)
    at Server.emit (events.js:314:20)
    at Server.Eve...

Can you share the config? I don't know where to add the asyncHooks: false setting. Also, is it a global setting for Node.js?

trentm commented 2 years ago

@karankumarslicepay The asyncHooks: false setting is referring to this Elastic Node.js APM agent config var. To be clear, this is a workaround for this issue. Setting asyncHooks: false changes the APM agent's internal tracking of async tasks in Node.js to a limited technique that cannot track parent-child relationships in spans in some situations.
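
For anyone wondering where that setting goes, a minimal sketch, assuming the agent is started in code (the serviceName value is just a placeholder):

// The agent should be required and started before anything else in the app.
const apm = require('elastic-apm-node').start({
  serviceName: 'my-service', // placeholder
  asyncHooks: false // workaround: use the agent's more limited context-tracking technique
})

If the agent is configured via environment variables or a config file instead, the same option can be set there as well, per the agent's configuration docs.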

I am facing the same problem when I upgraded the mongoose version to v6

I think there could be something else going on for your case, or at least some issues specific to your case. Would you be willing to open a separate issue so we can discuss there? Thanks.

trentm commented 2 years ago

Some notes and requests for help for those following this (I see a number of people ❤️ 'd @Tirke's helpful comment showing that the asyncHooks: false workaround helps):

Some details on my current understanding/guesses

Internal details that you can skip if not interested: Version 3.24.0 of the Node.js APM agent introduced a significant change in how the agent tracks async tasks. Part of the change involves the agent having a "run context" object for each async task -- the things that are put on the Node.js event loop. Those run context objects have a reference to the transaction and spans that a given async task is part of (e.g. the run context for an HTTP request handler has a reference to the APM transaction that collects the trace data for that HTTP request/response).
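
To make that concrete, the sketch below illustrates the general idea (it is not the agent's actual code): a hook keeps a Map from each async task's id to its run context, and each run context can reference the active transaction. An entry is only removed when the corresponding async resource is destroyed, so anything that keeps async tasks alive also keeps their transactions reachable.

// Simplified illustration of async_hooks-based run-context tracking.
// This is NOT the agent's implementation, only the general idea.
const asyncHooks = require('async_hooks')

const runContexts = new Map() // asyncId -> run context (which references the active transaction)

asyncHooks.createHook({
  init (asyncId, type, triggerAsyncId) {
    // A newly created async task inherits the run context (and the
    // transaction reference it carries) of the task that created it.
    const parent = runContexts.get(triggerAsyncId)
    if (parent) {
      runContexts.set(asyncId, parent)
    }
  },
  destroy (asyncId) {
    // The reference is only dropped when the async resource is destroyed,
    // so long-lived or leaked async tasks keep their transaction in memory.
    runContexts.delete(asyncId)
  }
}).enable()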

When a transaction ends (e.g. the HTTP response has been sent), the transaction object stays in memory until all the associated async tasks complete. Two things can happen in an application here that will both result in increased memory usage:

  1. If some of those associated async tasks complete, but only complete long after the response has been sent, then the APM transaction object will stay in memory for that "long" time after the response. For example doSomeTask in the following lives for 10s after the HTTP response has ended:
const apm = require('elastic-apm-node').start()
const express = require('express')
const app = express()
const port = 3000

// Some async task that takes 10s to complete in the background.
async function doSomeTask () {
  return new Promise((resolve) => { 
    setTimeout(resolve, 10000) 
  })
}

app.get('/ping', (req, res) => {
  doSomeTask()
  res.send('pong')
})

app.listen(port, () => {
  console.log(`listening at http://localhost:${port}`)
})

If the application is under constant load, then this results in higher memory usage. The APM agent exacerbates a situation like this because the APM transaction objects can be large (for large requests) compared to a small Promise.

  2. If there is a leak of async tasks in the application (e.g. leaked Promises) then, with the APM agent, those can keep APM transactions alive in memory. By themselves Promises are very small, and an application could leak many of them without noticing much increased memory usage. However, with indirectly attached APM transactions, the memory usage will be made much worse. One way to see if your application has a growing number of Promises in memory is by including this snippet of code (and the APM agent can be disabled to exclude it):
// Count live Promise async resources: record each on creation, drop it on destroy.
const promises = new Set()
require('async_hooks').createHook({
  init: (asyncId, type) => {
    if (type === 'PROMISE') {
      promises.add(asyncId)
    }
  },
  destroy: (asyncId) => {
    promises.delete(asyncId)
  }
}).enable()
// Log the count every 5s; a number that grows without bound under steady load
// suggests the application is leaking Promises.
setInterval(() => {
  console.log(`Promises in memory: ${promises.size}`)
}, 5000)

vemundeldegard commented 2 years ago

@trentm Happy holidays! I recently added APM to our Node.js application and instantly had memory troubles. After downgrading to v3.23.0 and setting asyncHooks: false, the issues are no longer present. I don't know whether asyncHooks has any effect on versions prior to 3.24.0?

trentm commented 2 years ago

@vemundeldegard Hi! If you are able to try with v3.23.0 and asyncHooks: true (the default value), then that would be helpful/useful information.

If my understanding of the issue is correct, then there should only be increased memory pressure with v3.24.0 or later of the agent and with asyncHooks: true.

trentm commented 2 years ago

All,

elastic-apm-node@3.27.0 was just released, which includes a fix (https://github.com/elastic/apm-agent-nodejs/pull/2528) for potential high-memory use in versions 3.24.0 - 3.26.0 of the agent. I hope this resolves the issues that all or most of you were seeing, but I cannot be sure.

If/when you are able to try v3.27.0 -- and without the asyncHooks: false workaround -- I would love to hear whether or not this fixes the high-memory usage you were seeing. If not, then if you are able, I'd like to work with you to try to understand what you are seeing.


The quick summary of my understanding of the issue, if you are interested, is:

KnutFr commented 2 years ago

Hi @trentm, I'm working with @Tirke and we've just released an update to our stack with the latest patch of APM (3.27.0), without the asyncHooks workaround, and so far so good.

[screenshot: memory usage graph, 2022-01-20]

Thanks a lot for your support.

trentm commented 2 years ago

All, I'm going to close this now. My hope is that this resolves the issue for everyone that posted about it. I'd still very much welcome confirmation (like from @KnutFr above, thanks!) one way or the other. If it doesn't resolve the memory-usage issue for you, please feel free to open a new issue so we can dig into it (or re-open this one).

jgrayauskie commented 2 years ago

Thanks @trentm just recently we upgraded to 3.27.0 in one of our applications and no longer see the memory issues! Thanks again. ❤️