elastic / apm-agent-nodejs

https://www.elastic.co/guide/en/apm/agent/nodejs/current/index.html
BSD 2-Clause "Simplified" License

Memory leak in version > 3.20.0 #2489

Closed kurigashvili-godaddy closed 2 years ago

kurigashvili-godaddy commented 2 years ago

Describe the bug

We upgraded all ten of our team's services to 3.24.0, and all of them started running out of memory.

All of them are running in K8s. It would take between 30 minutes and 3 hours for a pod to run out of memory and restart.

We downgraded to 3.20.0 and the issue disappeared.

We were not seeing the issue in our staging environment, so I believe it only becomes noticeable under higher load.

To Reproduce

Steps to reproduce the behavior: run an HTTP service under high load and observe that memory slowly increases until the Node.js process eventually dies with a "JavaScript heap out of memory" error.

Expected behavior

Do not run out of memory, as with version 3.20.0.

Environment (please complete the following information)

How are you starting the agent? (please tick one of the boxes)

Additional context

astorm commented 2 years ago

Thanks for reporting @kurigashvili-godaddy, that's definitely not the intended behavior.

Usually the fastest path to resolution on these sorts of issues is an isolated bit of code that reproduces the problem. Do you happen to have an isolated reproduction of this problem (GitHub repository, Dockerfile)? If not, could you share what these services are doing? i.e. what web framework you're using, what libraries you're using to query databases/datastores, any third party promise libraries you might be using, etc.

Basically any information that would help us reproduce the problem so we can diagnose what's going on.

kurigashvili-godaddy commented 2 years ago

I will try to create an isolated reproduction of the problem. The simplest service that had the problem is a CRUD service based on Express 4.17.1 talking to DynamoDB (aws-sdk 2.983.0).
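
For reference, a minimal skeleton of that kind of service (Express plus the aws-sdk v2 DocumentClient) might look like the sketch below; the service name, region, table, and key are placeholders, and this is not the reporter's actual code.

// Require and start the agent before any other module is loaded.
const apm = require('elastic-apm-node').start({
  serviceName: 'repro-crud' // placeholder
})
const express = require('express')
const AWS = require('aws-sdk')

const app = express()
// Placeholder region; credentials come from the usual AWS environment/instance config.
const docClient = new AWS.DynamoDB.DocumentClient({ region: 'us-east-1' })

app.get('/items/:id', async (req, res) => {
  try {
    // Placeholder table and key; the point is only that each request does a DynamoDB round trip.
    const result = await docClient.get({ TableName: 'items', Key: { id: req.params.id } }).promise()
    res.json(result.Item || {})
  } catch (err) {
    res.status(500).json({ error: err.message })
  }
})

app.listen(3000, () => console.log('listening on :3000'))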

trentm commented 2 years ago

@kurigashvili-godaddy Thanks. Some other questions that might help us isolate the problematic changes:

kurigashvili-godaddy commented 2 years ago

I tried to reproduce the issue locally but was not able to :(

trentm commented 2 years ago
  • This issue only happened in prod, so it is a little tricky to test in prod.

Definitely understood.

I haven't reproduced with a trivial Express-using app (node v12.22.6, elastic-apm-node@3.24.0) loaded up with hey -c 50 -q 10 -z 180m http://localhost:3000/. Obviously this isn't representative.

const apm = require('elastic-apm-node').start({
    serviceName: 'trentm-leakplay'
})

const express = require('express')
const app = express()
const port = 3000

function doAThing(opts, cb) {
  // Simulate a small amount of async work (completes within ~10ms).
  setTimeout(cb, Math.random() * 10)
}

app.get('/', (req, res) => {
  const s = apm.startSpan('thing')
  doAThing({}, function () {
    s.end()
    res.send('pong')
  })
})

app.listen(port, () => {
  console.log(`listening at http://localhost:${port}`)
})

vkrivtsov-stuzo commented 2 years ago

@trentm we have the same issue with this version on particular services; most of the services are doing well. I will provide a report a bit later.

trentm commented 2 years ago

@vkrivtsov-stuzo I would love to see details on your experiences here. Anything that helps isolate common dependencies or the exact APM agent versions that could be causing the leak would be useful.

Tirke commented 2 years ago

Hello, we have also been having some strange memory issues under load recently. Do you have any update on this? We are also using DynamoDB in many of our services.

@kurigashvili-godaddy maybe one of the differences between your local test and your production is that you have more errors logged in production?

@trentm From a heap dump taken in production I see a lot of memory being retained by AsyncHooksRunContextManager (the biggest Map is also related to that): [screenshot]
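
(For anyone wanting to capture a comparable heap snapshot from a running Node.js service, one option is the built-in v8.writeHeapSnapshot(); the sketch below triggers it on SIGUSR2, which is an arbitrary choice of signal. The resulting .heapsnapshot file can be opened in the Memory tab of Chrome DevTools.)

// Sketch: write a heap snapshot when the process receives SIGUSR2.
// Note: writeHeapSnapshot() is synchronous and can temporarily require
// roughly as much memory as the heap itself, so use it carefully in prod.
const v8 = require('v8')

process.on('SIGUSR2', () => {
  const file = v8.writeHeapSnapshot() // writes a .heapsnapshot file into the working directory
  console.log(`heap snapshot written to ${file}`)
})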

trentm commented 2 years ago

@Tirke I'd love to get some details from you to help track this down, or for me to be able to repro.

trentm commented 2 years ago

@Tirke Another question:

Tirke commented 2 years ago

@trentm

As we have prod secrets in our dumps, I'm not sure sending them is a possibility but I'll see what we can do.

Here are two separate dump extracts from the same service, with a one-hour interval between them. The first dump was taken 30 minutes after the service booted, so the second is from 1h30 after boot. Unfortunately, after 1h30 this service crashes with OOM errors.

First screenshot, 30 minutes after boot: [heap dump screenshot]

Second one, 1h after the first: [heap dump screenshot]

I think that AsyncHooksRunContextManager, Transaction and RunContext all come from the lib, and their retained size keeps growing over just one hour.

Everything is running in ECS containers with 2 vCPU and 4 GB of memory, on Node 16 Alpine images. Node is started with --max-old-space-size=3584.

Our services dependencies:

"@aws-sdk/client-dynamodb": "^3.37.0",
"@aws-sdk/client-s3": "^3.37.0",
"@aws-sdk/client-secrets-manager": "^3.37.0",
"@aws-sdk/credential-provider-node": "^3.38.0",
"@aws-sdk/s3-request-presigner": "^3.37.0",
"@elastic/elasticsearch": "^7.15.0",
"@godaddy/terminus": "^4.9.0",
"@nestjs/axios": "^0.0.3",
"@nestjs/common": "^8.1.1",
"@nestjs/core": "^8.1.1",
"@nestjs/jwt": "^8.0.0",
"@nestjs/passport": "^8.0.1",
"@nestjs/platform-fastify": "^8.1.1",
"@nestjs/schedule": "^1.0.2",
"@nestjs/swagger": "^5.1.1",
"@nestjs/terminus": "^8.0.1",
"@prisma/client": "3.2.1",
"@types/bcrypt": "^5.0.0",
"algoliasearch": "^4.10.5",
"axios": "^0.23.0",
"bcrypt": "^5.0.1",
"chalk": "^4.1.2",
"class-transformer": "0.4.0",
"class-validator": "^0.13.1",
"cls-rtracer": "^2.6.0",
"crypto-random-string": "^3.3.1",
"dotenv": "^10.0.0",
"dotenv-flow": "^3.2.0",
"elastic-apm-node": "^3.26.0",
"email-deep-validator": "^3.3.0",
"fastify-cookie": "^5.3.1",
"fastify-passport": "^0.4.3",
"fastify-secure-session": "^3.0.0",
"fastify-swagger": "^4.12.4",
"fs-extra": "^10.0.0",
"jest-mock-extended": "^2.0.4",
"launchdarkly-node-server-sdk": "^6.2.0",
"lodash": "^4.17.21",
"logfmt": "^1.3.2",
"luxon": "^2.0.2",
"nest-router": "^1.0.9",
"nest-winston": "^1.6.1",
"newrelic": "^8.5.0",
"npm-config": "^0.2.0",        
"passport": "^0.5.0",
"passport-jwt": "^4.0.0",
"passport-oauth2": "^1.6.1",
"pretty-bytes": "^5.6.0",
"qs": "^6.10.1",
"reflect-metadata": "^0.1.13",
"rimraf": "^3.0.2",
"rxjs": "^7.4.0",
"sqreen": "^1.64.2",
"uuid": "^8.3.2",
"winston": "^3.3.3"

We will deploy with asyncHooks: false soon, will keep you updated.

Tirke commented 2 years ago

Hello, we deployed with asyncHooks: false and elastic-apm-node 3.25.0 this morning.

[screenshot: memory usage graph, 2021-12-16]

The red line is the rolling update of our services. Average memory usage has been very stable since the change. The two usual suspects (the red and Earls Green lines) are no longer rapidly increasing their memory usage.

karankumarslicepay commented 2 years ago

@Tirke @trentm I am facing the same problem when I upgraded the mongoose version to v6

This is the log (truncated):

[object Object]
    at Transaction.createSpan (/home/app/node_modules/elastic-apm-node/lib/instrumentation/transaction.js:146:10)
    at Transaction.startSpan (/home/app/node_modules/elastic-apm-node/lib/instrumentation/transaction.js:120:21)
    at Instrumentation.startSpan (/home/app/node_modules/elastic-apm-node/lib/instrumentation/index.js:420:26)
    at Agent.startSpan (/home/app/node_modules/elastic-apm-node/lib/agent.js:172:42)
    at MongoClient.onStart (/home/app/node_modules/elastic-apm-node/lib/instrumentation/modules/mongodb.js:73:24)
    at MongoClient.emit (events.js:314:20)
    at MongoClient.EventEmitter.emit (domain.js:483:12)
    at Topology.<anonymous> (/home/app/node_modules/mongoose/node_modules/mongodb/lib/operations/connect.js:61:53)
    at Topology.emit (events.js:314:20)
    at Topology.EventEmitter.emit (domain.js:483:12)
    at Server.<anonymous> (/home/app/node_modules/mongoose/node_modules/mongodb/lib/sdam/topology.js:558:42)
    at Server.emit (events.js:314:20)
    at Server.Eve...

Can you share the config? I don't know where to add the asyncHooks: false setting. Also, is it a global setting for Node.js?

trentm commented 2 years ago

@karankumarslicepay The asyncHooks: false setting is referring to this Elastic Node.js APM agent config var. To be clear, this is a workaround for this issue. Setting asyncHooks: false changes the APM agent's internal tracking of async tasks in Node.js to a limited technique that cannot track parent-child relationships in spans in some situations.
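
For anyone wondering where that setting goes, a minimal sketch, assuming the agent is started in code (the serviceName value is just a placeholder):

// The agent should be required and started before anything else in the app.
const apm = require('elastic-apm-node').start({
  serviceName: 'my-service', // placeholder
  asyncHooks: false // workaround: use the agent's more limited context-tracking technique
})

If the agent is configured via environment variables or a config file instead, the same option can be set there as well, per the agent's configuration docs.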

I am facing the same problem when I upgraded the mongoose version to v6

I think there could be something else going on for your case, or at least some issues specific to your case. Would you be willing to open a separate issue so we can discuss there? Thanks.

trentm commented 2 years ago

Some notes and requests for help for those following this (I see a number of people ❤️ 'd @Tirke's helpful comment showing that the asyncHooks: false workaround helps):

Some details on my current understanding/guesses

Internal details that you can skip if not interested: Version 3.24.0 of the Node.js APM agent introduced a significant change in how the agent tracks async tasks. Part of the change involves the agent having a "run context" object for each async task -- the things that are put on the Node.js event loop. Those run context objects have a reference to the transaction and spans that a given async task is part of (e.g. the run context for an HTTP request handler has a reference to the APM transaction that collects the trace data for that HTTP request/response).
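
To make that concrete, the sketch below illustrates the general idea (it is not the agent's actual code): a hook keeps a Map from each async task's id to its run context, and each run context can reference the active transaction. An entry is only removed when the corresponding async resource is destroyed, so anything that keeps async tasks alive also keeps their transactions reachable.

// Simplified illustration of async_hooks-based run-context tracking.
// This is NOT the agent's implementation, only the general idea.
const asyncHooks = require('async_hooks')

const runContexts = new Map() // asyncId -> run context (which references the active transaction)

asyncHooks.createHook({
  init (asyncId, type, triggerAsyncId) {
    // A newly created async task inherits the run context (and the
    // transaction reference it carries) of the task that created it.
    const parent = runContexts.get(triggerAsyncId)
    if (parent) {
      runContexts.set(asyncId, parent)
    }
  },
  destroy (asyncId) {
    // The reference is only dropped when the async resource is destroyed,
    // so long-lived or leaked async tasks keep their transaction in memory.
    runContexts.delete(asyncId)
  }
}).enable()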

When a transaction ends (e.g. the HTTP response has been sent), the transaction object stays in memory until all the associated async tasks complete. Two things can happen in an application here that will both result in increased memory usage:

  1. If some of those associated async tasks complete, but only complete long after the response has been sent, then the APM transaction object will stay in memory for that "long" time after the response. For example doSomeTask in the following lives for 10s after the HTTP response has ended:
const apm = require('elastic-apm-node').start()
const express = require('express')
const app = express()
const port = 3000

// Some async task that takes 10s to complete in the background.
async function doSomeTask () {
  return new Promise((resolve) => { 
    setTimeout(resolve, 10000) 
  })
}

app.get('/ping', (req, res) => {
  doSomeTask()
  res.send('pong')
})

app.listen(port, () => {
  console.log(`listening at http://localhost:${port}`)
})

If the application is under constant load, then this results in higher memory usage. The APM agent exacerbates a situation like this because the APM transaction objects can be large (for large requests) compared to a small Promise.

  2. If there is a leak of async tasks in the application (e.g. leaked Promises) then, with the APM agent, those can keep APM transactions alive in memory. By themselves Promises are very small, and an application could leak many of them without noticing much increased memory usage. However, with indirectly attached APM transactions, the memory usage will be made much worse. One way to see if your application has a growing number of Promises in memory is by including this snippet of code (and the APM agent can be disabled to exclude it):
// Count live Promise async resources: record each on creation, drop it on destroy.
const promises = new Set()
require('async_hooks').createHook({
  init: (asyncId, type) => {
    if (type === 'PROMISE') {
      promises.add(asyncId)
    }
  },
  destroy: (asyncId) => {
    promises.delete(asyncId)
  }
}).enable()
// Log the count every 5s; a number that grows without bound under steady load
// suggests the application is leaking Promises.
setInterval(() => {
  console.log(`Promises in memory: ${promises.size}`)
}, 5000)

vemundeldegard commented 2 years ago

@trentm Happy holidays! I recently added APM to our Node.js application and instantly had memory troubles. After downgrading to v3.23.0 and setting asyncHooks: false, the issues are no longer present. I don't know whether asyncHooks has any effect on versions prior to 3.24.0?

trentm commented 2 years ago

@vemundeldegard Hi! If you are able to try with v3.23.0 and asyncHooks: true (the default value), then that would be helpful/useful information.

If my understanding of the issue is correct, then there should only be increased memory pressure with v3.24.0 or later of the agent and with asyncHooks: true.

trentm commented 2 years ago

All,

elastic-apm-node@3.27.0 was just released, which includes a fix (https://github.com/elastic/apm-agent-nodejs/pull/2528) for potential high-memory use in versions 3.24.0 - 3.26.0 of the agent. I hope this resolves the issues that all or most of you were seeing, but I cannot be sure.

If/when you are able to try v3.27.0 -- and without the asyncHooks: false workaround -- I would love to hear whether or not this fixes the high-memory usage you were seeing. If not, then if you are able, I'd like to work with you to try to understand what you are seeing.


The quick summary of my understanding of the issue, if you are interested, is:

KnutFr commented 2 years ago

Hi @trentm, I'm working with @Tirke and we've just released an update to our stack with the latest patch of APM (3.27.0), without the asyncHooks workaround, and so far so good.

[screenshot: memory usage graph, 2022-01-20]

Thanks a lot for your support.

trentm commented 2 years ago

All, I'm going to close this now. My hope is that this resolves the issue for everyone that posted about it. I'd still very much welcome confirmation (like from @KnutFr above, thanks!) one way or the other. If it doesn't resolve the memory-usage issue for you, please feel free to open a new issue so we can dig into it (or re-open this one).

jgrayauskie commented 2 years ago

Thanks @trentm just recently we upgraded to 3.27.0 in one of our applications and no longer see the memory issues! Thanks again. ❤️