elastic / apm-agent-nodejs

https://www.elastic.co/guide/en/apm/agent/nodejs/current/index.html
BSD 2-Clause "Simplified" License

Memory leak in v3.28.0 #2569

Closed ddolcimascolo closed 2 years ago

ddolcimascolo commented 2 years ago

Describe the bug

High memory usage in v3.28.0, to the point where the Node processes become unresponsive; restarting reproduces the issue.
I just reverted to v3.27.0 and will observe the charts for a few hours. I'll report back here.

In the screenshot below (from the Kibana APM app itself), the elastic-apm-node upgrade was merged on Feb. 9th at 3am (Paris time); this process is automatic. The app was restarted the same day at around 1pm.

(screenshot: Kibana APM memory chart, kibana-01.storcloud.org, captured 2022-02-10 09:24)

To Reproduce

Steps to reproduce the behavior:

  1. Upgrade a running application using elastic-apm-node v3.27.0 to v3.28.0
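For the record, flipping between the two versions is just an exact pin in package.json (surrounding fields are illustrative; the versions are the ones from this thread):

```json
{
  "dependencies": {
    "elastic-apm-node": "3.28.0"
  }
}
```

Using an exact version rather than a `^` range makes it easy to move back and forth between 3.27.0 and 3.28.0 while narrowing this down.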

Expected behavior

Nothing special, APM keeps on collecting data as usual.

Environment (please complete the following information)

How are you starting the agent? (please tick one of the boxes)

Additional context

ddolcimascolo commented 2 years ago

Hi @trentm,

I'm willing to help diagnose the issue. What can I provide to help track down where the leak comes from?

Cheers, David

ddolcimascolo commented 2 years ago

Side note: I see in the diff between 3.27.0 and 3.28.0 that you added support for exit spans for ES (very nice BTW :pray:). Dunno if that's related or not, but we do use extended @elastic/elasticsearch Client and Transport classes.

  1. We extend Client to register some event handlers for logging and to plug in our custom Transport
class Client extends elasticsearch.Client {

  constructor() {
    super({
      Transport // Our custom Transport class
    });

    this.registerEventLoggers();
  }

  registerEventLoggers() {
    this.on('request', (err, result) => {
      const { connection: { url }, attempts, request: { id, params: { method, path, querystring, body } } } = result.meta;

      logger.debug('elasticsearch/%s >> (%d) %s %s%s?%s %s', id, attempts, method, url, path, querystring, body, { err });
    });

    this.on('response', (err, result) => {
      const { statusCode, body, meta: { request: { id } } } = result;

      logger.debug('elasticsearch/%s << %d %j', id, statusCode, body, { err });
    });

    this.on('sniff', (err, result) => {
      const { meta: { request: { id }, sniff: { reason, hosts } } } = result;

      logger.info('elasticsearch/%s << Sniffing (%s) returned %j', id, reason, hosts, { err });
    });
  }

}
  2. We extend Transport to override request(params, options, callback), to set some request options automatically and to unwrap the response body from the result object. This is what the legacy client was doing and we kept that behavior when we upgraded to the @elastic/elasticsearch client. Fun fact: this is what the client will do by default in v8 :)
class Transport extends elasticsearch.Transport {

  request(params, options, callback) {
    /**
     * For search requests we append ?timeout=xxx as a query parameter
     * This allows to transparently use a global configuration setting for all requests
     */
    if (/_search$/.test(params.path)) {
      _.defaultsDeep(params, {
        querystring: {
          timeout: `${config.ES_REQUEST_TIMEOUT}ms`
        }
      });
    }

    const result = super.request(params, options, callback);

    if (!callback && !(options && options.asStream)) {
      return result.then(({ body }) => body); // Unwraps body from the returned result, to emulate what the old ES client does
    }

    return result;
  }

}
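The body-unwrapping branch in that request() override can be isolated into a self-contained sketch. The base class below is a stub standing in for @elastic/elasticsearch's Transport (it resolves with a canned result envelope), so this only illustrates the pattern, not the real client:

```javascript
// Stub standing in for @elastic/elasticsearch's Transport: like the v7
// client, it yields a full result envelope ({ body, statusCode, ... }).
class FakeBaseTransport {
  request (params, options, callback) {
    const result = { body: { hits: { total: 1 } }, statusCode: 200 }
    if (callback) {
      callback(null, result)
      return
    }
    return Promise.resolve(result)
  }
}

// Same pattern as the Transport subclass above: with no callback and no
// asStream option, resolve with result.body instead of the envelope,
// emulating what the legacy client returned.
class UnwrappingTransport extends FakeBaseTransport {
  request (params, options, callback) {
    const result = super.request(params, options, callback)
    if (!callback && !(options && options.asStream)) {
      return result.then(({ body }) => body)
    }
    return result
  }
}

new UnwrappingTransport()
  .request({ path: '/idx/_search' }, {})
  .then((body) => console.log(JSON.stringify(body))) // {"hits":{"total":1}}
```

Callers in promise style get the bare body, while callback and asStream callers still see the untouched envelope.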
ddolcimascolo commented 2 years ago

After reverting the lib to v3.27.0

(screenshot: Kibana APM memory chart, kibana-01.storcloud.org, captured 2022-02-10 11:33)

trentm commented 2 years ago

@ddolcimascolo Thanks for reporting! Ugh, sorry about this. Yes, the best guess is probably the changes in the @elastic/elasticsearch instrumentation in commit 10f533e878f8b04781ab407a0ba9dc04909861a3 (#2557). The immediately preceding change is commit cff14650b3a5af12add139bc401fb2461a9b341d.

  1. I don't know how easy or disruptive it is for you to try, but a potentially interesting experiment would be to try with that preceding commit, but having this in your package.json:
  "elastic-apm-node": "elastic/apm-agent-nodejs#cff14650b3a5af12add139bc401fb2461a9b341d"

If that does not show the leak, then that almost certainly narrows it down to commit 10f533e8.

  2. Getting a heap snapshot of the process with agent v3.28.0 when it has leaked a lot of memory would help.

The way I've done this in the past is to add this signal handler to my node.js app:

const path = require('path')
const fs = require('fs')
const v8 = require('v8')
console.warn(`"node ${path.basename(process.argv[1])}" is running as PID ${process.pid}: use 'kill -SIGUSR2 ${process.pid}' to create a heapsnapshot`)
process.on('SIGUSR2', function () {
  var stamp = new Date().toISOString().replace(/:/g, '-').split('.')[0]
  v8.getHeapSnapshot()
    .pipe(fs.createWriteStream(`leak-${stamp}.heapsnapshot`))
    .on('close', function () {
      console.warn(`Created leak-${stamp}.heapsnapshot`)
    })
})

This will contain internal details about your app, so I understand if you cannot share the file. Or you could share it with me privately and I'd delete it after looking.

I'll get trying to reproduce this shortly.

trentm commented 2 years ago

> I'll get trying to reproduce this shortly.

I can repro with a simple express app that queries ES:

var AGENT_DIR = './'
// AGENT_DIR = '/Users/trentm/el/apm-agent-nodejs'
tracer = require(AGENT_DIR).start({
  serviceName: 'leak3280play',
  // serverUrl: 'http://host.docker.internal:8200', // when in docker
  logUncaughtExceptions: true,
  metricsInterval:'0s',
  breakdownMetrics:false,
  centralConfig:false,
  cloudProvider:'none',
  captureHeaders:false,
})
console.warn(`Using elastic-apm-node@${require(AGENT_DIR+'/package.json').version}`)

const { Client } = require('@elastic/elasticsearch')
const express = require('express')

const app = express()
const port = 3000
const client = new Client({
  node: `http://${process.env.ES_HOST || 'localhost'}:9200`
})

app.get('/', (req, res, next) => {
  client.search({ q: 'pants' }, function (err, result) {
    if (err) {
      next(err)
    } else {
      res.send(result)
    }
  })
})

app.listen(port, () => {
  console.log(`listening at http://localhost:${port}`)
})
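One way to make such a leak visible while driving the repro app, without taking full heap snapshots (an addition here, not something from the thread), is to log heap usage periodically and watch for steady growth across requests:

```javascript
// Log the V8 heap size every 10s; a leak shows up as monotonic growth
// under load. unref() keeps the timer from holding the process open.
const heapUsedMb = () => Math.round(process.memoryUsage().heapUsed / 1024 / 1024)
console.log(`heapUsed: ${heapUsedMb()} MB`)
setInterval(() => console.log(`heapUsed: ${heapUsedMb()} MB`), 10000).unref()
```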

I've moved the "latest" release of the agent back to v3.27.0 for now.

ddolcimascolo commented 2 years ago

Thx for checking @trentm. That was fast! I'll try to set things up on my side to get you a heap dump ASAP.

I can confirm that we're using express too, with some routes that bulk-index documents and others that query ES, in a way similar to your repro above.

David

trentm commented 2 years ago

@ddolcimascolo You can skip getting a heap dump. I understand the issue. I am just working through testing now. I am embarrassed at my mistake that caused this. Thanks for all the details!

trentm commented 2 years ago

elastic-apm-node@3.29.0 has been published with what I understand to be a fix for this.

ddolcimascolo commented 2 years ago

@trentm I can confirm that v3.29.0 fixes it. Thx again