DataDog / dd-trace-js

JavaScript APM Tracer
https://docs.datadoghq.com/tracing/

memory leak observed since 2.11 #2640

Closed amirbarkai closed 9 months ago

amirbarkai commented 1 year ago

Expected behaviour
No leak.

Actual behaviour
We recently upgraded from an old dd-trace version, 0.31, to the latest, 3.9.3 (a big jump). Since the upgrade, we're experiencing a notable memory leak. After some investigation, mainly applying binary search to the dd-trace version, we concluded that the leak starts in version 2.11 (2.10 is the last version where our benchmark didn't show the leak). It happens in a much more prominent way in PubSub subscription-heavy services.

The load testing and benchmarks were run with the same script, for the same duration.

Environment

[screenshot: environment details and memory usage]
tlhunter commented 1 year ago

From the memory usage screenshot it looks like the memory growth is bounded, or in other words, the library consumes a flat increase in memory (approx 50MB) but that memory consumption doesn't continuously grow. Does that observation align with what you're seeing?

amirbarkai commented 1 year ago

Nope, that's not correct. It's bounded just because the stress test ran for 15 minutes each time, and it grew throughout that time. After the test ends, GC runs, but memory doesn't go back to its original size; it only drops by 10-15%. In production, the fact that GC doesn't clear it over time causes OOMs. It's a bit difficult to see this in the attached image, because in some cases we saw the increase, concluded straight away that it was a "leaked" version, and moved on to deploy a new version.

Something in 2.11 started consuming drastically more memory that doesn't get freed.

tlhunter commented 1 year ago

Can you enable runtime metrics via DD_RUNTIME_METRICS_ENABLED=true and also enable tracking of any spans that might be leaking via DD_TRACE_EXPERIMENTAL_SPAN_COUNTS=true? With those enabled please share the resulting runtime.node.spans.open and runtime.node.spans.open.by.name metrics with me via email at team-apmintegrationsnode {at} datadoghq [dot] com.
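If it's easier, roughly the same settings can be applied programmatically; here's a minimal sketch (assuming the span-count flag is only read from the environment, so it's set on process.env before the tracer is initialized):

```js
// Minimal sketch: enable runtime metrics and the experimental span counts.
// Setting the env var on process.env before init() has the same effect as
// exporting DD_TRACE_EXPERIMENTAL_SPAN_COUNTS=true in the shell.
process.env.DD_TRACE_EXPERIMENTAL_SPAN_COUNTS = 'true';

const tracer = require('dd-trace').init({
  runtimeMetrics: true, // equivalent to DD_RUNTIME_METRICS_ENABLED=true
});
```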

With that information I should be able to deduce which spans aren't being garbage collected and which integration they're associated with.

If those metrics come back empty then I'll probably need a heapdump of your process before and after a benchmark. That's usually a little more involved to collect so the metrics are a fine first step.

amirbarkai commented 1 year ago

Will do. Thanks

floriantz commented 1 year ago

I can confirm there is an issue with memory usage after bumping up from the 2.9.1 version we were using. This is the RAM usage of our Node.js app on a container with 4 GB of memory:

[screenshot: RAM usage, 2023-01-19]

After bumping the version back down, we went back to our usual pattern of hovering around 15% RAM usage. The memory was often freed during the night around midnight, so we didn't run into OOM trouble, but it's still a concern.

sjgupta19 commented 1 year ago

We upgraded the dd-trace package from 2.10.0 to 2.23.0 and started seeing memory spikes in the PubSub consumers. [screenshot: memory spikes]

DavitSo commented 1 year ago

Hi all, is there any update on this? [screenshot]

Shoshvany commented 1 year ago

tried "vanilla" implementation with ver 4.1 (single high memory pod, single index JS):

```js
const datadog = require('dd-trace');
const { PubSub: GooglePubSub } = require('@google-cloud/pubsub');

const environment = 'staging';
const serviceName = 'memoryleak-benchmark';
const projectId = 'climacell-platform-staging';
const logger = console;

function initializeDatadog() {
    datadog.init({
        logInjection: true,
        service: serviceName,
        env: environment,
        logLevel: 'debug',
    });
}

function initializePubSubPulling() {
    const pubsub = new GooglePubSub({ projectId });

    const subscriberOptions = {
        ackDeadline: 100,
        flowControl: {
            maxMessages: 20,
            allowExcessMessages: false,
        },
    };

    const subscription = pubsub.subscription('memory-leak-test-subscription', subscriberOptions);

    logger.info('Ready to receive messages');

    async function errorHandler(error) {
        logger.error('Error in subscription', { error });
    }

    subscription.on('message', handler);
    subscription.on('error', errorHandler);

    let closed = false;

    function exitHandler() {
        if (!closed) {
            logger.info('Closing pubsub subscriber');
            subscription.removeListener('message', handler);
            subscription.removeListener('error', errorHandler);
            subscription.close();
            closed = true;
        }
    }

    process.on('exit', exitHandler);
    process.on('SIGINT', exitHandler);
    process.on('SIGUSR1', exitHandler);
    process.on('SIGUSR2', exitHandler);
}

async function handler(message) {
    message.ack();
    logger.info('handler called');
}

initializeDatadog();
initializePubSubPulling();

```

With --inspect (Node 16):

Each test was run with high RPS and large payloads. [screenshot]

Checked one hour after each test: [screenshots]

Datadog objects increase during the load and barely change afterwards. Using version 2.10 for the same test: [screenshot]

rochdev commented 1 year ago

@Shoshvany Can you share the rest of your test setup? For example, how are you sending load to the app?

rochdev commented 1 year ago

I tried to set up my own stress test, but I ended up seeing a memory leak regardless of the dd-trace version, and even without dd-trace, so I'm not sure how to properly look into the issue; it seems there is a leak in the PubSub library itself.

@Shoshvany I'll definitely need your test setup to investigate further.

In the meantime, I'll look into changes that happen between 2.10.0 and 2.11.0. There were changes to http2 and grpc-js integrations which are used by PubSub, so the problem might be in one of those.

Shoshvany commented 1 year ago

@rochdev I'm using another pod with a dedicated endpoint (that publishes to PubSub) and BlazeMeter to generate load on that endpoint.

It's important to note that I use large PubSub payloads (~22 KB JSON) in order to see the problem.
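For context, the load-generating pod is roughly the sketch below: an HTTP endpoint that publishes a large (~22 KB) JSON payload to the test topic on each request (the endpoint path, topic name, and payload shape here are assumptions, not the exact setup):

```js
// Hypothetical sketch of the publisher pod: BlazeMeter hits POST /publish,
// which publishes a large (~22 KB) JSON payload to the test topic.
const express = require('express');
const { PubSub } = require('@google-cloud/pubsub');

const pubsub = new PubSub({ projectId: 'climacell-platform-staging' });
const topic = pubsub.topic('memory-leak-test-topic'); // assumed topic name
const largePayload = { data: 'x'.repeat(22 * 1024) }; // roughly 22 KB once serialized

const app = express();

app.post('/publish', async (req, res) => {
  await topic.publishMessage({ json: largePayload });
  res.sendStatus(204);
});

app.listen(8080);
```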

Shoshvany commented 1 year ago

@rochdev, I think you are correct. I added tracer.use("grpc", false) after datadog.init and the memory leak disappeared.
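For anyone else hitting this, the workaround looks roughly like the following (a minimal sketch; the init options are taken from the benchmark snippet above):

```js
// Minimal sketch: initialize the tracer, then disable the grpc integration.
const tracer = require('dd-trace').init({
  service: 'memoryleak-benchmark',
  env: 'staging',
});

tracer.use('grpc', false); // passing false disables the grpc plugin
```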

mmarteauWoop commented 1 year ago

We experienced similar memory leaks in our apps. Disabling the grpc plugin via the DD_TRACE_DISABLED_PLUGINS=grpc environment variable solved the problem.

I made a minimal reproduction to be sure that it was indeed the problem, and not the Google Pub/Sub library itself:

import {PubSub} from '@google-cloud/pubsub';

const pubsub = new PubSub()

try {
  await new Promise(resolve => setTimeout(resolve, 10000))
  await pubsub.createSubscription('localTopic', 'localSubscription')
  console.log('created subscription')
} catch (e) {
  console.log('subscription already exists')
}

const subscription = pubsub.subscription('localSubscription')

let i = 0
subscription.on('message', message => {
  i++
  message.ack()
})

I then publish 500 events, wait for a while, and then send a massive batch of 10k events (one event every 10 ms).
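For completeness, the publishing side is roughly the following (a hypothetical sketch; the topic name comes from the repro above and the timing values are approximations):

```js
// Hypothetical publisher for the repro above: 500 warm-up events, a pause,
// then 10k events at roughly one every 10 ms.
import { PubSub } from '@google-cloud/pubsub';

const pubsub = new PubSub();
const topic = pubsub.topic('localTopic');
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

const publish = i => topic.publishMessage({ json: { i } });

for (let i = 0; i < 500; i++) {
  await publish(i);
  await sleep(10);
}

await sleep(60000); // wait for a while

for (let i = 0; i < 10000; i++) {
  await publish(i);
  await sleep(10);
}
```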

Here is the result with --require dd-trace/init:

[screenshots]

And with the Datadog grpc plugin disabled via the environment variable:

[screenshots]

As you can see, the garbage collector fails to collect a lot of objects with the grpc plugin enabled. After checking with the profiler, it's really obvious.

This is quite annoying for us, as it made us scale up our pods for nothing...

tlhunter commented 1 year ago

Can you upgrade to v4.7.0 / v3.28.0 / v2.41.0 and see if the memory leak is any better? In those releases we included a big change to gRPC with the intent to fix the memory leak.

tlhunter commented 9 months ago

Looks like this might have been solved by the aforementioned grpc changes. I'll close it for now but feel free to reopen if I'm off.