Jeff-Lewis / cls-hooked

cls-hooked : CLS using AsyncWrap or async_hooks instead of async-listener for node 4.7+
BSD 2-Clause "Simplified" License

Context is preserved between two runPromise calls #63

Open shlomokraus opened 4 years ago

shlomokraus commented 4 years ago

Running the commands in sequence preserves the context. I would expect each call to create a clean context, as is the case with run.

Is that normal behavior?

    await namespace.runPromise(async () => {
        set("foo", "bar");
        const foo = get("foo");
        console.log("VAELU", foo);
      })

    await namespace.runPromise(async () => {
      const foo = get("foo");
      console.log("Value", foo); // Here is prints "bar" instead of undefined
    });
santanubasu commented 4 years ago

I'm seeing this as well. I store a transaction client via set("client", client). Client is released at end of transaction but in a subsequent runPromise call this client remains assigned to that property key, and throws an exception because an attempt is made to release the same client twice.

fabien commented 4 years ago

I think I'm seeing this in run calls as well, leading to weird issues between http requests. Once I started to clear any references (setting them to null), things were back to normal. This started appearing when switching from Node 10 to version 12.

yagodorea commented 4 years ago

I'm dealing with something like this. I have a system that receives HTTP requests and also subscribes to Pubsub topics, it creates a new context for each HTTP request (with ns.run()) and every Pubsub message (with ns.runAndReturn()).

Each transaction is supposed to generate a new transactionId (UUID), unless it is sent over by headers or in the message metadata, in that case a new context is created with but with the received transactionId, so we can trace the related transactions over our microservices.

The problem is we're seeing the same transactionId being repeated a lot across thousands of different unrelated requests, and we can only assume it's an issue with cls-hooked (and not the uuid generation lib).

I don't know if it's related to this issue, because I'm not using the ns.runPromise(), but it might be... I'm not yet able to consistently reproduce it, and as soon as I can, I'll post here or create a separate issue.

samwyma commented 3 years ago

I just came across this issue, too. I changed to using runAndReturn instead, and that seems to have fixed it for me. It should be a literal swap, since it will return the promise of your async methods.
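A sketch of the swap, reusing the namespace/set/get helpers from the original report (assumed to be bound to the same namespace):

await namespace.runAndReturn(async () => {
  set("foo", "bar");
  console.log("Value", get("foo")); // "bar"
});

await namespace.runAndReturn(async () => {
  console.log("Value", get("foo")); // expected: undefined (fresh context)
});

runAndReturn returns the callback's return value, so with an async callback you get its promise back and can await it directly.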

SimonX200 commented 3 years ago

Have a look at the last comments from me (DavidShard) under the first answer here: https://stackoverflow.com/questions/66665783/memory-leak-in-cls-hooked-or-wrong-use

This is a memory leak bug in cls-hooked that can be fixed by:

30c30
<   this._contexts = new Map();
---
>   this._contexts = {};
72c72
<   let context = Object.create(this.active ? this.active : Object.prototype);
---
>   let context = {...this.active };
316c316
<         namespace._contexts.set(asyncId, namespace.active);
---
>         namespace._contexts[asyncId] = namespace.active;
326c326
<         const triggerIdContext = namespace._contexts.get(triggerId);
---
>         const triggerIdContext = namespace._contexts[triggerId];
328c328
<           namespace._contexts.set(asyncId, triggerIdContext);
---
>           namespace._contexts[asyncId] = triggerIdContext;
364c364
<       context = namespace._contexts.get(asyncId) || namespace._contexts.get(currentUid);
---
>       context = namespace._contexts[asyncId] || namespace._contexts[currentUid];
397c397
<       context = namespace._contexts.get(asyncId) || namespace._contexts.get(currentUid);
---
>       context = namespace._contexts[asyncId] || namespace._contexts[currentUid];
424c424
<       namespace._contexts.delete(asyncId);
---
>       delete namespace._contexts[asyncId];

The effect is that 900k await ns.runPromise() calls leave only about 5000 contexts in memory.

Before the fix, node --max-old-space-size=100 was only able to execute 100k of them before the GC killed node.

Changing let context = {...this.active }; alone does not fix the memory leak; the leak is only gone together with replacing the Map by a plain object.

The test code (ns_simple.js):


const httpContext = require('express-http-context');

async function t2(l) {
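  // note: runPromise passes the context object as the argument, so `l` receives the context here (see the later comments)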
  if (l > 1) {
  }
}

async function t1() {
  let start = Date.now();
  for (let i = 0; i < 90000000; i++) {
    await httpContext.ns.runPromise(t2);
    if (i % 1000 === 0) {
      const now = Date.now();
      console.log(` ${i + 1}, dur=${now - start}ms`);
      start = now;
    }
  }
}

t1()

With node 14.15.5

$ time node --max-old-space-size=10 ns_simple.js
 1, dur=1ms
 1001, dur=30ms
 2001, dur=38ms
 3001, dur=55ms
 4001, dur=75ms
 5001, dur=115ms
 6001, dur=149ms
 7001, dur=186ms
 8001, dur=236ms
 9001, dur=294ms
 10001, dur=339ms
 11001, dur=391ms
 12001, dur=435ms
 13001, dur=480ms
 14001, dur=531ms
 15001, dur=579ms
 16001, dur=627ms
 17001, dur=666ms
 18001, dur=727ms
 19001, dur=794ms

<--- Last few GCs --->

[1664:00000229A831E200]     6834 ms: Mark-sweep (reduce) 8.6 (10.8) -> 8.6 (11.8) MB, 3.5 / 0.0 ms  (average mu = 0.813, current mu = 0.005) last resort GC in old space requested
[1664:00000229A831E200]     6838 ms: Mark-sweep (reduce) 8.6 (10.8) -> 8.6 (11.8) MB, 4.3 / 0.0 ms  (average mu = 0.678, current mu = 0.002) last resort GC in old space requested

<--- JS stacktrace --->

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: 00007FF620DA058F napi_wrap+109311
 2: 00007FF620D452B6 v8::internal::OrderedHashTable<v8::internal::OrderedHashSet,1>::NumberOfElementsOffset+33302
 3: 00007FF620D46086 node::OnFatalError+294
 4: 00007FF62161153E v8::Isolate::ReportExternalAllocationLimitReached+94
 5: 00007FF6215F63BD v8::SharedArrayBuffer::Externalize+781
 6: 00007FF6214A084C v8::internal::Heap::EphemeronKeyWriteBarrierFromCode+1516
 7: 00007FF62149BC61 v8::internal::Heap::AllocateExternalBackingStore+1569
 8: 00007FF6214B5E06 v8::internal::Factory::AllocateRaw+166
 9: 00007FF6214C9824 v8::internal::FactoryBase<v8::internal::Factory>::NewFixedArrayWithFiller+84
10: 00007FF6214C9B23 v8::internal::FactoryBase<v8::internal::Factory>::NewFixedArrayWithMap+35
11: 00007FF6212D0C6C v8::internal::OrderedHashTable<v8::internal::OrderedHashMap,2>::Allocate+124
12: 00007FF6212D346B v8::internal::OrderedHashTable<v8::internal::OrderedHashMap,2>::Rehash+59
13: 00007FF6212D224A v8::internal::OrderedHashTable<v8::internal::OrderedHashMap,2>::EnsureGrowable+90
14: 00007FF621209737 v8::internal::interpreter::JumpTableTargetOffsets::iterator::operator=+122119
15: 00007FF621699EED v8::internal::SetupIsolateDelegate::SetupHeap+463949
16: 00007FF6216753C8 v8::internal::SetupIsolateDelegate::SetupHeap+313640
17: 0000012C6E447C5F

real    0m6.984s
user    0m0.000s
sys     0m0.000s

With the fix:

$ time node --max-old-space-size=10 ns_simple.js 
 1, dur=1ms
 1001, dur=17ms
 2001, dur=10ms
 3001, dur=10ms
 4001, dur=6ms
 5001, dur=11ms
 6001, dur=7ms
 7001, dur=5ms
 8001, dur=6ms
 9001, dur=6ms
 10001, dur=12ms
 11001, dur=7ms
 12001, dur=6ms
 13001, dur=6ms
 14001, dur=5ms
...
...
...
 886001, dur=7ms
 887001, dur=12ms
 888001, dur=7ms
 889001, dur=9ms
 890001, dur=8ms
 891001, dur=7ms
 892001, dur=11ms
 893001, dur=10ms
 894001, dur=7ms
 895001, dur=9ms
 896001, dur=13ms
 897001, dur=6ms
 898001, dur=8ms
 899001, dur=7ms

real    0m7.788s
user    0m0.000s
sys     0m0.016s
SimonX200 commented 3 years ago

I think the problem is that Map() creates a cyclic dependency between the map and the context.

So releasing the context in this.exit() will never trigger the async_hook destroy to remove the context from the map. That's what I think.

Such a context object looks like this in a heap snapshot (retained sizes omitted):

Object @112915
  map :: system / Map @111615
  _ns_name :: "a6a29a6f-6747-4b5f-b99f-07ee96e32f88" @37301
  properties :: system @259
The cyclic dependency is the map property of a context that points to this._contexts in the namespace.

PhakornKiong commented 3 years ago

(PhakornKiong quotes SimonX200's full comment above.)

ns.runPromise() causes _contexts to grow by 4 entries for every run, so the issue you're facing is actually that the growth rate of _contexts is so high that the destroy hook never gets to run, hence the memory issue.

Let's simplify this: rerun ns_simple.js with ns.run, then add the following debug line to the destroy hook in context.js:

destroy(asyncId) {
      currentUid = async_hooks.executionAsyncId();
      if (DEBUG_CLS_HOOKED) {
        const triggerId = async_hooks.triggerAsyncId();
        const indentStr = ' '.repeat(namespace._indent < 0 ? 0 : namespace._indent);
        debug2(`${indentStr}DESTROY (${name}) currentUid:${currentUid} asyncId:${asyncId} triggerId:${triggerId} active:${util.inspect(namespace.active, {showHidden:true, depth:2, colors:true})} context:${util.inspect(namespace._contexts.get(currentUid))}`);
      }
      debug1(`deleteHookRan here`) // Add this line, so you can see when the destroy hook is called
      namespace._contexts.delete(asyncId);
    }
  });

// You can't console.log inside an async hook, so use process._rawDebug instead
function debug1(...args) {
    process._rawDebug(`${util.format(...args)}`);
}

There is a big difference between await ns.run(t2) and just ns.run(t2) in your ns_simple.js.

The await version will not fail, while the one without await will fail.

This is because await converts anything that isn't a Promise into a resolved Promise.
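A minimal illustration (my sketch, not from the thread):

async function demo() {
  console.log('A');
  const x = await 42;  // behaves like `await Promise.resolve(42)`
  console.log('C', x); // deferred to a microtask
}

demo();
console.log('B'); // prints before 'C': the await yielded control

Because each await yields control, the runtime gets a chance to clean up between iterations, which seems to be why the awaited loop in this thread survives while the un-awaited one does not.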

I still don't see how this is a memory leak issue; you're basically overwhelming the process in a way it wasn't designed to be used.

PhakornKiong commented 3 years ago

@SimonX200 The following might help you to visualize the difference

let start = Date.now();
for (let i = 0; i < 90000000; i++) {
    const context = await session.run(t2);
    console.log(context);
    // { _ns_name: 'benchmark', id: 5 }
    // { _ns_name: 'benchmark', id: 11 }
    // { _ns_name: 'benchmark', id: 14 }

    // const context = session.run(t2);
    // console.log(context);
    // { _ns_name: 'benchmark', id: 1 }
    // { _ns_name: 'benchmark', id: 1 }
    // { _ns_name: 'benchmark', id: 1 }

    if (i % 1000 === 0) {
      const now = Date.now();
      console.log(` ${i + 1}, dur=${now - start}ms`);
      start = now;
    }
  }
SimonX200 commented 3 years ago

@Darkripper214
Actually, when using async functions there will always be an await somewhere. So for me the change fixes my/our problem. I haven't run the regression tests, but I think they will pass as well.

That destroy never happens when using async functions without any await is fine for me.

I find it interesting that Map() creates a back-link to itself on all added objects. For me that is just another reason to use the most basic, directly supported features of a development ecosystem where possible.

I find it also interesting that Object.create(previous_context) creates such a deep object, and that the deeper the object becomes, the longer Object.create(...) takes. I wonder why it was used. Probably because of legacy node versions that did not support {...previous_context}? See the sketch below for the difference.
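A quick illustration of the difference (my own sketch, not cls-hooked code):

const parent = { foo: 'bar' };

const viaCreate = Object.create(parent); // empty object whose prototype is `parent`
console.log(viaCreate.foo);              // 'bar', found by walking the prototype chain
console.log(Object.keys(viaCreate));     // [], nothing was copied

const viaSpread = { ...parent };         // flat, own copy of the enumerable properties
console.log(viaSpread.foo);              // 'bar', an own property, no chain walk
console.log(Object.keys(viaSpread));     // ['foo']

With Object.create, every nested context adds one more link to the chain that property lookups must walk.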

Note:

There is a big difference when you do await ns.run(t2) and just ns.run(t2) in your ns_simple.js

Change your loop to: const context = session.run(t2); await 1;

and it works again. So any unrelated await will allow destroy to work.

PhakornKiong commented 3 years ago

(PhakornKiong quotes SimonX200's previous comment.)

Yep, there will be 2 promises generated when there is an async function (you can check via the init hook).

The point is that it was never a memory leak issue; the server was flooded before it could even destroy the contexts.

SimonX200 commented 3 years ago

@Darkripper214 I don't agree with "that it is never a memory leak issue", because:

Why is it fixed by just changing the Map to a plain Object and by not using Object.create()?

This does not change anything else except the extensive linking of contexts (from my point of view unnecessary: _contexts is the retainer, or the context is on some argument stack and retained by that; no one accesses the previous context through the current context).

SimonX200 commented 3 years ago
await namespace.runPromise(async () => {
    set("foo", "bar");
    const foo = get("foo");
    console.log("VAELU", foo);
  })

await namespace.runPromise(async () => {
  const foo = get("foo");
  console.log("Value", foo); // Here is prints "bar" instead of undefined
});

My changes did not fix that yet.

I see the problem in exit(): this.active is only changed if it is equal to the context being exited. Since this.active is a global, it is probably changed by each async call.

Update:

The problem is that the management uses asyncId as globally unique while it is not. It is provided by newAsyncId(), and that seems not to be globally unique: the sequence of asyncIds is the same in the next ns.runPromise. newAsyncId() seems to return just the async stack depth.

Update2:

I think the use of asyncId as globally unique in cls-hooked is wrong. In the nodejs source I see:

// Internal Embedder API //

// Increment the internal id counter and return the value. Important that the
// counter increment first. Since it's done the same way in
// Environment::new_async_uid()
function newAsyncId() {
  return ++async_id_fields[kAsyncIdCounter];
}

This guarantees no uniqueness; it just creates a simple increasing sequence of numbers.

I assume that asyncId is not supposed to be used as globally unique by node.

@Darkripper214 And this means that asyncIds will collide and corrupt httpContext.get() and httpContext.set().

We have seen that: some deep low-level methods use httpContext.get() and got a value while running outside of any context, where undefined was expected. That's why we even put all our timers and intervals into an ns.runPromise, to be able to set the httpContext values explicitly to null.

So multiple ongoing HTTP requests can corrupt other HTTP requests' httpContext.get().

This is a major issue.

PhakornKiong commented 3 years ago

(PhakornKiong quotes SimonX200's "I don't agree" comment above.)

It is definitely not a memory leak issue: contexts are cleared as long as ns.run() is not overwhelming the process by keeping the call stack filled.

You might want to call it a bug for your use case, which I think doesn't really make sense, since it is blocking the event loop.

(PhakornKiong quotes SimonX200's analysis comment above.)

I will take a look at the cyclic referencing of the map as reported.

But still, the use case for this library is to pass context around while the process is running; why does an auto-incremented asyncId pose an issue?

SimonX200 commented 3 years ago

But still, the use case for this is to enable the passing of context around when the process is running, why is an auto-increment of asyncID pose an issue?

Because it repeats. When debugging the example code from the issue starter and breaking in before(), I see duplicate asyncIds.

In the second ns.runPromise() I see the same asyncIds as in the before() of the first ns.runPromise().

Either the async_id_fields get reset, or something else is happening. I have not yet found the code in the node repo that could cause that.

Update: They are initialized in AsyncWrap::Initialize. Is that called for each new async_hook object? That would mean each async_hook has its own private sequence of 'unique' asyncIds.

PhakornKiong commented 3 years ago

@SimonX200

asyncId doesn't get repeated; what you see inside _contexts is just a list of past contexts kept for lookup. You can log the asyncId in the init, before, and after hooks and see for yourself: it never repeats, but _contexts holds a lot of repeated past contexts. This is the behavior you were complaining about regarding excessive memory usage in an extreme loop. A minimal logging sketch follows.
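A minimal sketch (my own, not from cls-hooked) for logging asyncIds from the hooks:

const async_hooks = require('async_hooks');
const fs = require('fs');

// fs.writeSync is safe inside hooks; console.log would itself create async work
function log(...args) {
  fs.writeSync(1, args.join(' ') + '\n');
}

async_hooks.createHook({
  init(asyncId, type, triggerAsyncId) {
    log('init', asyncId, type, 'trigger:', triggerAsyncId);
  },
  before(asyncId) { log('before', asyncId); },
  after(asyncId) { log('after', asyncId); },
  destroy(asyncId) { log('destroy', asyncId); },
}).enable();

(async () => { await Promise.resolve(); })();
// The printed asyncIds only ever increase; they are never reused.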

init hook inside context.js

if(namespace.active) {
        namespace._contexts.set(asyncId, namespace.active);

This is to save the current context to be retrieved when before hook is called,

before hook inside context.js

currentUid = async_hooks.executionAsyncId(); // this gives you the current execution context
//HACK to work with promises until they are fixed in node > 8.1.1
      context = namespace._contexts.get(asyncId) || namespace._contexts.get(currentUid);
// Here, we either get our specific context from `asyncId` or just take the parent from `currentUid`

So the erratic behavior that is being reported is actually caused by the way this.enter(context) and this.exit(context) are run in the before and after hooks.

If you look at the this.active inside the closure of the namespace, it is supposed to be null after every ns.run() or ns.runPromise().

When it is not cleared, then the context will be passed as a prototype to the new context being created

let context = Object.create(this.active ? this.active : Object.prototype);

The core issue with this bug is actually the feature of ns.run that passes the parent context down to the child if a key doesn't exist in the child context. This can be solved by updating createContext(), but this will be a breaking change for those who rely on this behaviour.

// let context = Object.create(this.active ? this.active : Object.prototype);
let context = {} // this should work

All in all, I think this bug is rather specific; it should only occur when there are multiple ns.run calls in the same lifecycle of a request. But more investigation is needed.

@Jeff-Lewis May I know your opinion on this?

SimonX200 commented 3 years ago

@Darkripper214

My assumption that different running ns.run calls could share the same context was wrong.

To test my assumption I extended the issue starter's code to multiple ongoing ns.run calls. With that, the context is never shared, not even at first.

const httpContext = require('express-http-context');
const FIELD = 'foo';
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
function checkAndSet(val) {
  const gval = httpContext.get(FIELD);
  if (gval !== val) {
    if (gval) {
      console.error('change detected', gval, ' expected ', val);
    }
    httpContext.set(FIELD, val);
  } else {
    console.info('ok', val);
  }
}
async function t1() {
  const val = 'abc0';
  await sleep(500);
  while (true) {
    checkAndSet(val);
    await sleep(1000);
  }
}
async function t2() {
  const val = 'xyz1';
  while (true) {
    checkAndSet(val);
    await sleep(1000);
  }
}

async function tx() {
  httpContext.ns.run(t1);
  httpContext.ns.run(t2);
}

tx();
$ node test_ns2
ok xyz1
ok abc0
ok xyz1
ok abc0
ok xyz1
ok abc0

Now regarding my problem:

To guarantee that our project core functions get proper values from httpContext.get() when executed outside of an HTTP request, we use await httpContext.ns.runPromise(async () => { httpContext.set(...); await actual_function(); }), roughly as sketched below.
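A hypothetical helper mirroring that pattern (the key names and actualFunction are placeholders, not our real project code):

const httpContext = require('express-http-context');

async function runWithContext(values, actualFunction) {
  await httpContext.ns.runPromise(async () => {
    for (const [key, val] of Object.entries(values)) {
      httpContext.set(key, val); // seed the values the core functions will read
    }
    await actualFunction();
  });
}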

We have three 10-second interval timers and other minute-interval timers running. These functions need about 1-2 seconds until their data is returned from the database and processed.

Regarding "All in all, I think this bug is rather specific, it should only occur when there are multiple ns.run in the same lifecycle of a request": I am quite sure that this usage pattern is quite common.

Without the changes I posted above, the heap grows by 24 MBytes per day with only one HTTP request every few minutes.

After 7 hours there are already the mentioned 60k contexts and about 8 MBytes. Not sure where the 60k contexts come from; every 10s for 7h should result in 3x2520 await ns.runPromise() calls.

Update: Just now I found a wrongly configured 10s timer that was actually running every 10ms. That 10ms plus the db delay can result in 60k ns.runPromise() calls in 7h.

Update2: My mistake, it's not 10ms but 25s. So the 60k contexts in 7h are still not explained.

Update3: applicationinsights uses cls-hooked as well. The timers all do db operations, so the HTTP requests to the db are counted by applicationinsights.

Since every new createContext gets slower with the number of previous createContext calls, the whole application gets slower. After 60k createContext calls, each createContext needs 2.2ms (on a 5950x at 4.9GHz, single thread).

This does not only affect the timers but also every ns.run by a http request.

For us, the increase of 24 MBytes per day for nothing is a major issue because our webapp runs continuously as an Azure WebApp with limited memory. We already increased the memory from 1.7G to 3G, doubling the Azure costs.

We have also noticed that our webservice becomes slower and slower the longer it runs.

We investigated by searching for memory leaks in our software and optimizing our code.

The current solution is that the 10-second timers do not use express-http-context at all, and we use just one httpContext.set() with an object that stores the needed data. We reset the references inside that object after every HTTP request and timer, before setting the httpContext key itself to null.

Fortunately the 10-second timers didn't actually need an httpContext. Here we were lucky, and our application is already running much more smoothly and without a memory leak.

The minute timers still have to use httpContext.

Update: I can estimate the current memory loss per day with 3 minute-timers (=1440*3) and no HTTP requests:

Instead of 24 MByte per day, now 589 kByte.

The performance degradation to the mentioned 2.2ms per ns.run (on a 5950x at 4.9GHz, single thread; Azure WebApps run on some old Xeon, so multiply that by 10) is reached after 15 days.

After 30 days each ns.run will need at least 11ms (on a 5950x at 4.9GHz):

 132001, dur=9973ms
 133001, dur=10160ms
 134001, dur=10251ms
 135001, dur=10479ms
 136001, dur=10691ms
 137001, dur=10889ms
 138001, dur=11134ms
 139001, dur=11362ms

Btw: it makes no difference if I call an explicit global.gc() and do sleeps between await ns.runPromise() calls. It's just about the number of createContext calls, and there are no overlapping ns.runPromise() calls. The code is already in a comment above.

I understand now why WebApp instances are usually restarted after a while. In Azure WebApps you have to activate "always on" explicitly.

Update2: The 10ms timer was a 25s timer; it got lost in our project code.

Update3: We are using applicationInsights, which uses cls-hooked to also track the HTTP requests we send. Those are the database queries, so the 140k contexts are reached much earlier than in 30 days. We cannot deactivate applicationInsights as it is essential for our platform monitoring. The memory loss of 24 MByte in at least 15 days may not be critical; the performance drop, however, is absolutely critical. We probably have to restart our WebApp instances once per day.

PhakornKiong commented 3 years ago

@SimonX200

Can you show how your team is doing the following?

To guarantee that our project core functions get proper values from httpContext.get() when execute outside of http request we use the await httpContext.ns.runPromise(async () => { httpContext.set(...); await actual_function(); })

I still don't get how ns.run causes degradation. Do note that it is the developer's responsibility to ensure that no objects reference the context after it is no longer used (so the GC can clear everything). The other way is to come up with a way to clear all contexts within ns.run.

Nonetheless, ns.runPromise definitely has a problem due to the way enter(context) and exit(context) interact with _set.

I'm thinking that _contexts and _set could use an object with keys, and then directly find the right context using the asyncId, executionId and triggerId. That would solve the overgrowth of _contexts in the current implementation, but it would definitely cost some performance.

SimonX200 commented 3 years ago

@Darkripper214

The following minimal code shows the degradation perfectly.

const httpContext = require('express-http-context');

const END = 120000;

function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function t1(l) {
  if (l > 1) {
  }
}

async function tx() {
  let prev_avr_delta = 0;
  let avr_delta_sum = 0;
  let avr_cnt = 0;
  for (let loop = 0; loop < END; loop++) {
    const start = Date.now();
    await httpContext.ns.runPromise(t1);
    const delta = Date.now() - start;
    avr_delta_sum += delta;
    avr_cnt++;

    if (loop % 1000 === 0) {
      const avr_delta = avr_delta_sum / avr_cnt;
      console.log(`loop ${loop}, avr per runPromise: ${avr_delta.toFixed(3)} ms`);
      prev_avr_delta = avr_delta;
      avr_cnt = 0;
      avr_delta_sum = 0;
    }

    if (loop % 100 == 0) {
      await sleep(10);
    }
  }
}

tx();

This is the most basic use of ns.runPromise() with the most basic function (that shows the problem) being called.

You can play around with global.gc() or longer/more sleeps without any change.

In the end a single ns.runPromise() needs 7.8ms for doing nothing on a 5950x.

loop 115000, avr per runPromise: 7.131 ms
loop 116000, avr per runPromise: 7.275 ms
loop 117000, avr per runPromise: 7.405 ms
loop 118000, avr per runPromise: 7.519 ms
loop 119000, avr per runPromise: 7.811 ms

On a Xeon E5-2630v3 a single ns.runPromise() needs 21ms


loop 113000, avr per runPromise: 19.199 ms
loop 114000, avr per runPromise: 19.385 ms
loop 115000, avr per runPromise: 19.815 ms
loop 116000, avr per runPromise: 20.950 ms
loop 117000, avr per runPromise: 20.863 ms
loop 118000, avr per runPromise: 20.632 ms
loop 119000, avr per runPromise: 20.916 ms

Note: The same behavior also occurs when collecting the promises and doing an await Promise.all() every 100 loops.

SimonX200 commented 3 years ago

@Darkripper214 Some notes to the cls-hooked code:

The _set is actually a _stack.

Update: Deleted the stuff regarding async_hook. Everything is fine

PhakornKiong commented 3 years ago

@SimonX200

I still can't reproduce the result on an Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz. Performance-wise, it shouldn't vary that much.

I even removed the sleep(10) and ran with --max-old-space-size=25.

This is the result I got.

loop 16294000, avr per runPromise: 0.008 ms
loop 16295000, avr per runPromise: 0.008 ms
loop 16296000, avr per runPromise: 0.008 ms
loop 16297000, avr per runPromise: 0.008 ms
loop 16298000, avr per runPromise: 0.008 ms
loop 16299000, avr per runPromise: 0.008 ms
loop 16300000, avr per runPromise: 0.039 ms // this pops up every once in a while; GC stuff running?
loop 16301000, avr per runPromise: 0.007 ms
loop 16302000, avr per runPromise: 0.008 ms
loop 16303000, avr per runPromise: 0.006 ms
loop 16304000, avr per runPromise: 0.008 ms
loop 16305000, avr per runPromise: 0.008 ms
loop 16306000, avr per runPromise: 0.007 ms
loop 16307000, avr per runPromise: 0.007 ms
loop 16308000, avr per runPromise: 0.009 ms
loop 16309000, avr per runPromise: 0.008 ms
loop 16310000, avr per runPromise: 0.007 ms

Yeah, I know that _set in the current implementation is a stack. The issue with the implementation is that it relies on the sequence of async_hook events to get to its previous context, and I don't really know when this behavior would be useful.

The implementation is sort of messy, especially when there is any complex async operation (try opening a socket or doing a fetch with async_hooks enabled and you will see). The other way to fix this might be to limit enter(context) and exit(context) to the PROMISE type only, but this would require another data structure to keep track of which asyncIds are Promises.

SimonX200 commented 3 years ago

@Darkripper214 Hmm,

{
  "engines": {
    "node": ">= 12.14.1",
    "npm": ">= 6.13.4"
  },
  "dependencies": {
    "express-http-context": "^1.2.4"
  }
}
npm list
`-- express-http-context@1.2.4
  +-- @types/cls-hooked@4.3.3
  | `-- @types/node@14.14.35
  +-- @types/express@4.17.11
  | +-- @types/body-parser@1.19.0
  | | +-- @types/connect@3.4.34
  | | | `-- @types/node@14.14.35 deduped
  | | `-- @types/node@14.14.35 deduped
  | +-- @types/express-serve-static-core@4.17.19
  | | +-- @types/node@14.14.35 deduped
  | | +-- @types/qs@6.9.6 deduped
  | | `-- @types/range-parser@1.2.3
  | +-- @types/qs@6.9.6
  | `-- @types/serve-static@1.13.9
  |   +-- @types/mime@1.3.2
  |   `-- @types/node@14.14.35 deduped
  `-- cls-hooked@4.2.2
    +-- async-hook-jl@1.7.6
    | `-- stack-chain@1.3.7
    +-- emitter-listener@1.1.2
    | `-- shimmer@1.2.1
    `-- semver@5.7.1

node -v v14.15.5

I see the same behavior with node v14.16.0.

Update: and with node v15.12.0 and with node v12.21.0.

I also ran the Xeon test on a completely new environment with freshly installed node.

Update2: With my proposed changes (replacing Object.create and new Map with plain objects) I see the results you see.

Update3: Installed node 14.16.0 and Notepad++ portable on my private laptop and copy-pasted the code from this thread. Same bad results: after 25000 loops it's at 7ms per await ns.runPromise().

Update4: Ran it on Linux with node v14.16.0 and saw the same results. (npm i; node ns.js)

SimonX200 commented 3 years ago

The only thing that must be changed to fix the performance and memory leak is:

diff --git a/context.js b/context.js
index 1fad2cb..9ad4472 100644
--- a/context.js
+++ b/context.js
@@ -69,7 +69,7 @@ Namespace.prototype.get = function get(key) {

 Namespace.prototype.createContext = function createContext() {
   // Prototype inherit existing context if created a new child context within existing context.
-  let context = Object.create(this.active ? this.active : Object.prototype);
+  let context = { ...this.active };
   context._ns_name = this.name;
   context.id = currentUid;

@@ -473,5 +473,3 @@ function debug2(...args) {
     return fn.constructor.name;
   }
 }*/
-
-

I added a memory test and a performance test to the mocha tests.

However I cannot push the branch to open a pull request.

PhakornKiong commented 3 years ago

@SimonX200 I'm using node v15.0.1 for all the tests. Not sure why there is such a huge divergence in the results.

I assure you that replacing _contexts from a Map with an object did little to improve the memory issue. I've done some tests, and _contexts still shows the same behavior of appending the same context (with a different asyncId but the same executionAsyncId).

As far as I'm concerned, there are two issues with the current implementation:

1) Excessive memory usage from having the same context appended into _contexts under different asyncIds. This is not a severe bug as long as nothing blocks the destroy hook from running, but of course we should improve it if possible.

2) Incorrect behaviour of runPromise, which is explained in my comment above. This is a severe bug and should be fixed ASAP. https://github.com/Jeff-Lewis/cls-hooked/issues/63#issuecomment-803076914

SimonX200 commented 3 years ago

@Darkripper214

I prefer minimal changes to prevent side effects. Also, the performance test runs consistently 100ms (out of 16850ms), or 0.6%, faster with Map than with Object; Map however needs a few percent more memory.

I have cloned the repo and pushed my change into that one. Have a look at the tests memory-leak.test.js and performance-leak.test.js. The memory-leak test requires a max-old-space-size of at most 25M which is set in package.json for npm test.

https://github.com/SimonX200/cls-hooked

Also I have tried node v15.0.1 and I still see the performance problem when running the performance tests against the original code. I use nvm to install and switch node versions, but I have also tried plain node 14 LTS installations, a development-free laptop, and a Linux VM with a fresh node 14.16.

When I enable the console.log in performance-leak.test.js and the context.js from master, I see:

$ node -v
v15.0.1
$ ./node_modules/mocha/bin/mocha test/performance-leak.test.js 

  cls performance leak
1 0 0.02 1.02
2 0 0.02 1.02
3 0 0.03 1.02
4 0 0.03 1.02
5 0 0.02 1.02
6 0 0.03 1.02
7 0 0.03 1.02
8 0 0.03 1.02
9 0 0.04 1.02
10 0 0.05 1.02
11 0 0.04 1.02
12 0 0.05 1.02
13 0 0.04 1.02
14 0 0.05 1.02
15 0 0.07 1.02
16 0 0.05 1.02
17 0 0.05 1.02
18 0 0.06 1.02
19 0 0.07 1.02
20 0 0.07 1.02

..
..
..

346 11 1.24 1.02
347 11 1.25 1.02
348 11 1.25 1.02
349 12 1.27 1.02
350 12 1.27 1.02
351 13 1.28 1.02
352 13 1.27 1.02
353 13 1.28 1.02
354 14 1.29 1.02
355 14 1.29 1.02
    1) Execute ns.runPromise

  0 passing (24s)
  1 failing

  1) cls performance leak
       Execute ns.runPromise:

      AssertionError: expected 15 to be below 15
      + expected - actual

      at Context.<anonymous> (test\performance-leak.test.js:47:34)

I see the same in the Linux VM.

BTW: I just ran the performance test in a WSL2 Debian Linux on Windows, and there the performance test with my code completes within 2500ms (for 100000 ns.runPromise calls). On Windows on the same PC it needs 16800ms. The Object is still consistently 100ms slower than the Map (2.6s against 2.5s); looks like a fixed time offset.

Update:

When changing my t2 function from

async function t2(l) {
  if (l > 1) {
  }
}

to

async function t2(l) {
  // if (l > 1) {
  //}
}

or

async function t2(l) {
  ns.set('a', 1);
  // if (l > 1) {
  //}
}

Then there is no performance problem.

I don't know why a used but undefined argument has this effect, but this is consistent as well.

Update2:

It always happens when the function called through ns.runPromise has an optional argument which is used somewhere.

async function t3(l) {
  l++;
}

async function t2(l = 0) {
  await t3(l);
}

Most likely something is traversing the 10000 links from the current context back to the first one that were established by Object.create(this.active).


Interesting that the id difference is always 4.

Update3:

OK, it's clear: it's because the context is put onto the argument stack, so it becomes "l". And since the context is degenerated by its tens of thousands of proto links, any use of it is a killer. So any use of the context, via the argument or via namespace.active, kills the performance.

async function t2() {
  if (ns.active > 1) {
  }
}

This is the case because node must determine whether ns.active has a toNumber() method (or similar).

And because the current context is linked to almost all previous contexts, it takes some time to traverse all the proto links.

Interesting: With a function

async function t2() {
  const l=1;
}

The distance between the contexts is erratic.

And with my fix the context is still copied from the currently active context (btw: still a shallow copy, as before) but without the proto link to the previous context object.


And that is also the reason why the memory leak is fixed: the current context retains all previous contexts.

The _contexts map however shrinks and grows normally also with the original code. So the map was not the cause of the leak.

Update4:

The deep linking of the current active context to most of its predecessors is also a problem for a ns.get('some-unset-key') because Object.create() does not copy the attributes but links the objects. At runtime this means that a property is searched through the whole chain.

An httpContext.get('optional-value') will take 20ms on a Xeon after e.g. 100k ns.runPromise() calls (about 25k contexts are linked):

async function t2() {
  const x = ns.get('something-not-set');
}

Fortunately we already set all our httpContext variables upfront (and reset them at the request end), and now with just one ns.set().

And since ns.active is a global, it is random which context is copied when there are multiple active contexts in use during the asynchronous processing of e.g. multiple HTTP requests.

PhakornKiong commented 3 years ago

@SimonX200 You're right about the traversal of __proto__. This issue can easily be solved by

replacing let context = Object.create(this.active ? this.active : Object.prototype); with let context = { ...this.active }.

This would make the context flatter and solve the case where deep traversal hurts performance.

But would this solve the memory issue you're facing in your project?

SimonX200 commented 3 years ago

@Darkripper214

But would this solve the memory issue you're facing in your project?

Yes, it does.

The memory-leak.test.js (executed with --max-old-space-size=20) shows that too.

Because ns.active is random when not inside an async function (which is protected by async_hooks), the sync ns.run() in the express-http-context middleware could also increase the depth of the contexts.

BTW: The soft linking of the previous context to the current one means that changes of the outer context after the ns.run() will be visible in the inner context. Is that a feature?

SimonX200 commented 3 years ago

Also, ns.runPromise always leaves ns.active != null:

      for (let j = 0; j < INNER_LOOPS; j++) {
        await ns.runPromise(t2);
        while (ns.active) {
          await sleep(10);
        }
      }

The while loop never ends. This happens already after the first completed await ns.runPromise().

SimonX200 commented 3 years ago

For our project we have decided to switch to plain AsyncLocalStorage.

It provides everything we need and should be stable. I will adapt my performance and memory leak tests to it to see if that is really the case.

Once I have verified that it is stable, I will suggest that express-http-context switch to AsyncLocalStorage, as express-http-context is what most web servers currently use.

PhakornKiong commented 3 years ago

@SimonX200

You can take a look at this https://github.com/Darkripper214/ALS-Context

My aim is to implement a context library whose implementation is consistent with AsyncLocalStorage. You can take a look at the test file.

I'm planning to add a different flavor of cls-hooked that is more in line with AsyncLocalStorage, with all the bindEmitter stuff needed for AsyncLocalStorage to work with EventEmitters.

SimonX200 commented 3 years ago

@Darkripper214

While investigating more into AsyncLocalStorage I came across this:

https://github.com/nodejs/node/issues/34493

We too have countless async functions that are not really waiting.

I found a simple and elegant way to keep a local storage using the V8 call stack trace, by constructing a function inside run() with the context id encoded into its name. No async_hooks are involved, so only the performance of set()/get() is impacted, and that does not affect overall project performance.

Very few lines of code, and it works out of the box.

The only theoretical problem is dangling promises still running after the outer run() has finished. They will lose access to the context once the outermost run() has finished, as it deletes the context from the store (a plain Object).
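A rough sketch of that idea as I understand it (my own reconstruction, not the actual project code; names like __ctx_N__ are invented):

const store = {};
let nextId = 0;

function run(fn) {
  const id = nextId++;
  store[id] = {};
  // Build a wrapper whose function name encodes the context id, so the id
  // shows up in V8 stack traces taken while fn and its callees run.
  const name = `__ctx_${id}__`;
  const wrapper = { [name]: async () => fn() }[name];
  return wrapper().finally(() => { delete store[id]; });
}

function currentContext() {
  const match = new Error().stack.match(/__ctx_(\d+)__/);
  return match ? store[match[1]] : undefined;
}

function set(key, value) {
  const ctx = currentContext();
  if (ctx) ctx[key] = value;
}

function get(key) {
  const ctx = currentContext();
  return ctx && ctx[key];
}

Whether the encoded name is still visible after an await depends on V8's async stack traces, so treat this strictly as a sketch of the approach.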

ridakk commented 3 years ago

Any update on this issue, or is this repo still maintained?

SimonX200 commented 3 years ago

Seems to be not maintained anymore.

We switched to native AsyncLocalStorage.
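For reference, a minimal AsyncLocalStorage equivalent of the original repro (my sketch, not our actual project code):

const { AsyncLocalStorage } = require('async_hooks');
const als = new AsyncLocalStorage();

async function main() {
  // Each run() gets its own store; nothing carries over between runs.
  await als.run(new Map(), async () => {
    als.getStore().set('foo', 'bar');
    console.log('Value', als.getStore().get('foo')); // 'bar'
  });
  await als.run(new Map(), async () => {
    console.log('Value', als.getStore().get('foo')); // undefined
  });
}

main();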


Recently I was considering opening an issue on express-http-context to notify the community about the memory leak in web servers that are not restarted periodically, and about the wrong context reads in generic project functions outside of HTTP requests.

Both are caused by this issue.

oleg-codaio commented 3 years ago

I'm running into this issue too with nested runPromise calls - it seems like it's not exiting the context as expected. The "workaround" might be to just use separate namespaces.
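A hedged sketch of the separate-namespaces workaround (the namespace names are made up for illustration):

const cls = require('cls-hooked');
const nsOuter = cls.createNamespace('outer');
const nsInner = cls.createNamespace('inner');

async function main() {
  await nsOuter.runPromise(async () => {
    nsOuter.set('foo', 'bar');
    await nsInner.runPromise(async () => {
      // Namespaces don't share state, so values set in nsOuter
      // don't bleed into nsInner:
      console.log(nsInner.get('foo')); // undefined
    });
  });
}

main();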

mingo023 commented 2 years ago

I just came across this issue, too. I changed to using runAndReturn instead, and that seems to have fixed it for me. It should be a literal swap, since it will return the promise of your async methods.

It works, thank you a lot!