Open blimmer opened 4 years ago
For context, here is a bit of history behind this issue since it's been problematic for a while now.
Historically, context propagation APIs for Node have been using monkey patching on asynchronous code to track the asynchronous context. This approach has been used successfully by APM vendors for many years, even though it didn't support native code and could lose the context in some cases.
When async/await
landed, monkey patching was no longer an option so everyone had to switch to async_hooks
which is the official construct provided by Node to track the asynchronous context. Even though the module is still experimental to this day, it's the only way currently to do this. This is why it's always been the default implementation used by our scope manager in versions of Node that support it.
The main issue with async_hooks
is that it's very heavy when used with promises. This is because the promise hooks exposed by v8 require a jump from JavaScript to native and back for every promise. There has been work done in the past few months to make async_hooks
significantly lighter when a destroy
hook is not used, for example when using AsyncLocalStorage
, but there is still a fair bit of overhead when a lot of promises are created in quick successions. This means that even though we have a new scope manager that no longer relies on the destroy
hook, we still incur this overhead.
Since the overhead comes from Node and not from the tracer, and we have to use this feature to support automatic context propagation, this is unfortunately an overhead we can't realistically avoid at the tracer level. It's also worth noting that in most cases, the overhead is low enough that it's not problematic.
In order to fix this completely, we'll have to fix async_hooks
in Node itself, and also the corresponding promise hooks in v8. This is not small task and will take a while, but we'll update this issue as progress is done.
Progress on the Node side can be tracked in https://github.com/nodejs/node/pull/36394
In Node.js v16.2.0, significant performance improvements in async_hooks
are present, so if you can test with this version of Node.js, hopefully that alleviates a lot of the overhead.
@bengl I dont think there will be any improvement, even worse, I assume there will be some degradation according to this PR https://github.com/nodejs/node/pull/36394 when using destroy hook (you are using it), there is some small degradation in performance
Before:
$ ./node-master benchmark/async_hooks/promises.js
async_hooks/promises.js asyncHooks="enabled" n=1000000: 348,189.5714361237
async_hooks/promises.js asyncHooks="enabledWithDestroy" n=1000000: 212,124.80165853762
async_hooks/promises.js asyncHooks="enabledWithInitOnly" n=1000000: 391,763.36348583025
async_hooks/promises.js asyncHooks="disabled" n=1000000: 1,670,456.5232560865
After:
$ ./node benchmark/async_hooks/promises.js
async_hooks/promises.js asyncHooks="enabled" n=1000000: 1,153,166.7820469614
async_hooks/promises.js asyncHooks="enabledWithDestroy" n=1000000: 207,516.71584447258
async_hooks/promises.js asyncHooks="enabledWithInitOnly" n=1000000: 1,043,624.1457311497
async_hooks/promises.js asyncHooks="disabled" n=1000000: 1,603,204.3180576847
Here u set the destroy hook https://github.com/DataDog/dd-trace-js/blob/master/packages/dd-trace/src/scope/async_hooks.js#L37
Here u set the destroy hook
The async_hooks
scope manager is only used in old versions of Node. Any recent versions such as latest 12.x, 14.x or 16.x use instead the async_resource
scope manager which only uses init.
Fwiw today we tried bumping to node 16.2.0 to get this perf fix, and are seeing errors in our app that does this ~admittedly somewhat esoteric pattern with async_hooks
:
import { AsyncLocalStorage } from "async_hooks";
...
export const currentFlushSecret = new AsyncLocalStorage<{ flushSecret: number }>();
....
currentFlushSecret.run({ flushSecret: this.flushSecret }, async () => {
// ...call various methods that do...
const { flushSecret } = currentFlushSecret.getStore() || {};
// ...and ensure flushSecret is the expected value
if (flushSecret !== this.flushSecret) {
throw new Error("invalid usage detected");
}
});
Where the idea is that only code that is specifically ran within the run
callback is allowed to touch/mutate the this
/ EntityManager
instance, and all others mutations will be rejected. Basically we're using it as a hacky thread-local-ish mutex/lock.
Previously this "invalid usage detected" wasn't hit in our application's code, but it is being hit now in our test suite, either insinuating that a) our application code had a bug that node 16.1.0 wasn't catching because this AsyncLocalStorage
was propagating a flushSecret
when it wasn't supposed to, or b) node 16.2.0 has a regression where AsyncLocalStorage
is not propagating flushSecret
(to one of the methods invoked directly in, or transitively from, the run
callback) when it is supposed to.
The ^ code is part of an open source project:
https://github.com/stephenh/joist-ts/blob/main/packages/orm/src/EntityManager.ts#L202
Although our currently failing tests are in an internal project. If we run the joist-ts public tests on node 16.2.0, they work just fine (and we do have tests that specifically exercise the flush secret behavior), so we don't have a repro yet.
We're going to work more tomorrow on getting a repro done, in the public joist project, and verifying whether it is truly a regression in 16.2.0 or else just a bug/misuse of AsyncLocalStorage
(in this joist project) or a big/misuse in our application itself (that is now getting correctly detected).
So, I dunno, this is not a super-useful update yet, but mostly wanted to post as an FYI in case anyone on the datadog/nodejs side of things would look at the AsyncLocalStorage
usage and immediately intuit "oh yes we're doing something wrong" or "oh shoot something in the node 16.2.0 change over looked this".
cc @Qard who would have the most insight to determine at a glance if there could be a regression causing this issue.
Okay @rochdev and @Qard , we believe we have a reproduction in this PR:
https://github.com/stephenh/joist-ts/pull/122/commits
Notice that it passes on 16.1.0:
But fails on 16.2.0:
A few notes:
--detectOpenHandles
to jest--inspect-brk
(i.e. debugging the issue in webstorm/etc) makes the issue go away and the tests passAuthor.test.ts
assumes you run the make db
in the packages/integration-test
folder to get the postgres db setup and running. Also a yarn
+ yarn build -w
in the top-level directly is useful. Specifically we log the intent to do an AsyncLocalStorage.run
:
console.log(`Starting new AsyncLocalStorage.run with flushSecret=${this.flushSecret}`);
On the very first line within that run
callback, we call getStore()
to look up our local storage value:
console.log(`Inside AsyncLocalStorage.run, flushSecret=${b}`);
And this works, it returns 1
. We then do a few await ...helper methods...
and then, within the same run
callback that did the "Inside" log, do another console.log
:
console.log(`Inside AsyncLocalStorage.run, flushSecret=${a}`);
And this one is broken, we get back undefined
instead of 1
.
So, at some point between line 715 and line 731, we have lost track of the AsyncLocalStorage
async resource.
Happy to chat more, but does this give enough direction to reproduce on your side? Apologies that this project doesn't have the best "Getting Started"/etc documentation, so let me know if there are any gotchas in getting in running.
(Also, technically this is likely an issue in node itself, as this joist open source project doesn't use ddtrace, we just happened to notice the issue while getting our internal project that uses both ddtrace and joist onto node 16.2.0, to leverage the ddtrace perf wins in this ticket. Would you like us open a new issue against nodejs directly? Happy to let you guys do that. Disclaimer assuming we're not doing something dumb on our side.)
Thanks!
@stephenh I'm able to reproduce with the project you've shared, but without being able to debug and with a codebase I don't know it's difficult to try to find the issue. Do you think you'd be able to extract only a small part of the code that would still reproduce the issue? Or maybe find the exact line where the context is lost? I'm not super familiar with Jest either which definitely doesn't help 😅
@rochdev that's a fair ask! I think we were so pleased with ourselves to having isolated the repro that we didn't push farther to make it as simple as possible. :-)
It is terribly ironic you mention jest, because that might be the issue...
I've created a minimal ~10 lines of code repro here:
https://github.com/stephenh/async-local-storage-repro
I'll defer to the readme for steps, but basically if you run code that does: 1) access AsyncLocalStorage.getStore
2) await
a promise, 3) access AsyncLocalStorage.getStore
again, on node 16.2.0 and jest, the getStore
value is lost and comes back as undefined
.
The matrix is:
Possible cause of the new issue since 16.2.0: https://github.com/nodejs/node/issues/38781
It looks like a fix should land in Node 16.3.0.
Closing as this seems to be resolved. If you disagree or there's anything I missed, let me know and I can reopen.
Sadly, I think this is still a problem with async_hooks
in general. I tested a particularly slow/complex query on a client's backend in a few various scenarios (see also #2385). Node 18.13.0 in docker on MacOS.
JIT enabled? | dd-trace enabled? | Average response time over 50 requests |
---|---|---|
√ | X | 4064.92 ms |
X | X | 4863.24 ms |
√ | √ | 5898.32 ms |
X | √ | 8325.02 ms |
When JIT is enabled, it significantly reduces the number of spans in our traces. As you can see from the results, enabling dd-trace
makes our requests significantly slower.
I was talking with @rochdev and he had me add a no-op async_hook
to try to separate dd-trace
from general async_hook
overhead. The hook looked like this:
import { createHook } from "node:async_hooks";
createHook({ init() {}, before() {}, after() {}, destroy() {} }).enable();
Just having an async_hook
without dd-trace
set up increased the average response time from 4064.92 ms
to 7994.8 ms
. So, even though dd-trace
is adding some additional overhead, the vast majority of the problems comes from async_hooks
.
It seems like maybe the async_hooks
approach might be a dud, since they're so unperformant?
Yep, for exactly this reason there's been a bunch of discussion in Node.js core about a more performant replacement for AsyncLocalStorage which does not use async_hooks internally. There's unfortunately no alternative currently, so your option is accept some overhead or have no observability. It's a lot less overhead than it used to be, but can still be noticeable depending on how heavily an application uses promises. I'm hopeful that we'll have something much better in future Node.js versions though. :)
Discussion about the above can be found in https://github.com/nodejs/node/issues/46265
Describe the bug
We noticed a significant increase in response times for GraphQL queries that produce large payloads after we enabled
dd-trace
for our project.Here's what we've seen:
scope
Settingdd-trace
dd-trace
async_hooks
async_local_storage
async_local_storage
async_local_storage
noop
async_resource
branchasync_resource
const { createHook } = require('async_hooks'); createHook({ init() {} }).enable();
As you can see, it looks like
async_hooks
are the primary reason we see the slowdown, and not necessarily anything specific todd-trace
.The logic that triggers this especially pronounced behavior uses the following:
async/await
vs.Promise
The query itself produces many promises (I'm unsure of the exact number, but probably in the thousands), which we believe is exacerbating the problem.
I was working synchronously with @rochdev and @stephenh on this issue via the Datadog Slack. He mentioned that he'll add some additional detail to this issue.
Environment
node:14.8.0
image running on macOS host