elastic / apm-agent-nodejs

https://www.elastic.co/guide/en/apm/agent/nodejs/current/index.html
BSD 2-Clause "Simplified" License
576 stars 224 forks source link

Investigation about async-hooks #1697

Open SimonSimCity opened 4 years ago

SimonSimCity commented 4 years ago

In a script of mine, where I heavily use use callbacks, seems to get an extremely high spike of calls of the function destroy in the async-hooks file. https://github.com/elastic/apm-agent-nodejs/blob/master/lib/instrumentation/async-hooks.js#L103

Screen Shot 2020-03-30 at 10 25 04 AM Screen Shot 2020-03-30 at 12 58 01 PM

All the green spans are for internal/async_hooks.js, and the blue below is the mentioned destroy function.

For me it's now a question if we should not remove this async monitoring all together. It uses a lot of computation power which could be used for the actual task.

SimonSimCity commented 4 years ago

Just FYI, I think it's extremely important for this issue to have a deep understanding of the event loop:

The callbacks, set by async-hooks are called as process.nextTick() - so they are executed immediately after the callback requesting it finished execution - according to the last slide on the youtube video and the description about async-hooks (https://nodejs.org/api/async_hooks.html#async_hooks_asynchronous_context_example).

AndrewGrachov commented 4 years ago

Can confirm this cause problems on our production too, about 30% of process CPU time

Screenshot

I suggest this code

      if (asyncIds) {
        const index = asyncIds.indexOf(asyncId)
        asyncIds.splice(index, 1)
      }

has O(N) complexity and slows things down

trentm commented 3 years ago

I realize this is a year later, apologies for that.

Here is a first (possibly) naive patch to the agent that changes to 0(N) complexity to O(1):

diff --git a/lib/instrumentation/async-hooks.js b/lib/instrumentation/async-hooks.js
index 1dd168ff..df2a1b0a 100644
--- a/lib/instrumentation/async-hooks.js
+++ b/lib/instrumentation/async-hooks.js
@@ -77,12 +77,12 @@ module.exports = function (ins) {
     activeTransactions.set(asyncId, transaction)

     // Track the context by the transaction
-    let asyncIds = contexts.get(transaction)
+    const asyncIds = contexts.get(transaction)
     if (!asyncIds) {
-      asyncIds = []
-      contexts.set(transaction, asyncIds)
+      contexts.set(transaction, new Set([asyncId]))
+    } else {
+      asyncIds.add(asyncId)
     }
-    asyncIds.push(asyncId)

     const span = ins.bindingSpan || ins.activeSpan
     if (span) activeSpans.set(asyncId, span)
@@ -103,12 +103,10 @@ module.exports = function (ins) {
   function destroy (asyncId) {
     const span = activeSpans.get(asyncId)
     const transaction = span ? span.transaction : activeTransactions.get(asyncId)
-
     if (transaction) {
       const asyncIds = contexts.get(transaction)
       if (asyncIds) {
-        const index = asyncIds.indexOf(asyncId)
-        asyncIds.splice(index, 1)
+        asyncIds.delete(asyncId)
       }
     }

However, in my testing (load testing a demo app and looking at a flamegraph), my changes don't make an effective change in performance on what I'd imagine is a typical application.

To try to create an extreme case, I have a demo express app with an endpoint handler like this:

function deep_callstack(req, res, next) { // my handler
    asyncRecurseN(1000, function () {
      setTimeout(function () {
        res.send('deep callstack done')
      }, 1)
    })
  }

 function asyncRecurseN(n, cb) {
  if (n <= 0) {
    process.nextTick(cb)
    return
  } else {
    process.nextTick(asyncRecurseN, n-1, cb)
  }
} 

Putting that under load gen and getting a flamegraph shows that I can reduce the amount of time that "destroy" async_hook method is top-of-stack. However, even in this extreme case (over 1000 async ops in the context of this one transaction with no spans) the overall app improvement was only on the order of a few percentage points.

If anyone still has this issue, or data on what condition triggered this issue (millions of async ops for a single APM transaction or span??), I'd be very interested to hear.