Closed felixbarny closed 2 years ago
- Can we do this in a minor?
Lacking some strict process for deciding, my vote would be "yes". I would be pretty surprised if the change broke a user. They might be surprised that some spans are no longer showing a stacktrace in APM UI, but not broken. The performance benefit for Node.js apps with any significant load will be clear.
we are still doing
Error.captureStackTrace(obj, Span)
on every span creation
Could you do that at the end of a span, after the duration is known? Or would stack traces that are taken on span end be useless because of the async nature of Node.js?
Could you do that at the end of a span, after the duration is known? Or would stack traces that are taken on span end be useless because of the async nature of Node.js?
The stack traces would often be useless. For a small Redis example script ("trace-redis.js") that does a GET key1
:
const apm = require('elastic-apm-node').start({
serviceName: 'trace-redis'
})
const redis = require('redis')
const client = redis.createClient()
const t1 = apm.startTransaction('t1')
client.get('key1', function (err, reply) {
console.log('GET key1: %s', err ? `${err.name}: ${err.message}` : reply)
t1.end()
client.quit()
})
Here is the stack at Span start, and at Span end:
% node trace-redis.js
XXX stack at Span start Error: here
at new Span (/Users/trentm/tmp/myapp/node_modules/elastic-apm-node/lib/instrumentation/span.js:46:43)
at Transaction.createSpan (/Users/trentm/tmp/myapp/node_modules/elastic-apm-node/lib/instrumentation/transaction.js:146:10)
at Instrumentation.createSpan (/Users/trentm/tmp/myapp/node_modules/elastic-apm-node/lib/instrumentation/index.js:435:27)
at RedisClient.wrappedInternalSendCommand [as internal_send_command] (/Users/trentm/tmp/myapp/node_modules/elastic-apm-node/lib/instrumentation/modules/redis.js:73:35)
at RedisClient.get (/Users/trentm/tmp/myapp/node_modules/redis/lib/commands.js:46:25)
at Object.<anonymous> (/Users/trentm/tmp/myapp/trace-redis.js:8:8)
at Module._compile (internal/modules/cjs/loader.js:999:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:1027:10)
at Module.load (internal/modules/cjs/loader.js:863:32)
at Function.Module._load (internal/modules/cjs/loader.js:708:14)
at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:60:12)
at internal/main/run_main_module.js:17:47
XXX stack at Span#end Error: here
at Span.end (/Users/trentm/tmp/myapp/node_modules/elastic-apm-node/lib/instrumentation/span.js:72:41)
at Command.<anonymous> (/Users/trentm/tmp/myapp/node_modules/elastic-apm-node/lib/instrumentation/modules/redis.js:49:12)
at /Users/trentm/tmp/myapp/node_modules/elastic-apm-node/lib/instrumentation/run-context/AbstractRunContextManager.js:76:49
at AsyncHooksRunContextManager.with (/Users/trentm/tmp/myapp/node_modules/elastic-apm-node/lib/instrumentation/run-context/BasicRunContextManager.js:49:17)
at Command.wrapper [as callback] (/Users/trentm/tmp/myapp/node_modules/elastic-apm-node/lib/instrumentation/run-context/AbstractRunContextManager.js:76:23)
at normal_reply (/Users/trentm/tmp/myapp/node_modules/redis/index.js:655:21)
at RedisClient.return_reply (/Users/trentm/tmp/myapp/node_modules/redis/index.js:753:9)
at JavascriptRedisParser.returnReply (/Users/trentm/tmp/myapp/node_modules/redis/index.js:138:18)
at JavascriptRedisParser.execute (/Users/trentm/tmp/myapp/node_modules/redis-parser/lib/parser.js:544:14)
at Socket.<anonymous> (/Users/trentm/tmp/myapp/node_modules/redis/index.js:219:27)
at Socket.emit (events.js:314:20)
at addChunk (_stream_readable.js:297:12)
at readableAddChunk (_stream_readable.js:272:9)
at Socket.Readable.push (_stream_readable.js:213:10)
at TCP.onStreamRead (internal/stream_base_commons.js:188:23)
at TCP.callbackTrampoline (internal/async_hooks.js:126:14)
GET key1: val1
The user's "trace-redis.js" script doesn't appear in the stack for Span.end at all, so it becomes hard/impossible for them to track down the source code location that made the redis client call.
I'll create a ticket (https://github.com/elastic/apm-agent-nodejs/issues/2092) for the node.js agent to:
Error.captureStackTrace(obj, Span)
when captureSpanStackTraces
is enabled.captureSpanStackTraces
being false by default (possibly in the next major rev of the agent) if the perf impact is at all significant. (Paying CPU for span.stacktrace that one very rarely sees/uses is hard to justify).Would/will it be a pain (i.e. breaking any consistency goals) if the Node.js APM agent is the only one that requires a user to change the config to ever capture a span.stacktrace
?
Paying CPU for span.stacktrace that one very rarely sees/uses is hard to justify
Agree
IIRC, the .NET agent might also sometimes take stack trace snapshots at the beginning of a span, to make async tracing more useful. @gregkalapos, is that correct?
I guess there may be similar concerns for the Java agent. It only ever captures stack traces at the end of a span. But if the user's app is reactive/async, the stack traces may not be useful. But as most Java apps are synchronous, it's not as big of an issue as in Node.js.
IIRC, the .NET agent might also sometimes take stack trace snapshots at the beginning of a span, to make async tracing more useful. @gregkalapos, is that correct?
Yes, that's correct. One example for this is outgoing HTTP requests.
Would/will it be a pain (i.e. breaking any consistency goals) if the Node.js APM agent is the only one that requires a user to change the config to ever capture a span.stacktrace?
Looks like the same also applies to .NET.
So either we consistently set the default to -1 (never capture stack trace) for all agents, or we set it to -1 only for agents that would need to capture the stack at the start of the span.
In the UI, when the stack trace tab is empty, we could link to central config so that users who want stack traces can customize span_stack_trace_min_duration
.
Opinions?
In the UI, when the stack trace tab is empty, we could link to central config so that users who want stack traces can customize span_stack_trace_min_duration.
For OTel agents we need to hide the stack trace tab completely, I guess. So we would need to differentiate OTel agents from our own agents for the behavior of this tab.
Would/will it be a pain (i.e. breaking any consistency goals) if the Node.js APM agent is the only one that requires a user to change the config to ever capture a span.stacktrace?
Looks like the same also applies to .NET.
So either we consistently set the default to -1 (never capture stack trace) for all agents, or we set it to -1 only for agents that would need to capture the stack at the start of the span. In the UI, when the stack trace tab is empty, we could link to central config so that users who want stack traces can customize
span_stack_trace_min_duration
.Opinions?
I can't tell if "off by default" would be a bad thing from a user's point of view, but I know that not capturing call stacks would have a very positive impact on performance for .NET. So from this reason, -1 as default sounds good to me at least for .NET and Node.js. I can imagine for other agents this is a non-issue and a higher default (e.g. 500ms as suggested here) would be ok.
Only speaking for .NET: On the other hand, we don't seem to have complains for this, so that makes the decision harder. Usually I tend to go with an approach which causes the least questions and issues for users. So with that, although I think retrospectively -1
as default would be better, I'd still stick with the "on by default", just to not confuse people (the idea with the UI help sounds good, still it could be a hassle for users since it was on be default in all prior versions). So at the end, defaulting to 500ms
sounds good - we know that for .NET in case of async that won't help much, but so be it. And then if we have users complaining (which we haven't had in years) only then would I change the default to -1.
For Node.js, we did get complains and that setting is a major reason why some users found it hard to adopt the agent due to the performance impact. Therefore, our recommended default configuration is to disable the feature for Node.js.
What's different in .NET is that the stack trace is not captured in the beginning for all spans. Also, stack trace collection may be not as expensive as in Node.js. However, if we measured the impact for .NET, I would not be surprised if stack trace collection is still a major part of the overhead the agent creates.
Having said that, I think it's fine if we start with -1ms
as a default for Node.js and 500ms
for others.
Per the discussion in the agents weekly today, I think python will probably retain the old default of 5ms
and make this setting more prominent in the troubleshooting/tuning docs in case users want to reduce overhead. I think it's a very useful feature and we don't have many (any?) performance complaints at the moment. /cc @beniwohli
I'll wait a few days for more input but for now it seems we can close this and just change the default to -1ms
in the Node.js agent in an upcoming minor version. Other agents may choose to change their defaults, too, if they see a need to tune their defaults for better performance. It's ok that different agents have different defaults.
close this and just change the default to -1ms in the Node.js agent in an upcoming minor versio
Sounds good to me. We also have to first add the span_stack_trace_min_duration
config var (deprecating the existing captureSpanStackTraces
and spanFramesMinDuration
config vars) -- we haven't yet done https://github.com/elastic/apm-agent-nodejs/issues/2381 :).
Open questions
Related
CODEOWNERS
)