Increase span_stack_trace_min_duration

felixbarny commented 2 years ago

Errors are unaffected in this proposal
Performance impact too high as-is
Bad cost/benefit ratio
Most other tracers never capture stack traces

Open questions

Can we do this in a minor?

https://github.com/elastic/kibana/issues/68757

[x] Create PR as draft
[ ] Approval by at least one other agent
[ ] Mark as Ready for Review (automatically requests reviews from all agents and PM via CODEOWNERS)
- Remove PM from reviewers if impact on product is negligible
- Remove agents from reviewers if the change is not relevant for them
[ ] Merge after 2 business days passed without objections

apmmachine commented 2 years ago

:green_heart: Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS

Expand to view the summary

#### Build stats * Reason: `null` * Start Time: 2022-01-28T04:20:00.774+0000 * Duration: 7 min 25 sec * Commit: 686f052b9f0dbeb710c903f49eac8b33a1ace442

trentm commented 2 years ago

Can we do this in a minor?

Lacking some strict process for deciding, my vote would be "yes". I would be pretty surprised if the change broke a user. They might be surprised that some spans are no longer showing a stacktrace in APM UI, but not broken. The performance benefit for Node.js apps with any significant load will be clear.

felixbarny commented 2 years ago

we are still doing Error.captureStackTrace(obj, Span) on every span creation

Could you do that at the end of a span, after the duration is known? Or would stack traces that are taken on span end be useless because of the async nature of Node.js?

trentm commented 2 years ago

Could you do that at the end of a span, after the duration is known? Or would stack traces that are taken on span end be useless because of the async nature of Node.js?

The stack traces would often be useless. For a small Redis example script ("trace-redis.js") that does a GET key1:

const apm = require('elastic-apm-node').start({
  serviceName: 'trace-redis'
})
const redis = require('redis')
const client = redis.createClient()

const t1 = apm.startTransaction('t1')
client.get('key1', function (err, reply) {
  console.log('GET key1: %s', err ? `${err.name}: ${err.message}` : reply)
  t1.end()
  client.quit()
})

Here is the stack at Span start, and at Span end:

% node trace-redis.js
XXX stack at Span start Error: here
    at new Span (/Users/trentm/tmp/myapp/node_modules/elastic-apm-node/lib/instrumentation/span.js:46:43)
    at Transaction.createSpan (/Users/trentm/tmp/myapp/node_modules/elastic-apm-node/lib/instrumentation/transaction.js:146:10)
    at Instrumentation.createSpan (/Users/trentm/tmp/myapp/node_modules/elastic-apm-node/lib/instrumentation/index.js:435:27)
    at RedisClient.wrappedInternalSendCommand [as internal_send_command] (/Users/trentm/tmp/myapp/node_modules/elastic-apm-node/lib/instrumentation/modules/redis.js:73:35)
    at RedisClient.get (/Users/trentm/tmp/myapp/node_modules/redis/lib/commands.js:46:25)
    at Object.<anonymous> (/Users/trentm/tmp/myapp/trace-redis.js:8:8)
    at Module._compile (internal/modules/cjs/loader.js:999:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1027:10)
    at Module.load (internal/modules/cjs/loader.js:863:32)
    at Function.Module._load (internal/modules/cjs/loader.js:708:14)
    at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:60:12)
    at internal/main/run_main_module.js:17:47
XXX stack at Span#end Error: here
    at Span.end (/Users/trentm/tmp/myapp/node_modules/elastic-apm-node/lib/instrumentation/span.js:72:41)
    at Command.<anonymous> (/Users/trentm/tmp/myapp/node_modules/elastic-apm-node/lib/instrumentation/modules/redis.js:49:12)
    at /Users/trentm/tmp/myapp/node_modules/elastic-apm-node/lib/instrumentation/run-context/AbstractRunContextManager.js:76:49
    at AsyncHooksRunContextManager.with (/Users/trentm/tmp/myapp/node_modules/elastic-apm-node/lib/instrumentation/run-context/BasicRunContextManager.js:49:17)
    at Command.wrapper [as callback] (/Users/trentm/tmp/myapp/node_modules/elastic-apm-node/lib/instrumentation/run-context/AbstractRunContextManager.js:76:23)
    at normal_reply (/Users/trentm/tmp/myapp/node_modules/redis/index.js:655:21)
    at RedisClient.return_reply (/Users/trentm/tmp/myapp/node_modules/redis/index.js:753:9)
    at JavascriptRedisParser.returnReply (/Users/trentm/tmp/myapp/node_modules/redis/index.js:138:18)
    at JavascriptRedisParser.execute (/Users/trentm/tmp/myapp/node_modules/redis-parser/lib/parser.js:544:14)
    at Socket.<anonymous> (/Users/trentm/tmp/myapp/node_modules/redis/index.js:219:27)
    at Socket.emit (events.js:314:20)
    at addChunk (_stream_readable.js:297:12)
    at readableAddChunk (_stream_readable.js:272:9)
    at Socket.Readable.push (_stream_readable.js:213:10)
    at TCP.onStreamRead (internal/stream_base_commons.js:188:23)
    at TCP.callbackTrampoline (internal/async_hooks.js:126:14)
GET key1: val1

The user's "trace-redis.js" script doesn't appear in the stack for Span.end at all, so it becomes hard/impossible for them to track down the source code location that made the redis client call.

I'll create a ticket (https://github.com/elastic/apm-agent-nodejs/issues/2092) for the node.js agent to:

Get a ballpark estimate of the perf impact of that Error.captureStackTrace(obj, Span) when captureSpanStackTraces is enabled.
Argue for captureSpanStackTraces being false by default (possibly in the next major rev of the agent) if the perf impact is at all significant. (Paying CPU for span.stacktrace that one very rarely sees/uses is hard to justify).
Update the performance tuning guide.

trentm commented 2 years ago

Would/will it be a pain (i.e. breaking any consistency goals) if the Node.js APM agent is the only one that requires a user to change the config to ever capture a span.stacktrace?

felixbarny commented 2 years ago

Paying CPU for span.stacktrace that one very rarely sees/uses is hard to justify

Agree

IIRC, the .NET agent might also sometimes take stack trace snapshots at the beginning of a span, to make async tracing more useful. @gregkalapos, is that correct?

I guess there may be similar concerns for the Java agent. It only ever captures stack traces at the end of a span. But if the user's app is reactive/async, the stack traces may not be useful. But as most Java apps are synchronous, it's not as big of an issue as in Node.js.

gregkalapos commented 2 years ago

IIRC, the .NET agent might also sometimes take stack trace snapshots at the beginning of a span, to make async tracing more useful. @gregkalapos, is that correct?

Yes, that's correct. One example for this is outgoing HTTP requests.

felixbarny commented 2 years ago

Would/will it be a pain (i.e. breaking any consistency goals) if the Node.js APM agent is the only one that requires a user to change the config to ever capture a span.stacktrace?

Looks like the same also applies to .NET.

So either we consistently set the default to -1 (never capture stack trace) for all agents, or we set it to -1 only for agents that would need to capture the stack at the start of the span. In the UI, when the stack trace tab is empty, we could link to central config so that users who want stack traces can customize span_stack_trace_min_duration.

Opinions?

AlexanderWert commented 2 years ago

In the UI, when the stack trace tab is empty, we could link to central config so that users who want stack traces can customize span_stack_trace_min_duration.

For OTel agents we need to hide the stack trace tab completely, I guess. So we would need to differentiate OTel agents from our own agents for the behavior of this tab.

gregkalapos commented 2 years ago

Would/will it be a pain (i.e. breaking any consistency goals) if the Node.js APM agent is the only one that requires a user to change the config to ever capture a span.stacktrace?

Looks like the same also applies to .NET.

So either we consistently set the default to -1 (never capture stack trace) for all agents, or we set it to -1 only for agents that would need to capture the stack at the start of the span. In the UI, when the stack trace tab is empty, we could link to central config so that users who want stack traces can customize span_stack_trace_min_duration.

Opinions?

I can't tell if "off by default" would be a bad thing from a user's point of view, but I know that not capturing call stacks would have a very positive impact on performance for .NET. So from this reason, -1 as default sounds good to me at least for .NET and Node.js. I can imagine for other agents this is a non-issue and a higher default (e.g. 500ms as suggested here) would be ok.

Only speaking for .NET: On the other hand, we don't seem to have complains for this, so that makes the decision harder. Usually I tend to go with an approach which causes the least questions and issues for users. So with that, although I think retrospectively -1 as default would be better, I'd still stick with the "on by default", just to not confuse people (the idea with the UI help sounds good, still it could be a hassle for users since it was on be default in all prior versions). So at the end, defaulting to 500ms sounds good - we know that for .NET in case of async that won't help much, but so be it. And then if we have users complaining (which we haven't had in years) only then would I change the default to -1.

felixbarny commented 2 years ago

For Node.js, we did get complains and that setting is a major reason why some users found it hard to adopt the agent due to the performance impact. Therefore, our recommended default configuration is to disable the feature for Node.js.

What's different in .NET is that the stack trace is not captured in the beginning for all spans. Also, stack trace collection may be not as expensive as in Node.js. However, if we measured the impact for .NET, I would not be surprised if stack trace collection is still a major part of the overhead the agent creates.

Having said that, I think it's fine if we start with -1ms as a default for Node.js and 500ms for others.

basepi commented 2 years ago

Per the discussion in the agents weekly today, I think python will probably retain the old default of 5ms and make this setting more prominent in the troubleshooting/tuning docs in case users want to reduce overhead. I think it's a very useful feature and we don't have many (any?) performance complaints at the moment. /cc @beniwohli

felixbarny commented 2 years ago

I'll wait a few days for more input but for now it seems we can close this and just change the default to -1ms in the Node.js agent in an upcoming minor version. Other agents may choose to change their defaults, too, if they see a need to tune their defaults for better performance. It's ok that different agents have different defaults.

trentm commented 2 years ago

close this and just change the default to -1ms in the Node.js agent in an upcoming minor versio

Sounds good to me. We also have to first add the span_stack_trace_min_duration config var (deprecating the existing captureSpanStackTraces and spanFramesMinDuration config vars) -- we haven't yet done https://github.com/elastic/apm-agent-nodejs/issues/2381 :).

elastic / apm

Increase span_stack_trace_min_duration #585

:green_heart: Build Succeeded