erlang / otp

Erlang/OTP
http://erlang.org
Apache License 2.0
11.43k stars 2.96k forks source link

ERL-355: LTTng dynamic tracing: 'depth' field for 'function_call' for 'function_return' tracepoints #3294

Open OTP-Maintainer opened 7 years ago

OTP-Maintainer commented 7 years ago

Original reporter: boris.okner Affected version: OTP-19.1.1 Component: runtime_tools Migrated from: https://bugs.erlang.org/browse/ERL-355


When using 'function_call' and 'function_return' tracepoints, LTTng viewer always show 'depth = 0' regardless of the actual stack depth. Following is the sample of the LTTng viewer output showing one function calling another, with the inner function call and return having 'depth = 0' (whereas it should have been 1):

[20:56:09.773366183] (+0.000000084) jabber-node03 org_erlang_dyntrace:function_call: { cpu_id = 4 }, { pid = "<0.31599.6>", entry = "ezlib:process/2", depth = 0 }
[20:56:09.773379981] (+0.000000136) jabber-node03 org_erlang_dyntrace:function_call: { cpu_id = 4 }, { pid = "<0.31599.6>", entry = "ezlib_nif:process_buffer/2", depth = 0 }
[20:56:09.773399320] (+0.000000762) jabber-node03 org_erlang_dyntrace:function_return: { cpu_id = 4 }, { pid = "<0.31599.6>", entry = "ezlib_nif:process_buffer/2", depth = 0 }
[20:56:09.773406352] (+0.000000256) jabber-node03 org_erlang_dyntrace:function_return: { cpu_id = 4 }, { pid = "<0.31599.6>", entry = "ezlib:process/2", depth = 0 }

Having depth value right is critical for us to properly analyze the traces, as it gives a shield against lost lttng records and/or function calls that do not return (either by design or for other reasons). 
OTP-Maintainer commented 7 years ago

lukas said:

The lttng tracing is implemented using the erl_tracer NIF framework and thus has access to the same information as exposed through the nif API. At the moment, it is not possible to get the current stack size from a NIF so the lttng probes do not provide it. The reason it is in the trace at all it because the dtrace/systemtap probes have it and we want the different tracing technologies to be consistent.

With regards to your assertion that you need it in order to analyze traces. Having the depth will not give you any guarantees that you will catch all lost trace messages. For instance the process may pop the stack multiple times and then call down again to the same function using a different path. So no matter what you do, you will have to deal with the possibility of lost trace messages and not being able to reconstruct the call path of a process.

That being said, if you or someone else wants to contribute a way for nifs to get this information, I don't think we'd have any problems with that. However with the way that dirty-schedulers work it will be a non-trivial thing to implement.
OTP-Maintainer commented 7 years ago

lukas said:

It may also be possible to implement via extending the matchspec functions related to tracing, in the same way as process_dump works.
OTP-Maintainer commented 7 years ago

boris.okner said:

Thank you for the clarification and suggestions. You are right in that having depth value alone is not sufficient to reliably analyze the call path. However, it will greatly reduce the number of cases where the analyzing code has to make complex decisions to deal with the broken call stack.
It would also be great if the documentation mentioned that the 'depth' field does not currently represent the stack depth and is there for consistency only.