Linux: control of tracing

tmds commented 7 years ago

Tracing is disabled by default and requires COMPlus_EnableEventLog=1 to be set. When setting this variable all EventSources will IsEnable() == true even when the lttng DotNETRuntime:EventSource is not enabled. So these EventSources will do allocations&computations for which the results are thrown away.

Can we give control to LTTng to enable/disable trace events indepentent of COMPlus_EnableEventLog? For EventSources we can still control IsEnable via COMPlus_EnableEventLog. This will make it possible to trace but avoid allocations&computations. And also make it possible to do tracing (except EventSource) of apps that don't have COMPlus_EnableEventLog.

Ideally EventSource.IsEnabled would reflect DotNETRuntime:EventSource enabled. Then we no longer need COMPlus_EnableEventLog.

tmds commented 7 years ago

@jkotas any thoughts on the proposed change?

jkotas commented 7 years ago

@brianrob @vancem Could you please comment on this?

vancem commented 7 years ago

First, the current state of Linux tracing is not what we want. It was a step in the direction that was better than nothing. Ultimately we don't want environment variables to be the controlling factor for Lttng tracing but use the LttNG mechanisms.

I assume you have seen the document

https://github.com/dotnet/coreclr/blob/master/Documentation/project-docs/linux-performance-tracing.md

Note the section on filtering where you can use

COMPLUS_EventSourceFilter – filter event sources by name
COMPLUS_EventNameFilter – filter events by name

Again this is a stop-gap measure that gives you some control immediately as we work to make progress on a better approach.

@brianrob can comment more on what our plans are.

brianrob commented 7 years ago

@tmds, what you are asking for is exactly what should be done. As of LTTng 2.7, it is now possible to check whether or not an event is enabled via a call to tracepoint_enabled - see http://lttng.org/docs/v2.10/#doc-probing-the-application-source-code.

Once the runtime can call tracepoint_enabled, then we can remove COMPlus_EnableEventLog.

With respect to EventSources, right now all of them are emitted via one tracepoint in LTTng. So, we can control EventSource.IsEnabled through a call to tracepoint_enabled, specifying the EventSource event type.

@tmds, if this is something that you're interested in helping with, I'd be glad to give you some more details about how you can go about implementing this.

tmds commented 7 years ago

I assume you have seen the document https://github.com/dotnet/coreclr/blob/master/Documentation/project-docs/linux-performance-tracing.md

Seems I missed the filtering section :open_mouth:

COMPLUS_EventSourceFilter – filter event sources by name COMPLUS_EventNameFilter – filter events by name

I just tried this. Looks like these envvars are case-sensitive: COMPLUS_ should be COMPlus_ instead.

@tmds, if this is something that you're interested in helping with, I'd be glad to give you some more details about how you can go about implementing this.

Definitely! You can reply on the issue or send me a mail (whichever you prefer): tdeseyn@redhat.com.

vancem commented 7 years ago

I fixed the Caseing of COMPLus_ variables in the doc

tmds commented 7 years ago

@brianrob I haven't heard from you, so I'm wondering if you saw I'd like to contribute.

brianrob commented 7 years ago

@tmds, apologies, this slipped by me. Thank you for offering to contribute.

Unfortunately, as I review this area, removal of the environment variable is more complex than I remembered. The calls to tracepoint_enabled are actually already present in much of the codebase, but they are still guarded for the most part by the environment variable check. There are a couple of issues here:

There are a few ifdefs in the runtime that allow you to test whether or not logging is enabled, but they work on ETWisms - specifically, keyword and level (these represent groups of events). LTTng operates at the event level, and so we'd need to either figure out how to emulate these, or see if we can get rid of the calls entirely, and replace them with event level tests. I think we should do the latter.
EventSources are wired such that ETW "calls back" into them when it tries to enable them. ETW does this by injecting a thread into the process and calling a registered callback function. LTTng does not have any such support. It is something that they are open to, and @compudj actually produced a prototype for this sometime back, but we've not yet done the work to sponsor getting it merged. One option here might be to have a scheme where we periodically check to see if the tracepoint has been enabled, but we may be better off trying to get the support we need into LTTng and then re-address this. @vancem, what are your thoughts on this?

With respect to the environment variable itself - there is work that we can do in the runtime now to get ready to remove the environment variable that will make this whole scheme much cleaner.

Start addressing the usages of the few ifdefs that make this difficult. The ifdefs are defined here: https://github.com/dotnet/coreclr/blob/master/src/inc/eventtracebase.h#L112. I suspect that the usages will fall into a few categories and we can come up with some rules to replace them.
We should be able to remove the environment variable check here https://github.com/dotnet/coreclr/blob/master/src/scripts/genXplatEventing.py#L287 because we already have the call to tracepoint_enabled wired into these checks.

Would you be up for giving this a shot?

tmds commented 7 years ago

We should be able to remove the environment variable check here https://github.com/dotnet/coreclr/blob/master/src/scripts/genXplatEventing.py#L287 because we already have the call to tracepoint_enabled wired into these checks.

@brianrob XplatEventLogger::IsEventLoggingEnabled is inlined and EventXplatEnabledSomeEvent() isn't. Is the goal to be capable of inlining the latter? To do that, the caller of EventXplatEnabledSomeEvent needs to take a direct dependency on lttng. I believe this is currently an indirect dependency. runtime depends on eventprovider which depends on lttng. What are your thoughts on this?

brianrob commented 7 years ago

@tmds, you are right, EventXplatEnabledSomeEvent is not inlined, as it is implemented in the PAL. Last I checked, the VM (where this code is called from) can't take a dependency directly on LTTng, but instead must go through the PAL. It would be nice to be able to inline the checks, as it's certainly going to be cheaper. @janvorli, is it still the case that we can't call LTTng from the VM directly?

The reason for the environment variable wasn't necessarily that the call can be inlined, but that it was an easy way to ensure that we didn't experience tracing overhead when tracing wasn't required. This implementation essentially deferred properly implementing the isenabled defines in eventtracebase.h or replacing them with event specific isenabled checks.

brianrob commented 7 years ago

@nategraf

janvorli commented 7 years ago

@brianrob VM can never include any platform headers, because it would collide with our private declarations / implementations of some functions that also exist in the platform headers. You've said that the EventXplatEnabledSomeEvent is implemented in PAL, but isn't it actually in the libcoreclrtraceptprovider.so?

tmds commented 7 years ago

@brianrob VM can never include any platform headers, because it would collide with our private declarations / implementations of some functions that also exist in the platform headers.

@janvorli does that mean it must remain separate as-is, and we shouldn't try to inline the lttng event enabled checks?

compudj commented 7 years ago

FWIW, the lttng-ust public headers symbols are "namespaced" with a prefix to every symbol. So colliding with lttng-ust's public declarations would have to be done intentionally.

brianrob commented 7 years ago

@janvorli, I had to go back and check this to be sure.

EventXplatEnabledSomeEvent is implemented in the PAL. It is generated at build time and dropped in bin/obj/Linux.x64./Generated/eventprovider/LTTng (see eventprovdotnetruntime.cpp in this directory for some examples). These are included in the PAL via a generated CMakeLists.txt here: https://github.com/dotnet/coreclr/blob/master/src/pal/src/CMakeLists.txt#L378.

The tracepoint definitions themselves are also generated and placed at bin/obj/Linux.x64./Generated/eventprovider/LTTng as well. They live in the tpdotnetruntime header files and traceptprovdotnetruntime cpp files. They are compiled into libcoreclrtraceptprovider.so via the CMakeLists.txt that is generated and placed in bin/obj/Linux.x64./Generated/eventprovider/tracepointprovider.

So, the calls to the LTTng functions tracepoint_enabled and do_tracepoint occur in the PAL, but use the pre-compiled tracepoint definitions that exist in libcoreclrtraceptprovider.so.

From an OS API perspective, the PAL factoring makes sense to me - we want to make sure that we have one set of platform APIs.

For tracing, I'm wondering if we can do things a bit differently. The way tracing works on Windows is that all of the IsEnabled checks occur inside the VM and are inlined - same with all of the argument setup and other ETW work that is required before we can log an event. This is because the code that we use to do this is all generated based on our ETW manifest and is compiled directly into CoreCLR.dll. It does not go back and forth to the OS to do this work. The compiled manifest only calls out to OS once the event is ready to be logged (just one OS call). It would be great if we could do this with LTTng as well, since it looks like it does the same thing - it only calls out of the process once the event is ready to be written to the tracepoint. The cost of calling tracepoint_enabled is always just checking an in-proc state variable, and the cost of calling do_tracepoint is essentially the same as tracepoint_enabled when the tracepoint is disabled. This would reduce the overhead from required function calls down to very tiny inlined functions.

What do you think @janvorli?

@compudj - is my explanation of the cost of tracepoint_enabled and do_tracepoint correct? Please help keep me honest. :)

compudj commented 7 years ago

Considering that the do_tracepoint() call should only happen when tracepoint_enabled() evaluates to true, you can indeed assume that do_tracepoint() is never called when the tracepoint is disabled, and therefore the whole cost boils down to checking an in-proc state variable, and doing a predicted branch, when the tracepoint is disabled.

janvorli commented 7 years ago

@brianrob I wonder, since the PAL is part of libcoreclr.so, the linker could be able to inline the calls when generating the final code.

tmds commented 7 years ago

@brianrob @janvorli I will try to implement some of these changes (without the inlining) next week.

brianrob commented 7 years ago

Thanks @tmds. Let me know if you have any questions.

@janvorli, I don't think that it can unless they are part of the same module. I'm not 100% sure on LLVM, but I know this is not possible with VC++.

vancem commented 7 years ago

Ideally we have a model where the LTTng logging system will call back into the runtime when it issues 'commands' (state changes). That allows the runtime to create a local data structure that knows how to filter at least in the common cases.

Even if LTTng does not support this, we can 'fake it' by doing polling at some interval (e.g. every GC, if one has not happened in the last second), until the time it does. We should work toward the model that logging systems send us commands (events) when state might change and we can cache things an know it does not change until the next command.

brianrob commented 7 years ago

Thanks @vancem. Agreed. We should look into working on the prototype that @compudj did to allow callbacks into the application when starting tracing.

@compudj, one thing that would be needed in order to make this work would be callbacks on both start and stop. Do you think this is a reasonable extension of the prototype that you did for the StateDumpOnStart notifiers?

brianrob commented 7 years ago

Also, @compudj thanks for confirming my understanding of do_tracepoint.

janvorli commented 7 years ago

I don't think that it can unless they are part of the same module. I'm not 100% sure on LLVM, but I know this is not possible with VC++.

@brianrob I am not sure I understand. PAL is part of libcoreclr.so, so the stuff from PAL and the runtime form one module.

brianrob commented 7 years ago

@janvorli, you're right, they do form one module. However, they are produced from two libs, right? One for the VM and one for the PAL. The calls cross the VM --> PAL boundary, and so they are declared as extern functions. To my knowledge, extern functions can be called from outside the lib, but they cannot be inlined into another lib.

tmds commented 7 years ago

EventSources are wired such that ETW "calls back" into them when it tries to enable them. ETW does this by injecting a thread into the process and calling a registered callback function. LTTng does not have any such support. It is something that they are open to, and @compudj actually produced a prototype for this sometime back, but we've not yet done the work to sponsor getting it merged. One option here might be to have a scheme where we periodically check to see if the tracepoint has been enabled, but we may be better off trying to get the support we need into LTTng and then re-address this. @vancem, what are your thoughts on this?

This means you need to wait for your target platforms to have an updated lttng. Perhaps it is possible to signal all EventSources they are enabled? And wire EventSource.IsEnabled to the tracepoint? Currently all EventSources are mapped to the same tracepoint. Will this change? Does lttng allow to register additional tracepoints dynamically? Do you have some thoughts on handling EventLevels?

Start addressing the usages of the few ifdefs that make this difficult. The ifdefs are defined here: https://github.com/dotnet/coreclr/blob/master/src/inc/eventtracebase.h#L112. I suspect that the usages will fall into a few categories and we can come up with some rules to replace them.

There are about 80 usages of ETW_TRACING_CATEGORY_ENABLED. Should I make an overview of the Context, Level, Keywords used?

We should be able to remove the environment variable check here https://github.com/dotnet/coreclr/blob/master/src/scripts/genXplatEventing.py#L287 because we already have the call to tracepoint_enabled wired into these checks.

I removed those checks in https://github.com/dotnet/coreclr/pull/14706. PTAL.

brianrob commented 7 years ago

Yes, I expect us to map each EventSource event to an individual tracepoint once this is possible. Yes, you're right this requires an updated LTTng, but it's something that we can require on a release boundary.

I think an overview of what uses ETW_TRACING_CATEGORY_ENABLED is good - specifically the types of uses so that we can try to come up with a good pattern on how to replace them.

tmds commented 6 years ago

Yes, you're right this requires an updated LTTng, but it's something that we can require on a release boundary.

It's the underlying platform that needs to support that library. So I don't think dotnet can enforce this at a release boundary.

Overview of ETW_TRACING_CATEGORY_ENABLED used in coreclr:

ceeload.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_INFORMATION, KEYWORDZERO
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PRIVATE_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_PERFTRACK_PRIVATE_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PRIVATE_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_PRIVATENGENFORCERESTORE_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_ENDENUMERATION_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_GCHEAPALLOCHIGH_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_GCHEAPALLOCLOW_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_GCHEAPANDTYPENAMES_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_GCHEAPDUMP_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_GCHEAPSURVIVALANDMOVEMENT_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_GC_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_JIT_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_JITTEDMETHODILTONATIVEMAP_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_LOADER_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_NGEN_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_OVERRIDEANDSUPPRESSNGENEVENTS_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_PERFTRACK_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_STARTENUMERATION_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_THREADING_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_TYPE_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_INFORMATION, KEYWORDZERO
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_VERBOSE, CLR_CODESYMBOLS_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_VERBOSE, CLR_JIT_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_VERBOSE, KEYWORDZERO
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_RUNDOWN_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_RUNDOWNAPPDOMAINRESOURCEMANAGEMENT_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_RUNDOWN_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_RUNDOWNEND_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_RUNDOWN_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_RUNDOWNJIT_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_RUNDOWN_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_RUNDOWNJITTEDMETHODILTONATIVEMAP_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_RUNDOWN_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_RUNDOWNLOADER_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_RUNDOWN_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_RUNDOWNNGEN_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_RUNDOWN_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_RUNDOWNOVERRIDEANDSUPPRESSNGENEVENTS_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_RUNDOWN_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_RUNDOWNPERFTRACK_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_RUNDOWN_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_RUNDOWNSTART_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_RUNDOWN_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_RUNDOWNTHREADING_KEYWORD
eventtrace.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_RUNDOWN_PROVIDER_Context, TRACE_LEVEL_VERBOSE, KEYWORDZERO
jitinterface.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_VERBOSE, CLR_JITTRACING_KEYWORD
multicorejitplayer.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PRIVATE_PROVIDER_Context, TRACE_LEVEL_VERBOSE, CLR_PRIVATEMULTICOREJIT_KEYWORD
prestub.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_VERBOSE, CLR_JIT_KEYWORD
proftoeeinterfaceimpl.cpp: MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_GCHEAPDUMP_KEYWORD

tmds commented 6 years ago

@brianrob what needs to happen with the ETW_TRACING_CATEGORY_ENABLEDs?

A suggestion. Perhaps we can add a define:

#define ETW_TRACING_CATEGORY_ENABLED(Context, Level, Keyword) ETW_TRACING_CATEGORY_##Context##_##Level##_##Keyword##_ENABLED()

And the add functions to map the specific ETW_TRACING_CATEGORY_ENABLED to EventDescriptors? e.g.

static inline bool ETW_TRACING_CATEGORY_MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context_TRACE_LEVEL_VERBOSE_CLR_JIT_KEYWORD()
{
    return EventXplatEnabledGCStart_V2() || EventXplatEnabledGCEnd_V1();
}

Or do you have something else in mind?

brianrob commented 6 years ago

@tmds, thanks for the overview. It doesn't look like there are too many sites involved, so that's good. I would change your suggestion slightly. Rather than maintain the existing define ETW_TRACING_CATEGORY_ENABLED, I'm tempted to remove it and replace each site where it is used with a check of one or more events. This makes the code clear that it is looking for a particular event instead of a category, which doesn't exist on all logging platforms.

As an example, instead of having:

ETW_TRACING_CATEGORY_ENABLED(MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, TRACE_LEVEL_INFORMATION, CLR_GC_KEYWORD)

let's change it to checking for an event that is part of the specific provider and matches the verbosity requested:

ETW_EVENT_ENABLED(MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, GCStart_V2)

Staying at this abstraction level also ensures that this functionality continues to work for the EventPipe which is not invoked through calls to EventXplatEnabled*.

It's probably also worthwhile to put some comments at the code sites so that the intent of these checks is understood.

How does that sound?

cc: @vancem for feedback as well.

tmds commented 6 years ago

@brianrob Sounds good. I was assuming the ETW_TRACING_CATEGORY_ENABLED could not be touch for reasons of compatibility with existing Windows EWT tracing behavior.

brianrob commented 6 years ago

@tmds, usually I'd rather not touch uses of the macros because there are many of them out there and it's nice to not have to change the kind of functionality available within the system. Unfortunately, this one really doesn't mesh well when we start going cross-platform, so I think it's probably best to remove it.

Do you want to give this a try? Perhaps a good first step is to figure out which event(s) should be used to replace each call.

tmds commented 5 years ago

EventPipe provides a mechanism for enabling/disabling tracing.

dotnet / runtime

Linux: control of tracing #9060