
LTTng/EventPipe filtering occurs too late, causes heavy perf overhead #12204

Closed noahfalk closed 4 years ago

noahfalk commented 5 years ago

We have various places in our code where we test ETW_TRACING_CATEGORY_ENABLED(…), then do a somewhat costly operation to produce some data, and then log that data into the tracing system. EventPipe and LTTng only filter events after they have been logged, but ETW_TRACING_CATEGORY_ENABLED returns true, which means we still pay the perf overhead of generating the data to be logged. A better implementation would update the ETW_TRACING_CATEGORY_ENABLED macro so that filtering happens up front, before that work is done (a sketch of the pattern follows after the examples). I am relaying this second-hand from @chrisnas, who included two examples:

  1. ETW::TypeSystemLog::SendObjectAllocatedEvent: this one caused a 60+% increase in startup time

     s_fHeapAllocEventEnabledOnStartup is initialized here using ETW_TRACING_CATEGORY_ENABLED

  2. A simpler example: ETW::GCLog::FireGcStart

This issue prevents people from effectively turning on LTTng/EventPipe in performance-sensitive scenarios because, regardless of how aggressively they filter, the runtime still performs the pre-logging work for all events, which adds significant overhead.
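For illustration, here is a minimal self-contained sketch of the pattern being described (all names are invented; this is not the runtime code). The point is that the category check only asks whether any tracing session is attached, so the expensive payload work runs even when the session has filtered out this event.

#include <cstdio>
#include <string>

static bool g_anySessionEnabled = true;    // models "some tracing session is attached"
static bool g_thisEventRequested = false;  // the session did NOT ask for this event

// Models how ETW_TRACING_CATEGORY_ENABLED behaves on EventPipe/LTTng today:
// it only asks whether any session is on, ignoring keyword/level.
#define CATEGORY_ENABLED() (g_anySessionEnabled)

static std::string BuildExpensivePayload()
{
    // Stand-in for walking the heap or type system to gather event data.
    return std::string(1 << 20, 'x');
}

static void FireEvent(const std::string& payload)
{
    // Per-event filtering only happens here, after the payload was already built.
    if (g_thisEventRequested)
        std::printf("logged %zu bytes\n", payload.size());
}

int main()
{
    if (CATEGORY_ENABLED())                             // true whenever a session exists...
    {
        std::string payload = BuildExpensivePayload();  // ...so this cost is always paid
        FireEvent(payload);                             // ...even though the event is dropped
    }
    return 0;
}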

noahfalk commented 5 years ago

cc @jorive, @vancem, @cshung, @sywhang

chrisnas commented 5 years ago

Thanks Noah! Note that the 60% increase was in our response time (not startup time). BTW, https://github.com/dotnet/coreclr/issues/14327 also seems related to the same root cause.

noahfalk commented 5 years ago

Yeah, the discussion way down near the bottom of dotnet/runtime#9060 seems to bump up against this same issue. I'll leave this one open because I think it is a more specific description of the problem.

sywhang commented 5 years ago

Do we also want to modify macros that are doing a similar thing to ETW_TRACING_CATEGORY_ENABLED? In particular, these two:

#define ETW_EVENT_ENABLED(Context, EventDescriptor) (EventPipeHelper::Enabled() || XplatEventLogger::IsEventLoggingEnabled())
#define ETW_CATEGORY_ENABLED(Context, Level, Keyword) (EventPipeHelper::Enabled() || XplatEventLogger::IsEventLoggingEnabled())
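For illustration, a rough sketch of what a keyword-aware version might look like on the EventPipe side (the IsEnabled helper and its signature are assumptions, not the shipped fix):

// Hypothetical sketch only: today the macro ignores its arguments, so any active
// session enables every call site. A keyword/level-aware version would pass them
// through to a helper that consults the provider's enabled keyword mask and level.
#define ETW_CATEGORY_ENABLED(Context, Level, Keyword) \
    (EventPipeHelper::IsEnabled(Context, Level, Keyword) || \
     XplatEventLogger::IsEventLoggingEnabled())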
noahfalk commented 5 years ago

Do we also want to modify macros that are doing similar things...

Yeah, I think we should. The ultimate goal is that runtime perf with LTTng turned on in a configuration that logs very few events (like GCCollectOnly) should show only negligible overhead relative to LTTng being off.

sywhang commented 5 years ago

Chatted with @noahfalk offline today and wanted to log it here for bookkeeping purposes :-)

On EventPipe side, I'll be sending out a fix soon with a change for these macros.

On the LTTng side, things are a little more complicated because we don't (yet) have a good way to check whether certain providers/keywords/levels are enabled, but all the call sites of these macros use combinations of providers/keywords/levels. One potential solution is to generate helper methods that check this by probing the right "event" for a given provider/keyword/level combination, using the ClrEtwAll.man file we already parse to generate tracepoints for LTTng. The other option is to change all these call sites to check whether a specific event is enabled (which we can already do on LTTng), and add a helper method on the ETW/EventPipe side that answers the same question by checking the event's corresponding provider/keyword/level combination (a sketch of this option follows below).

I just checked, and the number of call sites of these macros is in the 100s, not 1000s - sorry for the false alarm there @noahfalk :p - so I would gladly sacrifice an afternoon of my time to fix all of these. Whichever way we decide to go, the LTTng fix won't happen anytime soon because I'm blocked on EventCounters work, but I'm hoping to get it in within the 3.0 timeframe.
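For illustration, a rough sketch of what the second option might look like at a call site (all names here are invented): the code asks whether the one event it is about to fire is enabled, which LTTng can already answer per tracepoint, instead of asking about a keyword/level combination it knows nothing about.

// Hypothetical sketch of option 2 (invented names): gate the costly work on the
// specific event rather than on a keyword/level combination.
if (IsEventEnabled_GCStart())                           // per-tracepoint check, answerable on LTTng
{
    GCStartPayload payload = CollectGCStartPayload();   // expensive work is now correctly gated
    FireGCStartEvent(payload);
}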

jkotas commented 5 years ago

so I would gladly sacrifice an afternoon of my time to fix all of these.

Why can't we just fix the macros? Does every one of these callsites really need to change?

sywhang commented 5 years ago

For EventPipe that's doable (and that's what the fix does). On LTTng we don't have a notion of checking for a combination of provider/keyword/level; we can only check whether a single event is enabled or not.

As mentioned above, changing all the callsites is not needed if we decide to go down the other path, which is writing helper methods that probe whether "just the right event" for a particular provider/keyword/level is enabled, and then changing those macros accordingly. I'm open to either option, but if you or Vance feel strongly about one over the other, please let me know!

noahfalk commented 5 years ago

On LTTng we don't have a notion of checking for a combination of provider/keyword/level

For clarity, it is LTTng that doesn't have a notion of keywords, not a lack of effort or incompleteness on the runtime's part : ) Our choice is how to handle the mismatch between macros that assume keywords exist and a tracing system that doesn't provide them.

On a little further thought I think a reasonable approach might be taking the set of events (A, B, C) that are enabled by a keyword and assuming that if LTTng has enabled A || B || C then the keyword is active. It might be a little over-conservative for cases where LTTng enabled event C and the code looks like:

if (is_keyword_enabled(Foo))     // true, because event C is enabled under this keyword
{
   arg = do_some_work();         // wasted work: the session only asked for event C
   send_event_a(arg);            // event A gets filtered out anyway
}

In such cases we'll waste effort setting up to deliver event A that LTTng isn't listening for, as well as doing extra conditional checks in the implementation of is_keyword_enabled. I'm just making a bet that this case isn't common enough, or costly enough, for us to need to care. If it did turn out to matter we could always refactor the particular sites where problems arose.
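For illustration, a minimal sketch of the "OR the events together" idea (event and helper names are invented): a keyword is treated as enabled on LTTng if any tracepoint that the keyword maps to is enabled.

// Hypothetical sketch (invented names): map a keyword to the tracepoints it covers
// and report the keyword as enabled if any of them is enabled. This is what can make
// the check over-conservative: enabling only event C still reports the keyword as on
// for call sites that fire A or B.
static bool IsKeywordFooEnabledOnLTTng()
{
    return IsEventEnabled_A() || IsEventEnabled_B() || IsEventEnabled_C();
}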

gleocadie commented 5 years ago

On a little further thought I think a reasonable approach might be taking the set of events (A, B, C) that are enabled by a keyword and assuming that if LTTng has enabled A || B || C then the keyword is active.

This could have an impact if we don't also take the level into account, and because some events belong to more than one keyword (not many, but still), e.g. GCDynamicEvent. With the GCDynamicEvent event enabled, the check would consider GCKeyword, GCHandleKeyword, GCHeapDumpKeyword, GCSampledObjectAllocationHighKeyword, GCHeapSurvivalAndMovementKeyword, GCHeapCollectKeyword, GCHeapAndTypeNamesKeyword, and GCSampledObjectAllocationLowKeyword all activated. The numbers and flamegraph @chrisnas provided you were due to the activation of the GCSampledObjectAllocationLowKeyword and GCSampledObjectAllocationHighKeyword keywords.

I also worked on generating code that would allow checking a combination of provider/keyword/level. The mechanism can be configured with a string similar to what EventPipe uses; in my case I stored this string in an environment variable (for simplicity). We could replace the environment variable with something similar to what has been done for EventPipe: a thread that looks for a configuration file and updates the runtime's LTTng configuration. This file would be generated by the perfcollect script when it starts collecting events.
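For illustration, a tiny self-contained sketch of the environment-variable idea (the variable name is invented; the "Provider:Keywords:Level" string format is borrowed from EventPipe's configuration style):

#include <cstdio>
#include <cstdlib>
#include <string>

int main()
{
    // Hypothetical variable name; format modeled on EventPipe's config string,
    // e.g. "Microsoft-Windows-DotNETRuntime:0x1:4" (provider:keywords:level).
    const char* cfg = std::getenv("CORECLR_LTTNG_CONFIG");
    if (cfg == nullptr)
    {
        std::puts("no filter configured; behave as before (everything enabled)");
        return 0;
    }

    std::string s(cfg);
    size_t p1 = s.find(':');
    size_t p2 = s.find(':', p1 + 1);
    std::string provider = s.substr(0, p1);
    unsigned long long keywords = std::stoull(s.substr(p1 + 1, p2 - p1 - 1), nullptr, 16);
    int level = std::stoi(s.substr(p2 + 1));

    // A real implementation would store these and have the keyword/level checks
    // consult them instead of asking "is any session attached".
    std::printf("provider=%s keywords=0x%llx level=%d\n", provider.c_str(), keywords, level);
    return 0;
}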

Note: not all events have a check on provider/keyword/level before preparation and emission. At some point it would be nice to have the same logic everywhere (e.g. the Contention event).

noahfalk commented 5 years ago

Thanks @gleocadie

@sywhang special-cased the GC part by identifying specific events that were unique to keyword/level combinations, and I was expecting we'd leave that as-is (https://github.com/dotnet/coreclr/pull/22707). I think of the general approach as doing the simple thing (ORing the events) where perf testing shows no impact, and then being willing to special-case, with more complex logic, the areas that are negatively impacted.

We can replace the environment variable by something similar to what has been done for the EventPipe

If you can indulge a quick hypothetical... : ) Imagine we changed perfcollect so it used EventPipe instead of LTTng, and you could then use EventPipe's configuration mechanisms to control what gets logged. It would generate a .netperf file instead of LTTng's trace format, but TraceEvent would still parse it as before and give you the same set of events to analyze in PerfView or in custom tools. Does that sound like it would solve your team's concerns about configuration? Does it introduce new concerns about something LTTng was giving you that EventPipe would not?

sywhang commented 5 years ago

dotnet/coreclr#25221 and dotnet/coreclr#25374 should've resolved this issue. I filed https://github.com/dotnet/coreclr/issues/25421 to track a simple remaining work item I want to get in by 3.0 ZBB.