dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.98k stars 4.66k forks source link

EventPipe wakes up every 5 seconds #11638

Closed jkotas closed 4 years ago

jkotas commented 5 years ago

EventPipe is unconditionally creating Timer during startup that wakes up every 5 seconds: https://github.com/dotnet/coreclr/blob/b6c346828eb393d93dd6204a1e88a651e68cdb3e/src/System.Private.CoreLib/src/System/Diagnostics/Eventing/EventPipeController.cs#L87

This behavior is really bad for both client (kills battery life) and servers (kills density).

jkotas commented 5 years ago

This was introduced by dotnet/coreclr#20238.

@brianrob @vancem Is this a bug? (I hope so...)

vancem commented 5 years ago

Our desire for EventPipe in .NET core is to be controlled via a named pipe (see https://github.com/dotnet/coreclr/issues/21027). However that was considered too difficult to fit into V2.2, so we looked for 'easy' ways of getting as much as we could for V2.2 The idea was to use a very simple file-based prototocol for controlling tracing. The result was PR dotnet/coreclr#20238.

It is expected that in the V3.0 release we would move to the named pipe design and this issue would go away.

The polling in the V2.2. design was to allow tracing of already started processes. We considered the FileSystemWatcher API, but you run into a depencency problem (FileSystemWatcher is not System.Private.Corelib), and given that this is not our long term design and the time constraint on implemenation, we rejected that.

Originally our design was that you had to set an environment variable to get the polling, however, relatively late in the design we considered that not having it on 'by default' was a rather important stumbling block and switched the default.

We did know polling like this is does have negative consequences, but those consequences are inversely proportional to polling interval, and the polling interval is pretty long (My original suggestion was every 10 seconds, and I could live with something as large as 30 seconds).

I judged the overhead to be small enough to be acceptable, given that the problem will be fixed as part of https://github.com/dotnet/coreclr/issues/21027.

If you have data/insight into the problem being bigger that I realize, please let us know. V2.2 has shipped but if you think it is serious we could increase the polling interval/and/or provide an opt-out if you want.

But for V3.0 the problem should go away as part of https://github.com/dotnet/coreclr/issues/21027.

jkotas commented 5 years ago

the problem being bigger that I realize

For server workloads, I expect that this will show up as a problem once .NET Core 2.2 starts being used by the freemium Azure Websites and other similar high-density workloads. They are heavily optimized around most processes sitting idle and paged out for extensive periods of time. It is common to see many thousands of processes packaged onto a single machine. Once each of these thousands of processes wakes up every 5 or 30 seconds, I expect that it is going to affect their performance profile a lot.

benaadams commented 5 years ago

Don't need to capture the ExecutionContext onto the timer? https://github.com/aspnet/AspNetCore/issues/4731#issuecomment-447567359

image

/cc @stephentoub

vancem commented 5 years ago

I talked with @mjsabby and it looks like the Bing team noticed the pings @mjsabby can you document the impact to your system so we can prioritize.

As mentioned this issue goes away naturally in V3.0 when we implement dotnet/runtime#11481, but I think we need an opt-out immediately.

I agree that it is trivial to exclude the execution context that @benaadams found so that should be done as well.

vancem commented 5 years ago

It looks like @benaadams ExecutionContext issue has already been fixed in master branch

https://github.com/dotnet/coreclr/blob/master/src/System.Private.CoreLib/src/System/Diagnostics/Eventing/EventPipeController.cs#L111

benaadams commented 5 years ago

It looks like @benaadams ExecutionContext issue has already been fixed in master branch

I'm a bit confused as it looks like the timer always had flowExecutionContext: false; so not sure why the connection to EC

mjsabby commented 5 years ago

We've blocked rollout of 2.2 to our servers after we noticed a rise in TLB miss rate. It turns out we had lots of dormant processes that were now waking up. I don't have CSwitch data but I was able to confirm that for a period of 30 seconds no other thread was waking up besides the EventPipe timer.

If there is already a desire to fix this, I would imagine mitigating with an opt-out soon would be a good idea.

jkotas commented 5 years ago

I think we should service 2.2 to have this opt-in. I think the severity of the problem that this is introducing outweighs the convenience. I expect that every workload that matters will run into this issue.

mjsabby commented 5 years ago

Can we have this fixed soon?

jorive commented 5 years ago

@jkotas @vancem Can this issue be closed? Workaround were merged into master and release/2.2.

jkotas commented 5 years ago

Assuming that this is going to be fixed properly by dotnet/runtime#11481 in 3.0, it is ok to close this one.

tommcdon commented 5 years ago

We are fixing dotnet/runtime#11481 in 3.0, so closing this issue