dotnet / runtime


Lock convoy on TimerQueueTimer #8646

Closed: maximburyak closed this issue 4 years ago

maximburyak commented 7 years ago

Hi, here: https://github.com/dotnet/coreclr/blob/master/src/mscorlib/src/System/Threading/Timer.cs#L555 you lock on a static field. Under a certain load of timers, this can produce a very annoying lock convoy. It's not easy to reproduce, but it happens. It is more likely to show up with a large number of thread pool threads, in which case many threads end up with stacks like this:

System.Threading.TimerQueueTimer.Fire()
System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
System.Threading.ThreadPoolWorkQueue.Dispatch()

which prevent these and other timers from firing, or even from being created.
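There is no runnable repro attached to this report, but as a rough illustration only, the kind of load being described (many live timers with short periods, all funnelling through the shared TimerQueue lock) might look like the sketch below; the timer count and period are made-up numbers, not values from the affected service.

```csharp
using System;
using System.Threading;

class TimerConvoySketch
{
    static long _fired;

    static void Main()
    {
        // Create a large number of short-period timers; every firing and
        // every Change() call goes through the shared TimerQueue lock.
        var timers = new Timer[20_000];
        for (int i = 0; i < timers.Length; i++)
        {
            timers[i] = new Timer(_ => Interlocked.Increment(ref _fired),
                                  state: null,
                                  dueTime: TimeSpan.FromMilliseconds(100),
                                  period: TimeSpan.FromMilliseconds(100));
        }

        // Watch the callback rate; a collapsing rate together with high CPU
        // spent spinning on the lock is the symptom described above.
        for (int second = 1; second <= 30; second++)
        {
            Thread.Sleep(1000);
            Console.WriteLine($"{second,2}s: {Interlocked.Read(ref _fired)} callbacks fired");
        }

        foreach (var timer in timers)
            timer.Dispose();
    }
}
```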

kouvel commented 7 years ago

It's possible, but usually there is another problem that leads to this. See my response here: https://github.com/dotnet/coreclr/issues/12410. Are you sure it's not any of those issues?

Lock convoy issues in the monitor are covered here: https://github.com/dotnet/coreclr/issues/11979

maximburyak commented 7 years ago

Hi, thanks for the prompt response! Similarly to the referenced issue, the cause is a large number of timers. We've set a high value for the thread pool's minimum thread count (300 for each type) so that we can handle big bursts of work, because the thread pool is slow to respond to load. I hope you will consider changing that implementation.
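For reference, the thread pool configuration described above amounts to something like the following; the 300-per-type value is from the comment, the rest is illustrative.

```csharp
using System.Threading;

// Raise the minimum worker and I/O completion port thread counts so the
// pool ramps up quickly during bursts (300 of each, as described above).
// This only sets a floor for the pool's thread-injection heuristics;
// it does not create the threads up front.
ThreadPool.SetMinThreads(workerThreads: 300, completionPortThreads: 300);
```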

kouvel commented 7 years ago

@maximburyak, why was it necessary to increase the min thread count? Typically, this is only necessary if there is a lot of blocking work happening on thread pool threads. If your timer callbacks don't block, the default thread count should give better throughput.

kevingosse commented 7 years ago

Not sure if that's the issue you're hitting, but timers scale really badly, especially with CPUs that have a large number of cores.

Put together, we have seen the performance of low-latency services (< 50 ms) running on 32-core servers completely collapse because of timers. Basically, there's a threshold where the short interval (hence more frequent operations that need the lock) coupled with the number of timers (hence a longer time spent inside the lock per operation) starts causing contention. When it does, the spinning eats large amounts of CPU and everything crumbles.

We bypassed the issue by using our own timer implementation (based on the timer wheel algorithm), but I wonder if something could be done globally.

The thing is, a service application can quickly start using a large number of timers without realizing it (Task.Delay, CancellationTokenSource.CancelAfter, ...).
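The comment above mentions working around the problem with a timer-wheel-based implementation. That implementation isn't linked here, but as a rough illustration only, a single-level hashed timer wheel can be sketched as follows; the class and member names, bucket count, and tick size are all made up, and a real version would also need cancellation, disposal, and error handling around callbacks.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// A single-level hashed timer wheel: timeouts are hashed into buckets by due
// tick, and a background loop advances one bucket per tick. Scheduling is O(1)
// and each tick only inspects one bucket, so lock hold times stay short.
public sealed class TimerWheel
{
    private sealed class Entry
    {
        public Action Callback;
        public int RemainingRounds; // full revolutions of the wheel left before firing
    }

    private readonly List<Entry>[] _buckets;
    private readonly TimeSpan _tick;
    private readonly object _gate = new object();
    private int _cursor;

    public TimerWheel(int bucketCount = 512, int tickMilliseconds = 10)
    {
        _buckets = new List<Entry>[bucketCount];
        for (int i = 0; i < bucketCount; i++) _buckets[i] = new List<Entry>();
        _tick = TimeSpan.FromMilliseconds(tickMilliseconds);
        Task.Run(RunAsync);
    }

    public void Schedule(Action callback, TimeSpan delay)
    {
        // Round the delay up to whole ticks; timeouts fire with tick granularity.
        int ticks = Math.Max(1, (int)Math.Ceiling(delay.TotalMilliseconds / _tick.TotalMilliseconds));
        lock (_gate)
        {
            int slot = (_cursor + ticks) % _buckets.Length;
            _buckets[slot].Add(new Entry
            {
                Callback = callback,
                RemainingRounds = (ticks - 1) / _buckets.Length
            });
        }
    }

    private async Task RunAsync()
    {
        while (true)
        {
            await Task.Delay(_tick).ConfigureAwait(false);
            List<Action> due = null;
            lock (_gate)
            {
                _cursor = (_cursor + 1) % _buckets.Length;
                List<Entry> bucket = _buckets[_cursor];
                for (int i = bucket.Count - 1; i >= 0; i--)
                {
                    if (bucket[i].RemainingRounds == 0)
                    {
                        if (due == null) due = new List<Action>();
                        due.Add(bucket[i].Callback);
                        bucket.RemoveAt(i);
                    }
                    else
                    {
                        bucket[i].RemainingRounds--;
                    }
                }
            }
            // Run callbacks outside the lock so a slow callback cannot convoy it.
            if (due != null)
                foreach (Action cb in due) ThreadPool.QueueUserWorkItem(_ => cb());
        }
    }
}
```

Usage would look something like new TimerWheel().Schedule(() => DoWork(), TimeSpan.FromSeconds(5)). The design point is that scheduling is O(1) and each tick only touches one bucket, so no single operation holds the lock for long, in contrast to keeping every timer in one list protected by one lock.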

benaadams commented 6 years ago

Resolved by "Reduce Timer lock contention" (dotnet/coreclr#14527)?

benaadams commented 6 years ago

Also "Improve Monitor scaling" (https://github.com/dotnet/coreclr/pull/14216), though that hasn't been merged yet.

stephentoub commented 6 years ago

Is there a repro for this we could try with the fixes?