DataDog / dogstatsd-csharp-client

A DogStatsD client for C#/.NET
datadoghq.com
Other
94 stars 62 forks source link

Blocking calls within dogstatsd client #158

Closed thomas-oliver-cko closed 2 years ago

thomas-oliver-cko commented 3 years ago

We have recently profiled the performance of our application and we can see that the dogstatsd package is taking a surprising amount of CPU time. When we looked deeper, we found that the dogstatsd is reserving an entire thread to poll a ConcurrentQueue for metrics. This tread polls the queue, and then waits. This wait time is wasted CPU time, and for high performance applications is a waste of resources. The main culprit is:

namespace StatsdClient.Worker
{
    internal class Waiter : IWaiter
    {
        public void Wait(TimeSpan duration)
        {
            Task.Delay(duration).Wait();
        }
    }
}

but Task.Wait() is called in multiple places within the package. By moving to async await, or by using some sort of event based timer, you release that thread back to the thread pool so it can be used by the main application. Can we get some thoughts on if/how this can be solved? We are very interested in seeing this blocking wait removed

ogaca-dd commented 3 years ago

Hello,

the dogstatsd package is taking a surprising amount of CPU time

Do you have some numbers to share (CPU usage and number of metrics sent)? How did you measure it? Did you measure the CPU usage, the time spent in the functions or the waiting time? Are you working in a very contraint environment with very low memory, very low cpu?

This wait time is wasted CPU time, and for high performance applications is a waste of resources

Having a single thread that blocks has a very low resource impact and is usually not a problem. One of the reason to use async is to avoid blocking 1000 threads when performing 1000 HTTP requests. Blocking 1000 threads has a resource impact.

I did not use async for latency reason. Consider the following scenario where async is used for Wait:

Note: When the number of DogStatsd methods called is low, the CPU usage should be closed to 0 (But the waiting time should be very high).

thomas-oliver-cko commented 3 years ago

This particular test was run on an Quad core Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz 2.59 GHz with 16GB RAM. Our tool was handling 100 incoming HTTP Requests Per Second (rps), and performing 300 outgoing HTTP rps. We profiled it using JetBrains dotTrace. In the test, our application takes an average of 23.6% of the CPU. Of the CPU used, 4.83% of CPU time is spent waiting in the dogstatsd client

tempsnip

Are you working in a very contraint environment with very low memory, very low cpu?

We have the option of scaling horizontally, and our applications have enough CPU and Memory allocated to them to run as they are. However, when around 5% CPU time under moderate load is spent waiting, it does make us pause and think.

The application is not sending any metrics and the background thread run Wait. It releases the thread to the thread pool The application starts using a lot of threads and keep them for long. As no thread is available, the async Wait still wait for an available thread. => The whole DogStatsD pipeline is blocked. The metrics are not sent and the new metrics start to be dropped.

In regards to thread starvation, if our application is suddenly under heavy load, we want all resources dedicated to serving customers. That is one use for the concurrent queue, so that if the application is under load, metrics can be buffered into memory for a period of time so they can be sent later. If thread starvation continues and we haven't yet scaled out, I am much happier dropping metrics then I am letting down customers.

Using async await or some sort of eventing trigger will only cause metrics to be dropped under high load

kevingosse commented 3 years ago

My two cents:

  1. Using a dedicated thread for sending metrics has pros and cons. Here, this is done to ensure that metrics are still pushed even if the threadpool is overloaded. Whether it's a good or bad thing is a debate of its own.

  2. The CPU load is increased specifically because of the Task.Wait function, which internally uses synchronization and a non-negligible amount of spinning. Switching to Thread.Sleep (https://github.com/DataDog/dogstatsd-csharp-client/pull/159) should solve this point in particular.

thomas-oliver-cko commented 3 years ago

Thanks @kevingosse. Thread.Sleep does block a thread, but it does yield CPU to other threads for the duration of the sleep which will certainly fix this problem, see here. I'm happy for this issue to be resolved by #159. We'll run another profile once this change is out and confirm it does what it says on the tin.

ogaca-dd commented 3 years ago

@thomas-oliver-cko : I will also try to reproduce the behavior and try to check if #159 solve the issue. What .NET runtime are you using?

thomas-oliver-cko commented 3 years ago

@ogaca-dd we are running netstandard 2.1

ogaca-dd commented 3 years ago

@thomas-oliver-cko DogStatsD 7.0.0 is out. It includes #159. Do you still have the issue with the latest version?

ogaca-dd commented 2 years ago

Closed as there is no recent activity on this issue. Feel free to reopen an issue if you still have the problem.