DataDog / dd-trace-dotnet

.NET Client Library for Datadog APM
https://docs.datadoghq.com/tracing/
Apache License 2.0

CPU overhead in Context.Run for ASM #4017

Closed Xeinaemm closed 1 year ago

Xeinaemm commented 1 year ago

Hey,

My client's product serves billions of requests, and after today's investigation we determined that enabling ASM (DD_APPSEC_ENABLED=1) results in roughly 8-11% CPU overhead. Everything points to Context.Run, the most expensive method in all of our profiles; the second most expensive is its lock. It runs as part of the .NET Core middleware pipeline and has a significant impact on every HTTP call: https://github.com/DataDog/dd-trace-dotnet/blob/1b6bbaa8f3d11ef8fc01c3b6ae9de37458bd9e61/tracer/src/Datadog.Trace/AppSec/Waf/Context.cs#L58 The params and the underlying dictionaries are fine, taking less than 1% CPU, but Context.Run itself takes 8% of CPU on average.

I'm not worried about CPU usage as such, but rather about the increased latency and CPU contention. On well-balanced infrastructure, 70% of operations wait in a CPU queue (CPU contention), which adds 5-15 ms of latency. On poorly balanced infrastructure where demand grows, it performs even worse: CPU contention increases and latency skyrockets. One container uses 1 vCPU and serves ~41 threads; ASM adds 3-5 threads on average.

The product is built on Alpine, runs on AWS Fargate, and has almost every feature provided by Datadog enabled. We are running version 2.27.0 and have already fixed the problem with debug logging (DD_TRACE_DEBUG=1).

I cannot prove it would be faster than the current implementation, but Context.Run could be rewritten so that a ChannelReader/ChannelWriter pair acts as a job (context) queue processor on a dedicated thread, handling work in a steady manner without locks. Comments like "it is not safe for concurrent access" would then disappear. A sketch of the idea follows.
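For illustration, here is a minimal sketch of that idea using System.Threading.Channels. The `WafJob` record and the `RunWafCheck` stub are hypothetical placeholders, not Datadog's actual types:

```csharp
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical job type standing in for one WAF evaluation request.
public sealed record WafJob(object Parameters, TaskCompletionSource<bool> Completion);

public sealed class WafJobQueue
{
    private readonly Channel<WafJob> _channel =
        Channel.CreateUnbounded<WafJob>(new UnboundedChannelOptions
        {
            SingleReader = true,   // one dedicated consumer, so no lock is needed
            SingleWriter = false,  // any request thread may enqueue
        });

    public WafJobQueue()
    {
        // Dedicated consumer loop; jobs are processed one at a time.
        _ = Task.Run(ProcessAsync);
    }

    public async Task<bool> EnqueueAsync(object parameters)
    {
        var job = new WafJob(parameters, new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously));
        await _channel.Writer.WriteAsync(job);
        return await job.Completion.Task; // await the verdict
    }

    private async Task ProcessAsync()
    {
        await foreach (var job in _channel.Reader.ReadAllAsync())
        {
            // Placeholder for the native WAF call currently guarded
            // by the lock in Context.Run.
            bool blocked = RunWafCheck(job.Parameters);
            job.Completion.TrySetResult(blocked);
        }
    }

    private static bool RunWafCheck(object parameters) => false; // stub
}
```

With `SingleReader = true`, all WAF calls are serialized on one consumer, which is exactly the property the "not safe for concurrent access" comment is guarding against, so the per-request lock would no longer be needed.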

anna-git commented 1 year ago

Hi Piotr,

Thanks for reporting this issue. We are aware that Context.Run is one of the main CPU consumers, as it performs most of the security processing, but these numbers seem a bit high. We'll run some verifications on Alpine regarding the CPU impact / added processing time. Could I ask what version of .NET Core you are running and what tool you used to get these measurements?

The lock you see should have no impact, as it would cause contention only in very rare cases: the Context object is request-scoped, and the lock is there to protect the rare situations where several threads within the same request want to run security checks at the same time. Most of the time the lock is acquired immediately with no wait, because only one thread (most likely the same one) accesses it.

Is the product always running with DD_TRACE_DEBUG=1? This might also add overhead, as more frequent I/O operations are performed to write the additional logs.
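To make the contention argument concrete, here is a simplified sketch of the pattern being described (not the actual Context.Run source). Since each request gets its own Context, the lock below is almost always uncontended:

```csharp
using System.Collections.Generic;

// Simplified illustration only: one Context instance exists per request,
// so the lock serializes security checks within a single request and is
// almost never contended across requests.
public sealed class Context
{
    private readonly object _sync = new object();

    public bool Run(IDictionary<string, object> args)
    {
        lock (_sync) // uncontended unless one request runs checks on several threads
        {
            // ... hand args to the native WAF and read back the verdict ...
            return false; // stub result
        }
    }
}
```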

Xeinaemm commented 1 year ago

Hey,

I've rechecked on .NET Core 3.1 and .NET 6 in both production and performance environments. I don't see many locks related to Datadog (4 locks per 1 million requests). Most locks are related to third parties and to console output, because we use AWS FireLens + Serilog to ship logs to Datadog, which is out of scope for this thread.

DD_TRACE_DEBUG is disabled.

We run containers at very low CPU, around 12% usage, to reduce CPU contention and decrease latency. Based on 1,000 Datadog profiles (Continuous Profiler) across all endpoints, Context.Run takes 7% of execution CPU time, while for CLR traces it's only 0.4%. That's roughly 0.8% of the container's CPU out of the 12%.

If you already know that Context.Run is the main CPU consumer, I don't know what more to add here. Maybe in the future it can be optimized so it doesn't interfere with hot-path execution, because right now it costs a lot :-)

anna-git commented 1 year ago

Hello,

Thanks for the additional information. Optimizing the run of our security scans is on our backlog, and future versions should show some improvements, but it's unlikely we will be able to make drastic ones. Note that detecting threats is CPU-intensive by nature: many string operations have to run, and regexes need to be executed against parts of the request's information.

Because ASM is also able to block attacks, i.e. stop the request's flow as soon as possible before it reaches any valuable resource, running on the request's hot path is one of the requirements. I'll go ahead and close this issue for now, but we'll take it into account when working on ASM optimizations in the near future. Thank you.
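As an illustration of why the check has to sit on the hot path, here is a minimal ASP.NET Core middleware sketch (hypothetical, not Datadog's actual middleware): blocking is only possible if the verdict is computed inline, before the rest of the pipeline runs.

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

// Hypothetical middleware: the security check must run inline so that a
// malicious request can be rejected before it reaches any downstream resource.
public sealed class InlineSecurityMiddleware
{
    private readonly RequestDelegate _next;

    public InlineSecurityMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext context)
    {
        // Placeholder standing in for the WAF evaluation.
        if (IsThreat(context.Request.Path, context.Request.QueryString.Value))
        {
            context.Response.StatusCode = StatusCodes.Status403Forbidden;
            return; // block before the rest of the pipeline executes
        }

        await _next(context); // hot path continues only for clean requests
    }

    private static bool IsThreat(string path, string query) => false; // stub
}
```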