getsentry / sentry-dotnet

Sentry SDK for .NET
https://docs.sentry.io/platforms/dotnet
MIT License

Increased CPU and memory overhead while profiling #3199

Open InterstellarStella opened 4 months ago

InterstellarStella commented 4 months ago

Package

Sentry.AspNetCore

.NET Flavor

.NET

.NET Version

.NET 7 / .NET 8

OS

Any (not platform specific)

SDK Version

4.1.2

Self-Hosted Sentry Version

No response

Steps to Reproduce

Enable profiling through the following setup:

SentrySdk.Init(options =>
{
    options.Dsn = sentryConfig.Dsn;
    options.TracesSampleRate = 0.1;
    options.ProfilesSampleRate = 0.1;
    options.Environment = sentryConfig.Environment;
    options.DiagnosticLevel = diagnosticLevel;
    options.AddIntegration(new ProfilingIntegration(TimeSpan.FromMilliseconds(500)));
});

Expected Result

The CPU overhead is about +5 to 10%.

Actual Result

The CPU overhead is around +70%, as seen through the Google Cloud monitoring system. The following Sentry packages are used: Sentry 4.1.2, Sentry.AspNetCore 4.1.2, Sentry.Profiling 4.1.2.

Slack conversation Zendesk ticket

bruno-garcia commented 4 months ago

cc @vaind

Could we have some details about what the app is doing? With a repro we can look at exactly what's going on and make improvements, but with the currently provided information it's hard for us to figure out where the problem is.

Is this a web server? How many requests in parallel are running?

If this has high throughput, try changing the sample rates to:

    options.TracesSampleRate = 0.001;
    options.ProfilesSampleRate = 0.001;

Does that reduce the CPU overhead considerably? (Since ProfilesSampleRate is applied on top of TracesSampleRate, those values would profile roughly one in a million requests.)

saschash commented 4 months ago

@bruno-garcia I tried setting the sample rates to 0.001 and there is still quite a significant spike (from 7% CPU -> 40% CPU). In fact, it was about the same as with a sampling rate of 1.0, which is weird... The behaviour is that it first jumps to 60% and then stabilises at 40%. It is a web server, but in a sandbox environment, and we are running 20 workers in parallel.

I have mentioned before that we are using Hangfire to queue background jobs, and I have noticed a call like this: Sentry.AspNetCore.SentryTracingMiddleware.InvokeAsync(HttpContext context). But when it is a Hangfire background job, HttpContext will be null, so that might somehow explain what is going on.
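
For what it's worth, one way to test that hypothesis would be a TracesSampler that only samples HTTP request transactions, so background-job transactions never get traced (and therefore never profiled). This is only a rough sketch, not code from this thread; it assumes the ASP.NET Core integration marks its request transactions with the "http.server" operation:

SentrySdk.Init(options =>
{
    options.Dsn = sentryConfig.Dsn;
    options.ProfilesSampleRate = 0.1;
    // When TracesSampler is set, it takes precedence over TracesSampleRate.
    options.TracesSampler = context =>
        context.TransactionContext.Operation == "http.server"
            ? 0.1  // keep sampling HTTP request transactions
            : 0.0; // drop everything else (e.g. background jobs) while testing
    options.AddIntegration(new ProfilingIntegration(TimeSpan.FromMilliseconds(500)));
});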

bruno-garcia commented 4 months ago

I tried setting the sample rates to 0.001 and there is still quite a significant spike (from 7% CPU -> 40% CPU). In fact, it was about the same as with a sampling rate of 1.0, which is weird...

This didn't come up during our tests, so I'm a bit surprised, but I wonder if it might be the overhead of the .NET profiler itself, which unfortunately samples at a high frequency; we downsample after the fact.

I have mentioned before that we are using Hangfire to queue background jobs, and I have noticed a call like this: Sentry.AspNetCore.SentryTracingMiddleware.InvokeAsync(HttpContext context). But when it is a Hangfire background job, HttpContext will be null, so that might somehow explain what is going on.

I wonder if this is to get the transaction name automatically.

Do you get the overhead without profiling turned on? Just keep transactions (TracesSampleRate set to some value).
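
In other words, something like this minimal variant of the setup from the issue description, with the profiling options removed (sketch only):

SentrySdk.Init(options =>
{
    options.Dsn = sentryConfig.Dsn;
    options.TracesSampleRate = 0.1;
    // No ProfilesSampleRate and no ProfilingIntegration, so the profiler never starts.
});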

By any chance, in the profiles in Sentry, do you see any frames from Sentry that seem to be taking up a significant part of the time?

saschash commented 4 months ago

Thanks @bruno-garcia. Without profiling it works like a charm, without any noticeable overhead. I could not find anything from Sentry taking a lot of time; in fact, I couldn't see any Sentry frames at all - which is weird, right? I should be able to see Sentry-related processing there...

bruno-garcia commented 4 months ago

Sorry for the delay here. Any chance you can get a repro to help us debug this? @vaind any ideas for further investigation?

vaind commented 4 months ago

I've tried reproducing this locally with our aspnetcore sample and ab but haven't seen any significant overhead from enabling profiling.

I've launched the example with dotnet run -c Release with this branch and then tested various ab configurations. E.g. with ab -c 10 -n 1000 (10 concurrent requests, 1000 total requests), the CPU usage I'm seeing on my PC, with or without profiling, is about 60%, and the average response time is 80 ms without profiling and about 82 ms when profiling 100% of transactions.

A reproducible example would be of great help here, or any more details on what is actually happening. Maybe you can run dotnet-trace manually (with Sentry profiling disabled) and see if that has any effect on the CPU usage?

saschash commented 3 months ago

Thanks @vaind. I will try to reproduce it on some minimum-functionality branch - maybe I will be able to catch it...

haneytron commented 1 month ago

I just ran into this issue as well. Last night I installed the Sentry profiler. With profiling enabled at 0.2 and tracing at 0.2, my app, which normally runs at 5-10% CPU, was redlining at 100% to the point where HTTP requests were timing out and I was getting 502 gateway errors and health check alerts.

I disabled the profiler (keeping only tracing) and things returned to normal. There's definitely something up with the .NET Core profiler.

My app is a web app, .NET Core 8, hosted on Azure, uses standard MVC + some SignalR for live data feeds. Backed by MSSQL.

image

In the above image you can pinpoint where I deployed the profiler, and also where I just turned it off.

Here's my Sentry config:

// Sentry monitoring
builder.WebHost.UseSentry((ctx, sentryConfig) =>
{
    sentryConfig.Dsn = "<REDACTED>";
    // Don't need debug messages
    sentryConfig.Debug = false;
    // Set TracesSampleRate to 1.0 to capture 100% of transactions for performance monitoring. We recommend adjusting this value in production
    sentryConfig.TracesSampleRate = 0.2;
    // Sample rate for profiling, applied on top of the TracesSampleRate e.g. 0.2 means we want to profile 20 % of the captured transactions. We recommend adjusting this value in production.
    //sentryConfig.ProfilesSampleRate = 0.2;
    // Note: By default, the profiler is initialized asynchronously. This can be tuned by passing a desired initialization timeout to the constructor.
    //sentryConfig.AddIntegration(new ProfilingIntegration(
        // During startup, wait up to 500ms to profile the app startup code. This could make launching the app a bit slower so comment it out if you prefer profiling to start asynchronously.
        //TimeSpan.FromMilliseconds(500)
    //));
});

bitsandfoxes commented 1 month ago

We've addressed a couple of issues, most notably the bump to PerfView. Could you give 4.7.0 a try and see if it has an effect?

saschash commented 1 month ago

I spent quite some time trying to reproduce it on a minimum environment, but without any success ...

bitsandfoxes commented 1 month ago

without any success

As in: No noticeable improvement?

saschash commented 1 month ago

No, I mean I couldn't reproduce the problem

jamescrosswell commented 1 month ago

I think this was fixed in https://github.com/getsentry/sentry-dotnet/pull/3382 then.

Feel free to re-open if you're still having issues with the 4.7.0 release or later though.

haneytron commented 1 month ago

I'm re-enabling and redeploying now at 0.2 and 0.2. Code:

// Sentry monitoring
builder.WebHost.UseSentry((ctx, sentryConfig) =>
{
    // For Sentry config later on
    var sentryOptions = new SentryOptions();
    ctx.Configuration.GetSection("SentrySettings").Bind(sentryOptions);

    sentryConfig.Dsn = "<REDACTED>";
    // Don't need debug messages
    sentryConfig.Debug = false;
    // Set TracesSampleRate to 1.0 to capture 100% of transactions for performance monitoring. We recommend adjusting this value in production
    sentryConfig.TracesSampleRate = 0.2;
    // Sample rate for profiling, applied on top of the TracesSampleRate e.g. 0.2 means we want to profile 20 % of the captured transactions. We recommend adjusting this value in production.
    sentryConfig.ProfilesSampleRate = 0.2;
    // Note: By default, the profiler is initialized asynchronously. This can be tuned by passing a desired initialization timeout to the constructor.
    sentryConfig.AddIntegration(new ProfilingIntegration(
        // During startup, wait up to 500ms to profile the app startup code. This could make launching the app a bit slower so comment it out if you prefer profiling to start asynchronously.
        TimeSpan.FromMilliseconds(500)
    ));
});

I'll report back in a few hours on the results.

haneytron commented 1 month ago

Appears to still be an issue with 4.7.0 - turning it back off for now.

image

jamescrosswell commented 1 month ago

@haneytron is there any way you could help us reproduce this? Running things locally with 4.7.0, the memory usage with profiling enabled has been stable and minimal.

haneytron commented 1 month ago

@jamescrosswell I can certainly do my best. The app in question is a production app that I can't really experiment with. But I can probably temporarily spin up a second slot that isn't registered to our primary domain and we can toy with that.

Intuitively, I think a few areas that are possibly causing the issue are the SignalR usage and the HostedService background workers.

I could try disabling each one and see what that does to performance, as a starting point. This breaks the functionality of the app, but of course we wouldn't care about that in a temporary deployment of an instance not linked to prod traffic. Let me know your thoughts.

jamescrosswell commented 1 month ago

@haneytron that would be great... anything really to try to narrow it down to something we might be able to reproduce.

If it's possible, something else that would probably help is to get a memory dump using something like dotnet-gcdump.

haneytron commented 1 month ago

@jamescrosswell sorry for the delay. I'll be testing this today.

haneytron commented 1 month ago

I've deployed to a dedicated Azure AppService (Linux) slot. I've enabled profiling per the settings above (0.2/0.2) and am now waiting for the app to repro the perf issues. Once that happens I'll start trying variables (first up disabling SignalR usage).

haneytron commented 1 month ago

@jamescrosswell yeah it's spinning out of control. I'm going to let it run for an hour or two until it starts timing out requests, then capture a memory dump. I'm not comfortable providing that in public given this is a private app and repo, under copyright, but I would be happy to supply an authenticated gdrive link if you can provide me an email address.

haneytron commented 1 month ago

So after a few hours of literally 0 traffic, here are the memory and CPU patterns.

Memory:

image

The peaks here are 600MB of working set; my app normally runs sub-200MB. It looks like it eventually GCs.

CPU:

image

The peaks here are 55% CPU or so.

It cycles back to 0 when memory does. That seems to happen almost every 90 minutes on the nose. My app runs at about 5% CPU when I'm not profiling.

I can't get a memory dump because Azure reasons. I will now redeploy with SignalR disabled and see what happens.

haneytron commented 1 month ago

Disabling SignalR didn't change things. I'll try disabling the HostedService classes next.

image

Those drops in CPU happen every 45 minutes almost on the dot. Probably an Azure hosting thing, but I'm mentioning it in case there's some "45 minute" timespan in the Sentry code somewhere and this gives a clue.

haneytron commented 1 month ago

No change disabling all HostedService instances either. I'm going to disable profiling and see what the baseline CPU is.

bitsandfoxes commented 1 month ago

Thanks for the continued testing and updates, @haneytron.

haneytron commented 1 month ago

So removing the profiling drops CPU back to normal. I think the best thing I can try next is just to deploy a very basic app (literally nothing but a hello world in MVC) to the same slot and see what happens.

image

haneytron commented 1 month ago

I've hacked my app down to literally the bare bones, a single route that just shows a 404 page, and the issues persist. Here are some interesting graphs from today:

Graphs

CPU - you can see pretty clearly where I first fired it up, and also where I disabled profiling, then later re-enabled it.

image

Memory - about double the usual working set when profiling is active.

image

Gen 0, 1, 2 GC are not exciting or out of the ordinary.

image

Server Errors: Ignore the initial spike, I had things configured incorrectly. 0 errors all day.

image

Total (HTTP) requests - again not exciting. This isn't a prod site instance; the only traffic is me randomly playing around.

image

Thread Count: not exciting, fairly stable.

image

Average Response Time - VERY exciting!

Consistently low with profiling off, 30+ second bursts when on.

image

What's Next

Two things next.

1. I'm gonna enable it on my prod site again, right now.

I'll leave it until it explodes or 24 hours go by, and we'll see whether things are working properly OR the issue actually remains and these graphs from the Azure Portal are garbage. Honestly, they might be; perf counters are funky at best in my experience.

I'll report back tomorrow with the results!

2. You can try replicating my setup

You could also try to replicate my app and hosting env:

.NET Core 8 (MVC Web App)
Azure hosted, App Service, Linux, B1 Basic plan (~$13 USD/month), US East 2 region
For this demo I manually published, no CI/CD

Here is the entirety of my current Program.cs:

using CodeSession;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Sentry.Profiling;
using System;

var builder = WebApplication.CreateBuilder(args);

var isDev = builder.Environment.IsDevelopment();

builder.Host.ConfigureServices((ctx, services) =>
{
    // Enable MVC
    services.AddControllersWithViews(options =>
    {
        options.SuppressAsyncSuffixInActionNames = true;
        options.EnableEndpointRouting = true;
    });
    if (!isDev)
    {
        services.AddHttpsRedirection(options =>
        {
            // Prevents an error that occurs in the TRACE log and reports to Azure infrequently, seemingly at app startup
            options.HttpsPort = 443;
        });
    }
});

// Sentry monitoring
builder.WebHost.UseSentry((ctx, sentryConfig) =>
{
    // For Sentry config later on
    var sentryOptions = new SentryOptions();
    ctx.Configuration.GetSection("SentrySettings").Bind(sentryOptions);

    sentryConfig.Dsn = sentryOptions.Dsn;
    // Don't need debug messages
    sentryConfig.Debug = false;
    // Set TracesSampleRate to 1.0 to capture 100% of transactions for performance monitoring. We recommend adjusting this value in production
    sentryConfig.TracesSampleRate = 0.2;
    // Sample rate for profiling, applied on top of the TracesSampleRate e.g. 0.2 means we want to profile 20 % of the captured transactions. We recommend adjusting this value in production.
    sentryConfig.ProfilesSampleRate = 0.2;
    // Note: By default, the profiler is initialized asynchronously. This can be tuned by passing a desired initialization timeout to the constructor.
    sentryConfig.AddIntegration(new ProfilingIntegration(
        // During startup, wait up to 500ms to profile the app startup code. This could make launching the app a bit slower so comment it out if you prefer profiling to start asynchronously.
        TimeSpan.FromMilliseconds(500)
    ));
});

// Build it
var app = builder.Build();

// Redirect from HTTP to HTTPS
app.UseHttpsRedirection();

app.UseRouting();

// Default to 404
app.MapFallbackToController(action: "NotFound", controller: "Error");

// Run it
await app.RunAsync();

haneytron commented 1 month ago

Welp, bad news this AM. Multiple 502 and 503 errors (with NOTHING logged to console, app logs, or HTTP logs in Azure in terms of details), and multiple app restarts (as a result of my health checks I think) as well.

500 errors, starting when I deployed last night just after my last update:

image

Here are GSheets of the Console and HTTP logs for the last 24 hours, limited to the last 1000 logs, with all PII scrubbed:

https://docs.google.com/spreadsheets/d/1dqrHKF8DrBACtgoaQYxb7H59-8FO7SQhdYfVHogaZNI/edit#gid=940056293 https://docs.google.com/spreadsheets/d/1Wb9ZyWwiMUUplf6vOyNb37s7V6W5Q-Lvcj1YIErP2ek/edit#gid=869357884

Note that all of the 503, 502, and 499 errors happen on the SignalR endpoints (/session/live/* routes), so my best guess is that SignalR / WebSocket usage is an issue with the profiler.

Let me know how I can be of further help. Disabling profiling for now.

vaind commented 1 month ago

Thx for all the updates. Any chance you could try running the dotnet-trace profiler (with Sentry's built-in profiling disabled at the time) on your app and see how that influences CPU usage? https://learn.microsoft.com/en-us/dotnet/core/diagnostics/dotnet-trace#collect-a-trace-with-dotnet-trace

haneytron commented 1 month ago

@vaind ran a trace for almost 3 hours, ending just now:

root@b4ee333c5f1c:~/site/wwwroot# dotnet-trace collect --profile cpu-sampling -p 106 -o /dev/null

Provider Name                           Keywords            Level               Enabled By
Microsoft-DotNETCore-SampleProfiler     0x0000F00000000000  Informational(4)    --profile 
Microsoft-Windows-DotNETRuntime         0x00000014C14FCCBD  Informational(4)    --profile 

Process        : /usr/share/dotnet/dotnet
Output File    : /dev/null
[00:02:50:37]   Recording trace 0.00     (B)
Press <Enter> or <Ctrl+C> to exit...     (B)
Stopping the trace. This may take several minutes depending on the application being traced.

Trace completed.

(output to /dev/null to avoid file storage issues and I/O penalties)

The resulting CPU usage was high, but this was also profiling at 100%:

image

Response times remained good (the longer ones are incorrect measurements on long-lived WebSocket SignalR connections):

image

No 5xx or 499 errors were thrown during this period:

image

Average memory working set was constant with the profiler active (no 2x jump like with the Sentry profiler):

image

Console / App / HTTP logs are clean. No app service restarts since I turned off the Sentry profiler yesterday afternoon.

I suspect SignalR / WebSockets issues w/ the Sentry profiler remain.

vaind commented 1 month ago

I've identified at least one memory leak in PerfView's EventPipe processing, where it calls GC.KeepAlive() on objects so they never get freed. However, I can still see the memory growing, although not on the managed side, so there may be something else going on as well.

vaind commented 3 weeks ago

TLDR: not resolved yet

I'll post a PR decreasing the circular buffer we use when setting up a diagnostic session, but other than that, the main problem seems to be with object pinning in PerfView, and changing that would require changing quite a few things. In short, EventPipeSource doesn't copy the incoming buffer but instead maps structs (EventPipeEventHeader) directly onto the buffer (unsafe, but OK if pinned). To amplify the problem, parts of these structs are then referenced directly (as pointers), and it's not clear to me whether objects containing these copied pointers would live outside the event handlers (it seems like they would). I'm not sure about the scope of the changes needed to make this work without pinning (or without a pinning-handle leak while still staying safe), but I have a feeling it would require a lot of effort.
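
To illustrate the pattern being described, here is a simplified, hypothetical sketch only - not PerfView's actual code. The type and method names below are made up, and it needs AllowUnsafeBlocks to compile:

using System;
using System.Runtime.InteropServices;

// Stand-in for something like EventPipeEventHeader.
struct EventHeader
{
    public int EventId;
    public long Timestamp;
}

static class PinnedOverlaySketch
{
    // Maps a struct directly onto a pinned byte buffer instead of copying it out.
    public static unsafe void Process(byte[] buffer, Action<IntPtr> handler)
    {
        // The buffer must be pinned for the overlaid struct pointer to stay valid.
        var pin = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        try
        {
            var header = (EventHeader*)pin.AddrOfPinnedObject();

            // If the handler stores this pointer anywhere that outlives the call,
            // the buffer has to stay pinned (and alive) indefinitely -- which is
            // how a pinning-handle leak turns into unbounded memory growth.
            handler((IntPtr)header);
        }
        finally
        {
            // Freeing the handle here invalidates any pointer the handler kept.
            pin.Free();
        }
    }
}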