Open InterstellarStella opened 4 months ago
cc @vaind
Could we have some details about what the app is doing? With a repro we can look exactly what's going on and make improvements but with the currently provided information it's hard for us to figure out where the problem is.
Is this a web server? How many requests in parallel are running?
If this has high throughput, try changing the sample rates to:
options.TracesSampleRate = 0.001;
options.ProfilesSampleRate = 0.001;
Does that reduce the CPU overhead considerably?
@bruno-garcia I tried setting the sample rates to 0.001 and there is still quite a significant spike (from 7% CPU -> 40% CPU). In fact, it was about the same as with a sampling rate of 1.0, which is weird... The behaviour is that it first jumps to 60% and then stabilises at 40%. It is a web server, but in a sandbox environment, and we are running 20 workers in parallel.
I have mentioned before that we are using Hangfire to queue background jobs, and I have noticed a call like this: Sentry.AspNetCore.SentryTracingMiddleware.InvokeAsync(HttpContext context). But when it is a Hangfire background job, HttpContext will be null, so that might somehow explain what is going on.
I tried setting the sample rates to 0.001 and there is still quite a significant spike (from 7% CPU -> 40% CPU). In fact, it was about the same as with a sampling rate of 1.0, which is weird...
This didn't come up during our tests, so I'm a bit surprised, but I wonder if it might be the overhead of the .NET profiler itself, which unfortunately samples at a high frequency; we downsample after the fact.
I have mentioned before that we are using Hangfire to queue background jobs, and I have noticed a call like this: Sentry.AspNetCore.SentryTracingMiddleware.InvokeAsync(HttpContext context). But when it is a Hangfire background job, HttpContext will be null, so that might somehow explain what is going on.
I wonder if this is to get the transaction name automatically.
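If Hangfire jobs are indeed creating transactions without an HttpContext, one way to test whether they are the source of the overhead is to drop them with a custom sampler. A minimal sketch, assuming the job transactions can be distinguished by operation name (the "http.server" check is an assumption, not verified against this app):

```csharp
// Sketch: sample (and therefore profile) only HTTP transactions.
// The operation-name filter is an assumption; inspect the transactions
// in Sentry to see what the Hangfire ones are actually named.
options.TracesSampler = samplingContext =>
{
    var operation = samplingContext.TransactionContext.Operation;
    return operation == "http.server"
        ? 0.2   // keep sampling real requests
        : 0.0;  // drop background-job transactions entirely
};
```

If the overhead disappears with the job transactions dropped, that would point at the background-job path rather than the HTTP pipeline.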
Do you get the overhead without profiling turned on? Just keep transactions (TracesSampleRate set to some value).
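For reference, the traces-only configuration being suggested is just the tracing half of the setup, with the profiling option left out (the 0.2 value is only an example):

```csharp
// Transactions on, profiler off: isolates the profiler's overhead
options.TracesSampleRate = 0.2;  // any non-zero value will do
// options.ProfilesSampleRate is deliberately not set, so no profiling runs
```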
By any chance, in the profiles in Sentry, do you see any frames from Sentry that seem to be taking up a significant amount of time?
Thanks @bruno-garcia. Without profiling it works like a charm, without any noticeable overhead. I could not find anything from Sentry taking a lot of time; in fact, I couldn't see any Sentry frames at all. Which is weird, right? I should be able to see Sentry-related frames there...
Sorry for the delay here. Any chance you can get a repro to help us debug this? @vaind any ideas on how to investigate further?
I've tried reproducing this locally with our aspnetcore sample and ab, but haven't seen any significant overhead from enabling profiling. I've launched the example with dotnet run -c Release on this branch and then tested various ab configurations. E.g. with ab -c 10 -n 1000 (10 concurrent requests, 1000 total requests), the CPU usage I'm seeing on my PC, with or without profiling, is about 60%, and the average response time is 80 ms without profiling and about 82 ms when profiling 100% of transactions.
A reproducible example would be of great help here, or any more details on what is actually happening. Maybe you can run dotnet-trace manually (with Sentry profiling disabled) and see if that has any effect on the CPU usage?
Thanks @vaind. I will try to reproduce it on a minimal-functionality branch; maybe I will be able to catch it...
I just ran into this issue as well. Last night I installed the Sentry profiler. With profiling enabled at 0.2 and tracing at 0.2, my app, which normally runs at 5-10% CPU, was redlining at 100% to the point where HTTP requests were timing out and I was getting 502 gateway errors and health check alerts.
I disabled the profiler (keeping only tracing) and things returned to normal. There's definitely something up with the .NET Core profiler.
My app is a web app, .NET Core 8, hosted on Azure, uses standard MVC + some SignalR for live data feeds. Backed by MSSQL.
In the above image you can pinpoint where I deployed the profiler, and also where I just turned it off.
Here's my Sentry config:
// Sentry monitoring
builder.WebHost.UseSentry((ctx, sentryConfig) =>
{
    sentryConfig.Dsn = "<REDACTED>";
    // Don't need debug messages
    sentryConfig.Debug = false;
    // Capture 20% of transactions for performance monitoring
    sentryConfig.TracesSampleRate = 0.2;
    // Profiling sample rate, applied on top of TracesSampleRate:
    // 0.2 means we profile 20% of the captured transactions
    //sentryConfig.ProfilesSampleRate = 0.2;
    // By default, the profiler is initialized asynchronously; a timeout can be
    // passed to the constructor to wait for it during startup instead
    //sentryConfig.AddIntegration(new ProfilingIntegration(
    //    // Wait up to 500ms during startup to profile the app startup code
    //    TimeSpan.FromMilliseconds(500)
    //));
});
We've addressed a couple of issues, most notably the bump to perfview. Could you give 4.7.0 a try and see if it has an effect?
I spent quite some time trying to reproduce it on a minimum environment, but without any success ...
without any success
As in: No noticeable improvement?
No, I mean I couldn't reproduce the problem
I think this was fixed in https://github.com/getsentry/sentry-dotnet/pull/3382 then.
Feel free to re-open if you're still having issues with the 4.7.0 release or later though.
I'm re-enabling and redeploying now at 0.2 and 0.2. Code:
// Sentry monitoring
builder.WebHost.UseSentry((ctx, sentryConfig) =>
{
    // For Sentry config later on
    var sentryOptions = new SentryOptions();
    ctx.Configuration.GetSection("SentrySettings").Bind(sentryOptions);
    sentryConfig.Dsn = <REDACTED>;
    // Don't need debug messages
    sentryConfig.Debug = false;
    // Capture 20% of transactions for performance monitoring
    sentryConfig.TracesSampleRate = 0.2;
    // Profiling sample rate, applied on top of TracesSampleRate:
    // 0.2 means we profile 20% of the captured transactions
    sentryConfig.ProfilesSampleRate = 0.2;
    // By default, the profiler is initialized asynchronously; a timeout can be
    // passed to the constructor to wait for it during startup instead
    sentryConfig.AddIntegration(new ProfilingIntegration(
        // During startup, wait up to 500ms to profile the app startup code
        TimeSpan.FromMilliseconds(500)
    ));
});
I'll report back in a few hours on the results.
Appears to still be an issue with 4.7.0 - turning it back off for now.
@haneytron is there any way you could help us reproduce this? Running things locally with 4.7.0, the memory usage with profiling enabled has been stable and minimal.
@jamescrosswell I can certainly do my best. The app in question is a production app that I can't really be experimenting with. But I can probably temporarily spin up a second slot that isn't registered to our primary domain and we can toy with that.
Intuitively, I think the area most likely causing the issue is the BackgroundService tasks that have a lifecycle of the entire app. I could try disabling each one and see what that does to performance, as a starting point. This breaks the functionality of the app, but of course we wouldn't care about that in a temporary deployment of an instance not linked to prod traffic. Let me know your thoughts.
@haneytron that would be great... anything really to try to narrow it down to something we might be able to reproduce.
If it's possible, something else that would probably help is to get a memory dump using something like dotnet-gcdump.
@jamescrosswell sorry for the delay. I'll be testing this today.
I've deployed to a dedicated Azure AppService (Linux) slot. I've enabled profiling per the settings above (0.2/0.2) and am now waiting for the app to repro the perf issues. Once that happens I'll start trying variables (first up disabling SignalR usage).
@jamescrosswell yeah it's spinning out of control. I'm going to let it run for an hour or two until it starts timing out requests, then capture a memory dump. I'm not comfortable providing that in public given this is a private app and repo, under copyright, but I would be happy to supply an authenticated gdrive link if you can provide me an email address.
So after a few hours of literally 0 traffic here's the memory pattern and CPU pattern.
Memory:
The peaks here are 600MB of working set; my app normally runs sub-200MB. Looks like it eventually GCs.
CPU:
The peaks here are 55% CPU or so.
It cycles back to 0 when memory does. That seems to happen almost every 90 minutes on the nose. My app runs at about 5% CPU when I'm not profiling.
I can't get a memory dump because Azure reasons. I will now redeploy with SignalR disabled and see what happens.
Disabling SignalR didn't change things. I'll try disabling the HostedService classes next.
Those drops in CPU are every 45 minutes almost on the dot. Probably an Azure hosting thing but mentioning it in case there's some "45 minute" timespan in the Sentry code somewhere and this gives a clue.
No change disabling all HostedService instances either. I'm going to disable profiling and see what the baseline CPU is.
Thanks for the continued testing and updates, @haneytron.
So removing the profiling drops CPU back to normal. I think the best thing I can try next is just to deploy a very basic app (literally nothing but a hello world in MVC) to the same slot and see what happens.
I've hacked my app down to literally the bare bones, a single route that just shows a 404 page, and the issues persist. Here are some interesting graphs from today:
CPU - you can see pretty clearly where I first fired it up, and also where I disabled profiling, then later re-enabled it.
Memory - about double the usual working set when profiling is active.
Gen 0, 1, 2 GC are not exciting or out of the ordinary.
Server Errors: Ignore the initial spike, I had things configured incorrectly. 0 errors all day.
Total (HTTP) requests - again not exciting. Not a prod site instance, only traffic is me randomly playing.
Thread Count: not exciting, fairly stable.
Consistently low with profiling off, 30+ second bursts when on.
Two things next.
I'll leave it until it explodes or 24 hours go by, and we'll see whether things are working properly OR the issue actually remains and these graphs from the Azure Portal are garbage. Honestly, they might be; perf counters are funky at best in my experience.
I'll report back tomorrow with the results!
You could also try to replicate my app and hosting env:
.NET Core 8 (MVC web app), Azure-hosted App Service, Linux, B1 Basic plan (~$13 USD/month), US East 2 region. For this demo I manually published, no CI/CD.
Here is the entirety of my current Program.cs:
using CodeSession;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Sentry.Profiling;
using System;

var builder = WebApplication.CreateBuilder(args);
var isDev = builder.Environment.IsDevelopment();

builder.Host.ConfigureServices((ctx, services) =>
{
    // Enable MVC
    services.AddControllersWithViews(options =>
    {
        options.SuppressAsyncSuffixInActionNames = true;
        options.EnableEndpointRouting = true;
    });

    if (!isDev)
    {
        services.AddHttpsRedirection(options =>
        {
            // Prevents an error that occurs in the TRACE log and reports to
            // Azure infrequently, seemingly at app startup
            options.HttpsPort = 443;
        });
    }
});

// Sentry monitoring
builder.WebHost.UseSentry((ctx, sentryConfig) =>
{
    // For Sentry config later on
    var sentryOptions = new SentryOptions();
    ctx.Configuration.GetSection("SentrySettings").Bind(sentryOptions);
    sentryConfig.Dsn = sentryOptions.Dsn;
    // Don't need debug messages
    sentryConfig.Debug = false;
    // Capture 20% of transactions for performance monitoring
    sentryConfig.TracesSampleRate = 0.2;
    // Profiling sample rate, applied on top of TracesSampleRate:
    // 0.2 means we profile 20% of the captured transactions
    sentryConfig.ProfilesSampleRate = 0.2;
    // By default, the profiler is initialized asynchronously; a timeout can be
    // passed to the constructor to wait for it during startup instead
    sentryConfig.AddIntegration(new ProfilingIntegration(
        // During startup, wait up to 500ms to profile the app startup code
        TimeSpan.FromMilliseconds(500)
    ));
});

// Build it
var app = builder.Build();

// Redirect from HTTP to HTTPS
app.UseHttpsRedirection();
app.UseRouting();

// Default to 404
app.MapFallbackToController(action: "NotFound", controller: "Error");

// Run it
await app.RunAsync();
Welp, bad news this AM. Multiple 502 and 503 errors (with NOTHING logged to console, app logs, or HTTP logs in Azure in terms of details), and multiple app restarts (as a result of my health checks I think) as well.
500 errors, starting when I deployed last night just after my last update:
Here are GSheets of the Console and HTTP logs for last 24 hours, limited to last 1000 logs, and all PII scrubbed:
https://docs.google.com/spreadsheets/d/1dqrHKF8DrBACtgoaQYxb7H59-8FO7SQhdYfVHogaZNI/edit#gid=940056293 https://docs.google.com/spreadsheets/d/1Wb9ZyWwiMUUplf6vOyNb37s7V6W5Q-Lvcj1YIErP2ek/edit#gid=869357884
Note that all of the 503, 502, and 499s happen on the SignalR endpoints (/session/live/* routes), so my best guess is that SignalR / WebSocket usage is an issue with the profiler.
Let me know how I can be of further help. Disabling profiling for now.
Thanks for all the updates. Any chance you could try running the dotnet-trace profiler (with Sentry's built-in profiling disabled at the time) on your app and see how that influences CPU usage? https://learn.microsoft.com/en-us/dotnet/core/diagnostics/dotnet-trace#collect-a-trace-with-dotnet-trace
@vaind ran a trace for almost 3 hours, ending just now:
root@b4ee333c5f1c:~/site/wwwroot# dotnet-trace collect --profile cpu-sampling -p 106 -o /dev/null
Provider Name Keywords Level Enabled By
Microsoft-DotNETCore-SampleProfiler 0x0000F00000000000 Informational(4) --profile
Microsoft-Windows-DotNETRuntime 0x00000014C14FCCBD Informational(4) --profile
Process : /usr/share/dotnet/dotnet
Output File : /dev/null
[00:02:50:37] Recording trace 0.00 (B)
Press <Enter> or <Ctrl+C> to exit... (B)
Stopping the trace. This may take several minutes depending on the application being traced.
Trace completed.
(Output to /dev/null to avoid file storage issues and I/O penalties.)
The resulting CPU usage was high, but this was also profiling 100% of the time:
Response times remained good (the longer ones are mismeasurements of long-lived SignalR WebSocket connections):
No 5xx or 499 thrown during this period:
Average memory working set was constant with dotnet-trace active (no 2x jump like with the Sentry profiler):
Console / App / HTTP logs are clean. No app service restarts since I turned off the Sentry profiler yesterday afternoon.
I suspect SignalR / WebSockets issues w/ the Sentry profiler remain.
I've identified at least one memory leak in perfview's EventPipe processing, where it calls GC.KeepAlive() on objects so they never get freed. However, I can still see the memory growing, although not on the managed side, so there may be something else going on as well.
TLDR: not resolved yet
I'll post a PR decreasing the circular buffer we use when setting up a diagnostic session, but other than that, the main problem seems to be with object pinning in perfview, and changing that would require changing quite a few things. In short, EventPipeEventSource doesn't copy the incoming buffer but instead maps structs (EventPipeEventHeader) directly onto the buffer (unsafe, but OK if pinned). To amplify the problem, parts of these structs are then referenced directly (as pointers), and it's not clear to me whether objects containing these copied pointers would live outside the event handlers (it seems like they would). I'm not sure about the scope of the changes needed to make this work without pinning (or without a pinning-handle leak, while still staying safe), but I have a feeling it would require a lot of effort.
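As an illustration of the pattern being described (a simplified sketch, not perfview's actual code; the struct fields are invented, and it needs AllowUnsafeBlocks to compile): a struct is overlaid directly onto a pinned buffer instead of being copied out, so any pointer derived from it that escapes the handler either keeps the pinning handle alive or dangles once the handle is freed.

```csharp
using System;
using System.Runtime.InteropServices;

// Stand-in for perfview's EventPipeEventHeader; field layout is illustrative only.
[StructLayout(LayoutKind.Sequential)]
struct HeaderLike
{
    public int ThreadId;
    public long TimestampTicks;
}

static class PinningSketch
{
    public static unsafe void Demo()
    {
        byte[] buffer = new byte[64]; // stands in for an incoming EventPipe buffer
        // Pinning stops the GC from moving the buffer while raw pointers refer to it.
        GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        try
        {
            // Overlay the struct on the raw bytes: no copy is made.
            HeaderLike* header = (HeaderLike*)handle.AddrOfPinnedObject();
            Console.WriteLine(header->ThreadId);
            // If `header` (or a pointer derived from it) is stored somewhere that
            // outlives this scope, the handle can never safely be freed; keeping
            // handles alive forever is exactly the kind of leak described above.
        }
        finally
        {
            handle.Free(); // after this, any escaped pointer dangles
        }
    }
}
```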
Package: Sentry.AspNetCore
.NET Flavor: .NET
.NET Version: .NET 7 / .NET 8
OS: Any (not platform specific)
SDK Version: 4.1.2
Self-Hosted Sentry Version: No response
Steps to Reproduce
Enable profiling through the following setup:
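A minimal sketch of that setup, along the lines of the configuration shared elsewhere in this thread (the DSN and sample rates here are placeholders):

```csharp
builder.WebHost.UseSentry(options =>
{
    options.Dsn = "<your DSN>";
    // Capture transactions so there is something to profile
    options.TracesSampleRate = 1.0;
    // Profile 100% of the captured transactions
    options.ProfilesSampleRate = 1.0;
    // Requires the Sentry.Profiling package
    options.AddIntegration(new ProfilingIntegration(
        TimeSpan.FromMilliseconds(500)));
});
```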
Expected Result
The CPU overhead is about +5 to 10%.
Actual Result
The CPU overhead is around +70%, as seen through the Google Cloud monitoring system. The following Sentry packages are used: Sentry 4.1.2, Sentry.AspNetCore 4.1.2, Sentry.Profiling 4.1.2.
Slack conversation Zendesk ticket