getsentry / sentry-dotnet

Sentry SDK for .NET
https://docs.sentry.io/platforms/dotnet
MIT License

Some errors not captured if out of performance units #3283

Open bruno-garcia opened 1 month ago

bruno-garcia commented 1 month ago

A customer on ASP.NET Core said that when they are out of performance quota, some errors are not captured.

Looking at an event, it's going through this middleware:

[screenshot]

That's here:

https://github.com/getsentry/sentry-dotnet/blob/480e50830b6ede50a5227028a55c3e2ee1c61090/src/Sentry.AspNetCore/SentryTracingMiddleware.cs#L139

Since we only capture the exception further down, I wonder if we're possibly throwing an exception somewhere here:

https://github.com/getsentry/sentry-dotnet/blob/480e50830b6ede50a5227028a55c3e2ee1c61090/src/Sentry.AspNetCore/SentryTracingMiddleware.cs#L149-L210

And not reaching this:

https://github.com/getsentry/sentry-dotnet/blob/480e50830b6ede50a5227028a55c3e2ee1c61090/src/Sentry.AspNetCore/SentryTracingMiddleware.cs#L213-L216

Otherwise I'm not sure why such an error would not be captured, unless they are not capturing transactions (or are out of quota).

The exceptions are captured by SentryMiddleware.UnhandledException, so I might be looking at it wrong, as SentryMiddleware was hidden:

[screenshot]

https://github.com/getsentry/sentry-dotnet/blob/480e50830b6ede50a5227028a55c3e2ee1c61090/src/Sentry.AspNetCore/SentryMiddleware.cs#L164

akonyer commented 1 month ago

I am the one who is experiencing this issue. It seems very strange, but with no changes to the codebase at all, when we run out of transaction quota on our performance metrics, errors stop coming through. The interesting thing is that it's not ALL errors - just some. Some still come through, but at a massively reduced rate. It's like we're being throttled.

We have only used ~5% of our error reserve (15k/300k), but 100% of our performance reserve is used up.

I will get 500 errors reported from the front end (web application) and sent to Sentry when an API crashes with a 500, but the back end does not send the corresponding error to Sentry.

I noticed that when I add more funds to my on-demand budget to allow performance metrics to be captured again, errors come through just fine. I keep having to add to this on-demand reserve every day to get errors to come through, until our month resets in a couple of days.

It does seem like a bug in the C# library that is somehow dropping errors when the performance logs are also dropped, and when the performance logs come through just fine, the error logs do as well.

This is my config for the .net backend apps:

        builder.UseSentry(config =>
        {
            config.Dsn = dsn;
            config.Environment = serverName;
            config.MaxRequestBodySize = RequestSize.Medium;
            config.SendDefaultPii = true;
            config.IncludeActivityData = true;
            config.ServerName = serviceName;
            config.TracesSampleRate = tracesSampleRate;

            config.TracesSampler = tracesSampler;

            if (openTelemetryEnabled)
            {
                config.UseOpenTelemetry();
            }
        });
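(For context, tracesSampler above is a Func<TransactionSamplingContext, double?>. A purely hypothetical sketch of such a delegate - none of these names or rates come from the actual setup - might look like this:)

// Hypothetical example only - illustrates the shape of the tracesSampler delegate.
Func<TransactionSamplingContext, double?> tracesSampler = context =>
{
    // Down-sample noisy endpoints so the performance quota lasts the month;
    // returning null falls back to the configured TracesSampleRate.
    return context.TransactionContext.Name.Contains("health", StringComparison.OrdinalIgnoreCase)
        ? 0.0             // drop health checks entirely
        : (double?)null;  // use TracesSampleRate for everything else
};
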
jamescrosswell commented 1 month ago

@bruno-garcia how/where do quotas affect the client SDKs? Is it done via Rate Limiting in the transport or is there some other mechanism?

bruno-garcia commented 1 month ago

@bruno-garcia how/where do quotas affect the client SDKs? Is it done via Rate Limiting in the transport or is there some other mechanism?

yeah only through X-Sentry-Rate-Limit, afaik.

@akonyer by any chance are you using openTelemetryEnabled? Wondering if that can help narrow things down.

Also, are the errors just application related? Or do they seem unrelated to your code?

akonyer commented 1 month ago

No, we currently have Open Telemetry flagged off.

The errors are in application code, not framework level or anything. Just crashes/exceptions in our code.

jamescrosswell commented 1 month ago

The interesting thing is that it's not ALL errors - just some. Some still come through, but at a massively reduced rate. It's like we're being throttled.

This seems like it could be an important clue... if it was possible to spot some pattern in those that are being dropped vs those that are not. So do the errors that you do see coming through have anything in common?

Also, this may not be possible in the environment where you're experiencing this, but if exceptions are being dropped, that would be logged by the Sentry SDK (but only if you had configured SentryOptions.Debug = true): https://github.com/getsentry/sentry-dotnet/blob/bc68b8e60233e9987df8b03759e00720c28a52ae/src/Sentry/Http/HttpTransportBase.cs#L101-L109

The presence or absence of such log messages might indicate a very different cause for the problem (whether the problem lies in the .NET SDK or in the Sentry server).
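
As a rough sketch, enabling that debug output with the configuration posted above would look something like this (Debug and DiagnosticLevel are existing SentryOptions properties; everything else is unchanged from the earlier snippet):

builder.UseSentry(config =>
{
    // ... existing options from the snippet above ...
    config.Debug = true;                        // log what the SDK does with each envelope
    config.DiagnosticLevel = SentryLevel.Debug; // include the most verbose messages
});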

jamescrosswell commented 1 month ago

Also, this may not be possible in the environment where you're experiencing this, but if exceptions are being dropped, that would be logged by the Sentry SDK (but only if you had configured SentryOptions.Debug = true)

Oh, in fact... we should be able to see that in the Client Reports right? @bruno-garcia / @bitsandfoxes are those reports available in the Sentry UI somewhere? What happens to them after they're sent to Sentry?

akonyer commented 1 month ago

Unfortunately I did not have debug turned on in production. And now our month has ended and we've reset our profiling quota (which is great, things are working perfectly again). However, that means I can no longer provide any new information, as I can no longer reproduce the issue.

When they were being dropped this was the messaging I was seeing sometimes:

[17:18:13 INF] Capturing event.
[17:18:13 WRN] Failed to end session because there is none active.
[17:18:13 INF] Envelope queued up: 'dfd0642e2f504e329faf4362653d4728'
[17:18:13 ERR] HttpTransport: Sentry rejected the envelope 'dfd0642e2f504e329faf4362653d4728'. Status code: TooManyRequests. Error detail: .

And this error here makes me think we were hammering the API too frequently, but I don't see how that is possible considering we had used less than 10% of our errors quota for the month, and errors were not even coming through. We also have spike detection turned on, so you would think that if we were hitting the API too frequently and getting rate-limited we would be seeing a lot of errors as well? Which we definitely did not see. Looking now, we had 11,500/300k errors for the month, and that was pretty evenly distributed across the days in the month. Nowhere near enough to be getting rate-limited by the API, I don't think.

It seems like as soon as our Performance quota was reached ALL of our API requests were being rate limited.

jamescrosswell commented 4 weeks ago

[17:18:13 ERR] HttpTransport: Sentry rejected the envelope 'dfd0642e2f504e329faf4362653d4728'. Status code: TooManyRequests. Error detail: .

OK. 429 Too Many Requests is the HTTP response code that Sentry uses for rate limits. Without knowing the contents of the envelope though, it's hard to say whether that was a Trace event or a Crash Report that was dropped.

If you could set options.Debug = true when initializing the Sentry SDK, then the contents of the envelopes would be logged when they get dropped:

https://github.com/getsentry/sentry-dotnet/blob/59f31642c10f8c8159d8294e0a48536b14ad4cd0/src/Sentry/Http/HttpTransportBase.cs#L361-L362

If the logs showed a 429 followed by an envelope containing an exception being dropped, that would indicate an issue with how rate limits are being handled for your account.

akonyer commented 3 weeks ago

I won't be able to do that until our performance metrics run out at the end of the month. And I'm hoping we've balanced the sampling so that we make it to the end of the month. If I notice this again I will turn on debug for you guys.

jamescrosswell commented 3 weeks ago

@bitsandfoxes do you think it's possibly related to this as well?

Once you've run out of performance units I think the tracing envelopes shouldn't even end up in the queue and so this would no longer play a role. However while you've still got performance units left, if you see a spike in volumes, random envelopes would be discarded (which may include crash reports).

@akonyer if this is relevant, you might be able to alleviate some of your problems by increasing SentryOptions.MaxQueueItems (e.g. set this to 100 or 200).
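
For example, a minimal sketch of that change on top of the configuration posted earlier (MaxQueueItems is an existing SentryOptions property; the value here is just an illustration):

builder.UseSentry(config =>
{
    // ... existing options ...
    config.MaxQueueItems = 200; // a larger queue makes it less likely that envelopes
                                // (including crash reports) are discarded during spikes
});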

bitsandfoxes commented 3 weeks ago

Once you've run out of performance units I think the tracing envelopes shouldn't even end up in the queue and so this would no longer play a role.

As far as I can tell the rate-limiting kicks in only when an event/transaction (as an envelope) already reaches the HTTP transport sitting underneath the CachingTransport. So that behaviour should not change.

But I agree. A high volume of transactions drowning out errors should not happen in any case. We'll have to try reproducing this locally.

jamescrosswell commented 2 weeks ago

I think I've managed to reproduce this. I created the following simple console program:

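// Note: this assumes SentrySdk.Init has already been called with a DSN and a traces
// sample rate of 1.0 - that initialization is not shown in this snippet.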
var cts = new CancellationTokenSource();
var task = new Task(DoWork, cts.Token);
task.Start();

Console.WriteLine("Hello, World!");
Console.WriteLine("Press ESC to cancel...");
while (!task.IsCompleted)
{
    var keyInput = Console.ReadKey(true);
    if (keyInput.Key == ConsoleKey.Escape)
    {
        cts.Cancel();
    }
}
Console.WriteLine("Goodbye cruel world...");
return;

void DoWork()
{
    while (!cts.Token.IsCancellationRequested)
    {
        var transaction = SentrySdk.StartTransaction("Background", "loop");
        try
        {
            if (Random.Shared.Next(0, 500) != 0)
            {
                Thread.Sleep(100);
                continue;
            }

            throw new Exception("Test exception");
        }
        catch (Exception e)
        {
            SentrySdk.CaptureException(e);
        }
        finally
        {
            transaction.Finish();
        }

        SentrySdk.Flush();
    }
}

Then I let it burn for a bit on my Sentry account. You can see at about 10am I ran out of performance units on my quota:

[screenshot]

And then this is what happened to error reports:

[screenshot]

You can see I was getting 1-2 errors per minute while I still had quota, and then as soon as I ran out of quota, nada for 17 minutes... and that was when I checked back in and stuck a breakpoint in the code to see what was going on.

The first error that I followed in my debugger was actually sent without any issue, which left me scratching my head (Heisenberg problem).

I then changed the code so that the errors would occur more frequently (just to make it easier to get to the breakpoint I wanted to trigger):

            if (Random.Shared.Next(0, 10) != 0)

After that I could follow some errors that were being dropped by Sentry. This is what I'm getting back from the server in the HttpResponse from here: https://github.com/getsentry/sentry-dotnet/blob/a9e8acd030efb232bb08ec3745c9ef9a479fca8d/src/Sentry/Internal/Http/HttpTransport.cs#L42

StatusCode: 429, ReasonPhrase: 'Too Many Requests', Version: 1.1, Content: System.Net.Http.EmptyContent, Headers:
{
  Content-Length: 0
}, Trailing Headers:
{
}

It's a generic 429 without any headers to explain why.

So the errors are being sent by the .NET SDK. It appears they're being rejected by the Sentry server (even though I still have plenty of error quota left on my account).

@bitsandfoxes / @bruno-garcia I think we need to raise this with another team but I'm not sure who to tap on the shoulder there.

akonyer commented 2 weeks ago

Amazing! Glad that you managed to reproduce this issue. I thought I was going crazy, but I KNEW that there were errors being dropped that should have made it through.

It's interesting you guys think it is an issue on the server. I would have guessed it was more likely a client-side issue in the .NET SDK; otherwise I'd expect this to be an issue across other languages as well.

bitsandfoxes commented 1 week ago

I'm not sure if this is better or worse. But I'll forward this to the appropriate team. Suuper appreciate the sleuthing @jamescrosswell!

Dav1dde commented 1 week ago

It would be super helpful to get the list of all headers returned from the webserver and also the content which was sent (potentially an event envelope mixed with a transaction?).

Re headers: there must have been headers returned; without headers it wouldn't be a valid HTTP response.

jamescrosswell commented 1 week ago

It would be super helpful to get the list of all headers returned from the webserver and also the content which was sent (potentially an event envelope mixed with a transaction?).

There are definitely headers returned; however, the .NET SDK doesn't (currently) log these or dump them anywhere, for obvious performance reasons.

If rate limiting headers are present, this will result in client reports for any events being dropped: https://github.com/getsentry/sentry-dotnet/blob/0a8a92aa74eb840d9cc6454022ff140c475f38fa/src/Sentry/Http/HttpTransportBase.cs#L97-L104

If we wanted more information, we could potentially log both the headers and the corresponding rate limit that we were applying internally here: https://github.com/getsentry/sentry-dotnet/blob/0a8a92aa74eb840d9cc6454022ff140c475f38fa/src/Sentry/Http/HttpTransportBase.cs#L280

We'd have to be careful about that as that code gets called very frequently. Certainly we'd only do it if Debug was enabled and I think we'd do it in a branch of the code that we didn't intend to merge into main.

@Dav1dde the other issue is that this is currently quite challenging for me to reproduce. To reproduce it I need to:

  1. Create a new throw away organization and project
  2. Disable spike protection on the project
  3. Change the config on my test code to use the new DSN
  4. Run the test code above for ~15 minutes until performance units are exhausted
  5. Then the problem manifests.

That's not ideal for Sentry (I end up pounding the Sentry servers and storing a bunch of junk). It's not ideal for me as it takes a very long time to reproduce the problem on each debug attempt.

Can you suggest a better way to do this?

jamescrosswell commented 23 hours ago

@Dav1dde my quota was reset so I've generated some more comprehensive logs (including the headers that get returned by Sentry).

The initial rate limiting headers I'm getting back all look good:

  Debug: Envelope handed off to transport (event ID: '75a3027b1552485ea23111f931ca9320'). 1 items in queue.
  Debug: HttpTransport: Received rate limit header: 60:transaction;profile:organization:transaction_usage_exceeded
  Debug: HttpTransport: Rate limit 'transaction' applied. Retry after: 00:01:00.
  Debug: HttpTransport: Received rate limit header: 60:transaction;profile:organization:transaction_usage_exceeded
  Debug: HttpTransport: Rate limit 'profile' applied. Retry after: 00:01:00.
  Error: HttpTransport: Sentry rejected the envelope '75a3027b1552485ea23111f931ca9320'. Status code: TooManyRequests. Error detail: Sentry dropped data due to a quota or internal rate limit being reached. This will not affect your application. See https://docs.sentry.io/product/accounts/quotas/ for more information.. Error causes: .
  Debug: HttpTransport: Failed envelope '75a3027b1552485ea23111f931ca9320' has payload: {"sdk":{"name":"sentry.dotnet","version":"4.4.0"},"event_id":"75a3027b1552485ea23111f931ca9320","trace":{"trace_id":"429aa33b9b5a4b1892b151b3282ca46b","public_key":"d95b9d95c42b4b0a6f3180c612bee116","sampled":"true","sample_rate":"1","release":"Sentry.Samples.Console.Basic@4.4.0\u002B8c78887bc9291b88938ac56d3dd59124668e5c18","environment":"debug","transaction":"Background"},"sent_at":"2024-05-15T01:59:04.258502+00:00"} {"type":"transaction","length":1091} {"type":"transaction","event_id":"75a3027b1552485ea23111f931ca9320","platform":"csharp","release":"Sentry.Samples.Console.Basic@4.4.0\u002B8c78887bc9291b88938ac56d3dd59124668e5c18","transaction":"Background","transaction_info":{"source":"custom"},"start_timestamp":"2024-05-15T01:59:04.10275+00:00","timestamp":"2024-05-15T01:59:04.20383+00:00","request":{},"contexts":{"app":{"type":"app","app_start_time":"2024-05-15T01:58:59.6346108+00:00"},"device":{"type":"device","boot_time":"2024-04-28T23:05:01.6727338+00:00"},"os":{"type":"os","raw_description":"Darwin 23.4.0 Darwin Kernel Version 23.4.0: Fri Mar 15 00:12:49 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T6020"},"runtime":{"type":"runtime","name":".NET","version":"6.0.22","raw_description":".NET 6.0.22","identifier":"osx.14-arm64"},"trace":{"type":"trace","span_id":"c3ebd54a906b3af1","trace_id":"429aa33b9b5a4b1892b151b3282ca46b","op":"loop","status":"ok"}},"user":{"ip_address":"{{auto}}"},"environment":"debug","sdk":{"packages":[{"name":"nuget:sentry.dotnet","version":"4.4.0"}],"name":"sentry.dotnet","version":"4.4.0"}}

Those are the only two rate limit headers received in the session...
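
For reference, each value in that header follows the retry_after:categories:scope:reason_code layout (categories separated by ';'), so the limit above can be unpacked roughly like this - illustrative parsing only, not the SDK's implementation:

var header = "60:transaction;profile:organization:transaction_usage_exceeded";
var parts = header.Split(':');
var retryAfter = TimeSpan.FromSeconds(int.Parse(parts[0])); // 00:01:00
var categories = parts[1].Split(';');                       // "transaction", "profile"
var scope = parts[2];                                       // "organization"
var reason = parts[3];                                      // "transaction_usage_exceeded"

Note that the categories are only transaction and profile - these headers declare no limit on error events.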

However later on the server rejects envelopes containing Error events (with no transaction)... For example, this one:

  Error: HttpTransport: Sentry rejected the envelope 'c41d2b6841504343a0d9c5580ea860cb'. Status code: TooManyRequests. Error detail: .
  Debug: HttpTransport: Failed envelope 'c41d2b6841504343a0d9c5580ea860cb' has payload: {"sdk":{"name":"sentry.dotnet","version":"4.4.0"},"event_id":"c41d2b6841504343a0d9c5580ea860cb","trace":{"trace_id":"a9da27f33b2d4ea7880d1afa5e6012b9","public_key":"d95b9d95c42b4b0a6f3180c612bee116","release":"Sentry.Samples.Console.Basic@4.4.0\u002B8c78887bc9291b88938ac56d3dd59124668e5c18","environment":"debug"},"sent_at":"2024-05-15T01:59:08.301707+00:00"} {"type":"event","length":4904} {"modules":{"System.Private.CoreLib":"6.0.0.0","Sentry.Samples.Console.Basic":"4.4.0.0","System.Runtime":"6.0.0.0","System.Console":"6.0.0.0","Sentry":"4.4.0.0","System.Net.Primitives":"6.0.0.0","System.IO.Compression":"6.0.0.0","System.Text.Json":"6.0.0.0","System.Collections":"6.0.0.0","System.Text.RegularExpressions":"6.0.0.0","System.Reflection.Emit.ILGeneration":"6.0.0.0","System.Reflection.Emit.Lightweight":"6.0.0.0","System.Threading":"6.0.0.0","System.Memory":"6.0.0.0","System.Reflection.Primitives":"6.0.0.0","System.Linq":"6.0.0.0","Microsoft.Win32.Primitives":"6.0.0.0","System.Private.Uri":"6.0.0.0","System.Diagnostics.Process":"6.0.0.0","System.ComponentModel.Primitives":"6.0.0.0","System.Net.Http":"6.0.0.0","System.Collections.Concurrent":"6.0.0.0","System.Runtime.InteropServices.RuntimeInformation":"6.0.0.0","System.Security.Cryptography.Algorithms":"6.0.0.0","System.Security.Cryptography.Primitives":"6.0.0.0","System.ObjectModel":"6.0.0.0","System.Diagnostics.DiagnosticSource":"6.0.0.0","System.Diagnostics.Tracing":"6.0.0.0","System.Net.Security":"6.0.0.0","System.Threading.Thread":"6.0.0.0","System.Security.Cryptography.X509Certificates":"6.0.0.0","System.Text.Encodings.Web":"6.0.0.0","System.Runtime.Intrinsics":"6.0.0.0","System.Numerics.Vectors":"6.0.0.0","System.Runtime.InteropServices":"6.0.0.0","System.Net.Sockets":"6.0.0.0","System.Threading.ThreadPool":"6.0.0.0","System.Net.NameResolution":"6.0.0.0","System.Collections.NonGeneric":"6.0.0.0","System.Security.Cryptography.Encoding":"6.0.0.0","System.Formats.Asn1":"6.0.0.0","System.Runtime.Numerics":"6.0.0.0","System.Text.Encoding.Extensions":"6.0.0.0","System.Runtime.CompilerServices.Unsafe":"6.0.0.0","System.IO.Compression.Brotli":"6.0.0.0","System.Diagnostics.StackTrace":"6.0.0.0","System.Reflection.Metadata":"6.0.0.0","System.Collections.Immutable":"6.0.0.0","System.Linq.Expressions":"6.0.0.0"},"event_id":"c41d2b6841504343a0d9c5580ea860cb","timestamp":"2024-05-15T01:59:08.257605+00:00","platform":"csharp","release":"Sentry.Samples.Console.Basic@4.4.0\u002B8c78887bc9291b88938ac56d3dd59124668e5c18","exception":{"values":[{"type":"System.Exception","value":"Test exception","module":"System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e","thread_id":7,"stacktrace":{"frames":[{"filename":"Program.cs","function":"void Program.\u003CMain\u003E$(string[] args)\u002BDoWork()","lineno":71,"colno":13,"abs_path":"/Users/jamescrosswell/code/sentry-dotnet/samples/Sentry.Samples.Console.Basic/Program.cs","in_app":true,"package":"Sentry.Samples.Console.Basic, Version=4.4.0.0, Culture=neutral, PublicKeyToken=fba2ec45388e2af0","instruction_addr":"0x3e","addr_mode":"rel:0","function_id":"0xc"}]},"mechanism":{"type":"generic","handled":true}}]},"level":"error","request":{},"contexts":{"Current Culture":{"name":"en-NZ","display_name":"English (New Zealand)","calendar":"GregorianCalendar"},"Dynamic Code":{"Compiled":true,"Supported":true},"Memory 
Info":{"allocated_bytes":5589504,"fragmented_bytes":321288,"heap_size_bytes":1804752,"high_memory_load_threshold_bytes":96894462197,"total_available_memory_bytes":103079215104,"memory_load_bytes":82463372083,"total_committed_bytes":4030464,"promoted_bytes":967448,"pinned_objects_count":4,"pause_time_percentage":0.06,"index":2,"finalization_pending_count":197,"compacted":true,"concurrent":false,"pause_durations":[1.205,0]},"ThreadPool Info":{"min_worker_threads":12,"min_completion_port_threads":12,"max_worker_threads":32767,"max_completion_port_threads":1000,"available_worker_threads":32766,"available_completion_port_threads":1000},"app":{"type":"app","app_start_time":"2024-05-15T01:58:59.6346108+00:00"},"device":{"type":"device","timezone":"NZ","timezone_display_name":"(UTC\u002B12:00) New Zealand Time","boot_time":"2024-04-28T23:05:01.6727338+00:00"},"os":{"type":"os","raw_description":"Darwin 23.4.0 Darwin Kernel Version 23.4.0: Fri Mar 15 00:12:49 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T6020"},"runtime":{"type":"runtime","name":".NET","version":"6.0.22","raw_description":".NET 6.0.22","identifier":"osx.14-arm64"},"trace":{"type":"trace","span_id":"4461453c860111c1","trace_id":"a9da27f33b2d4ea7880d1afa5e6012b9"}},"user":{"ip_address":"{{auto}}"},"environment":"debug","sdk":{"packages":[{"name":"nuget:sentry.dotnet","version":"4.4.0"}],"name":"sentry.dotnet","version":"4.4.0"},"debug_meta":{"images":[{"type":"pe_dotnet","debug_id":"3b18ffa3-fe80-4df6-9490-4553d25b5371-fe8707cf","debug_checksum":"SHA256:a3ff183b80fef6ed54904553d25b5371cf07877ee8f38c74259686cbe8376a68","debug_file":"/Users/jamescrosswell/code/sentry-dotnet/samples/Sentry.Samples.Console.Basic/obj/Debug/net6.0/Sentry.Samples.Console.Basic.pdb","code_id":"AE90C9518000","code_file":"/Users/jamescrosswell/code/sentry-dotnet/samples/Sentry.Samples.Console.Basic/bin/Debug/net6.0/Sentry.Samples.Console.Basic.dll"}]}}

The response is obviously a 429 but there are no rate limit headers containing any explanation for why the 429. Spike protection has been disabled for this organisation.

Dav1dde commented 17 hours ago

Thanks for looking into this again. I would really need to see all headers and the full JSON response to pinpoint where this is coming from. Would it be possible to replicate this with a curl script I can run myself?

jamescrosswell commented 4 hours ago

@Dav1dde I could give you a .NET program that demonstrates the issue, and you could use a packet sniffer like Wireshark to record/inspect the traffic to and from Sentry. Would that work?

At first I thought this wouldn't be practical as there's no easy way to have Wireshark decrypt SSL traffic. However I think we could work around that if the .NET program is configured to use the DSN of a Sentry relay (running on HTTP).

I'll see if I can put a GitHub repo together with a docker-compose file for the relay, the .NET project itself, and instructions on how to compile/run everything from the command line.
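
In the meantime, the relevant change in the test program would roughly be a DSN swap - a hypothetical sketch only (the key, host and project id below are placeholders, assuming Relay's default listen port of 3000):

SentrySdk.Init(options =>
{
    // Point the sample at a local Sentry Relay over plain HTTP so the traffic can be
    // captured and inspected; the public key stays the same as the project's real DSN.
    options.Dsn = "http://examplePublicKey@localhost:3000/123";
    options.Debug = true;            // log envelopes and rate limit handling
    options.TracesSampleRate = 1.0;  // keep generating transactions to exhaust the quota
});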