getsentry / sentry-dotnet

Sentry SDK for .NET
https://docs.sentry.io/platforms/dotnet
MIT License
584 stars 206 forks source link

SentryHttpMessageHandler issue grouping is too eager #3582

Open ericsampson opened 1 month ago

ericsampson commented 1 month ago

Problem Statement

When using the SentryHttpMessageHandler, errors from all called URLs get lumped into one Issue. That's not great as the grouping is too eager; it's not very helpful to bundle HTTP errors from 25 different microservices and 15 external SaaS providers into a single Issue.

Solution Brainstorm

I'd love if the grouping could use an auto-parameterized version of the called URL as a grouping key.

jamescrosswell commented 1 month ago

I'd love if the grouping could use an auto-parameterized version of the called URL as a grouping key.

Hey @ericsampson ,

Could you give an example of some URLs that are being grouped together and how these would ideally be grouped instead?

bitsandfoxes commented 1 month ago

We're always looking at ways to improve the out-of-the-box experience. So some examples would be hugely appreciated.

As one way to unblock yourself, you could look into setting your own custom fingerprint for events.

ericsampson commented 1 month ago

@bitsandfoxes re the custom fingerprint option;

ericsampson commented 1 month ago

@jamescrosswell unless I'm misreading something in the dashboard, it currently appears that all called URLs are grouped into one single Issue. Because the default grouping that Sentry is choosing for these events is just the stack trace, which is identical regardless of the called URLs.

In my ideal world, each of these lines would be captured as a separate Issue:


GET foo.com/blah
GET bar.org/blah
POST bar.org/blah
// The handler should autodetect that the following called URL pattern is `foo.com/org/{id}/division/{id}`
// via tokenizing the called URL for patterns of `/{number}/?' and `/{UUID}/?` and '/{bool}/?`
// This isn't 100% perfect because it won't catch textual URL path params, but should cover 95% of cases seen in the real world.
GET foo.com/org/1/division/1, foo.com/org/2/division/1, foo.com/org/2/division/2 
jamescrosswell commented 1 month ago
  • how would I correctly use this with the SentryHttpMessageHandler.cs, i.e. what SentryEvent property should I test against to know in a non-fragile manner that the event source was the SentryHttpMessageHandler ? It sets HttpClientOrigin, but that's an internal prop currently.

@ericsampson the HttpFailedRequestHandler sets an exception mechanism, so you can do something like this:

    options.SetBeforeSend((evt, hint) =>
    {
        if (evt.SentryExceptions?.Any(x => x.Mechanism is { Type: "SentryHttpFailedRequestHandler" }) is true)
        {
            if (evt.Request.Url != null)
            {
                evt.Fingerprint = [ evt.Request.Url ];
            }
        }
    });

You can get a bunch of other information via the Hint as well (see docs).

jamescrosswell commented 1 month ago

@jamescrosswell unless I'm misreading something in the dashboard, it currently appears that all called URLs are grouped into one single Issue. Because the default grouping that Sentry is choosing for these events is just the stack trace, which is identical regardless of the called URLs.

I see... I guess depending on the HttpStatusCode it may or may not make sense to group by Uri. So it might be preferable to group all the 429 Too Many Requests responses together (same for 503 Service Unavailable) but something like 403 Forbidden you probably want to group by URI (since the specific location is relevant to the error).

Various other things like the request content and headers could also be relevant.

I wonder if there is some "default behaviour" Sentry could provide, out of the box, that would provide useful fingerprinting in at least the majority of situations (without being able to know anything specific about the various use cases)... 🤔

ericsampson commented 1 month ago

@jamescrosswell I don't think there should be any status-code-dependent grouping for the default behavior. It should group by method + URL

There's no way to universally classify any HTTP code as "entire server" vs "this specific HTTP call" IMO. For example it's very possible for 429 to be per-endpoint rate limits, and e.g 503 only means that the server is not unavailable to respond to that specific request--it could totally be possible for it to respond successfully to a different URL depending on what's going on behind the scenes. And the same argument can be made for any other 500 or 400 code that I can think of; HTTP responses are resource-targeted, and so you can't generically classify any HTTP codes as "per resource" vs "entire server" Anything that involves more detail about specific codes would be only relevant to a given implementation, and so the SDK consumer would have to override the default behavior with their system knowledge, if desired

jamescrosswell commented 1 month ago

I don't think there should be any status-code-dependent grouping for the default behavior. It should group by method + URL

Yeah, reading through this again I think I agree with you... how HTTP Status codes are used is going to be specific to the server application.

I'm wondering how we could auto-parameterize URL though. For incoming requests, we can use Route data such as Endpoints etc. but in the case of outgoing HttpClient requests there's less structured information available to infer which parts of the URL are parameters and which are not (or whether a parameter should be included or scrubbed for the purpose of grouping).

ericsampson commented 1 month ago

I'm wondering how we could auto-parameterize URL though

For sure it's not going to be 100% perfect, but I think that this is one of those cases where "good enough is good enough" and "don't let perfect be the enemy of good". Start with a proposal like the following and then can adjust as user feedback is gained.

Here's what I would suggest:

This isn't perfect because it won't catch textual URL path params, but should cover 95% of cases seen in the real world data 100% fabricated ;)

jamescrosswell commented 3 weeks ago

When using the SentryHttpMessageHandler, errors from all called URLs get lumped into one Issue. That's not great as the grouping is too eager; it's not very helpful to bundle HTTP errors from 25 different microservices and 15 external SaaS providers into a single Issue.

btw @ericsampson how is it that you end up with the same stack trace for calls to lots of different external URLs? Are you looking up the actual URL to be called dynamically (rather than specifying this in code) prior to making the Http request?

I'm mainly asking because I'd like to create a simple way to reproduce this problem.

ericsampson commented 2 weeks ago

@jamescrosswell in our monolith, we have created tons of different external HTTP clients that all derive from a single base class HttpClientBase.cs that exposes methods like public async Task<Result<T>> GetAsync<T>(string url, ...), and the derived classes construct the required urls dynamically call these methods. That seems very normal to me.

The HttpClientBase.cs holds injected Microsoft HttpClient with an inner SentryHttpMessageHandler.

For every http call made by the HttClientBase that results in the SentryHttpFailedRequestHandler > response.EnsureSuccessStatusCode() throwing, the resulting Sentry events ALL get reported as a single sentry Issue with the following exact same grouping details as shown in the attached screenshot. Which is problematic for us, and the motivation for this ticket :)

We have not added any sentry dashboard custom Issue Grouping rules, so what we are seeing is stock Sentry behavior.

Does that help? Thanks!!

Image

jamescrosswell commented 2 weeks ago

OK thanks @ericsampson. I think we can put together some code that will ensure the same stack trace when calling multiple different URLs, to simulate your scenario... From there it should be fairly simple to extend the Sentry SDK so that it has some callbacks or something on the options that you could use to customise the fingerprint for those outgoing URLs (and some default implementation of this that we provide out of the box).

ericsampson commented 2 weeks ago

That sounds awesome, thanks James!

bruno-garcia commented 2 weeks ago

@ericsampson is there a link to an issue in Sentry you you share? I wonder if perhaps the frames InApp flags are not preperly marked. Are the stack traces not including all frames leading up to the call site? Where your code started the HTTP request?

If not, I wonder if the stack trace is losing frames somehow.