getsentry / sentry-python

The official Python SDK for Sentry.io
https://sentry.io/for/python/
MIT License

Add context to areas in Performance Tracing where manual instrumentation is required #988

Closed blackrobot closed 1 year ago

blackrobot commented 3 years ago

An issue I'm having is that the most poorly performing areas of my application appear as "Missing Instrumentation" on the Sentry performance page.

I understand that I can manually add the code to capture these traces, but I have no idea where to include that code. The trace doesn't give me the method that it came from, the module that it belongs to, or any clues as to where I implement the manual tracing code.
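(For reference, this is the kind of manual span code I mean -- a minimal sketch, where the labels and the wrapped function are placeholders rather than my real code, and the exact keyword arguments depend on the SDK version:)

    import time
    import sentry_sdk

    def build_payload():
        time.sleep(0.5)  # stand-in for the slow, uninstrumented work

    def my_view():
        # Wrapping the suspected-slow block makes it show up as its own span
        # instead of as "missing instrumentation".
        with sentry_sdk.start_span(op="serialize", description="build response payload"):
            return build_payload()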

Additionally, it looks like many of the missing frames are from Django Rest Framework, which I can't easily wrap with manual transaction capturing.

It would be helpful if the missing areas had some sort of indication as to where they came from. It would also be very helpful if commonly used libraries like encode/django-rest-framework had default performance tracing integration out-of-the-box.

[screenshot: Sentry Performance event details showing a "Missing Instrumentation" span]

untitaker commented 3 years ago

Yeah, this is a known issue. There have been many proposals to obtain a stack trace for instrumentation gaps, but the core problem is that this seems hard to do in a performant way, and to capture stack traces only when there actually is a gap.

The remark about Django Rest Framework is noted too.

ricky-sb commented 3 years ago

@untitaker @blackrobot Any thoughts on a workaround? What if we do it in a non-performant way?

apinsard commented 3 years ago

From what I've noticed, "Missing instrumentation" is any gap longer than 100ms between two spans. So it is basically all of your code between the two spans (or between a span and the transaction bounds if the gap is in the first or last position). Of course, within the gap there is likely one part that is more time-consuming than the rest.

Also from what I've noticed, I think this is often due to processing large amounts of data in memory.

For instance, here:

[screenshot, 2021-08-19: Performance - Event Details - cocoonr - Sentry]

This is my code triggering these spans:

        existing_entries_queryset = (
            HomeDailyStats.objects
            .filter(home_id__in=home_ids)
            .filter(date__gte=self.start_date)
            .order_by('home_id')
        )
        if not self.use_booking_date:
            existing_entries_queryset = existing_entries_queryset.filter(
                date__lte=self.end_date,
            )
        existing_entries = {
            h: {x.date: x for x in s}
            for h, s in groupby(existing_entries_queryset, lambda x: x.home_id)
        }

        payout_expressions_queryset = (
            AccommodationPayoutExpression.objects
            .filter(accommodation_id__in=home_ids)
            .order_by('accommodation_id', '-starting_from')
        )
        payout_expressions = {
            h: PayoutExpressions([(x.starting_from, x) for x in p])
            for h, p in groupby(payout_expressions_queryset, lambda x: x.accommodation_id)
        }

The "Missing instrumentation" represents this part (excluding the execution of existing_entries_queryset which is the previous span):

        existing_entries = {
            h: {x.date: x for x in s}
            for h, s in groupby(existing_entries_queryset, lambda x: x.home_id)
        }

So what takes 758ms is processing the groupby plus building the dictionary and its sub-dictionaries. The reason it's so slow is that it has to deal with a large amount of data.
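Wrapping that block in a span of its own makes it show up explicitly instead of as a gap. Roughly like this, reusing the names from the snippet above (the op/description labels are arbitrary, not anything prescribed by Sentry):

    import sentry_sdk

    # Wrap the slow dict construction so it becomes its own span
    # in the waterfall rather than "missing instrumentation".
    with sentry_sdk.start_span(op="function", description="group existing entries by home"):
        existing_entries = {
            h: {x.date: x for x in s}
            for h, s in groupby(existing_entries_queryset, lambda x: x.home_id)
        }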

Taskle commented 2 years ago

We're having this same issue, and we also use Django + Django Rest Framework.

I know some competing platforms do line-by-line profiling out of the box. It's been really convenient to use Sentry for everything, so we'd love to keep using Sentry for performance profiling! Unfortunately, the vast majority of the slowness in our app is in these "missing instrumentation" areas, and adding instrumentation in every place we see one would be significantly more time-consuming than finding a new platform that supports line-by-line profiling out of the box.

We're totally happy to use a non-performant solution that we can explicitly enable while we're trying to identify the problematic areas. Would it be possible to let us enable this temporarily, even in a non-performant way, so we can identify why our Django code is slow in all of these "missing instrumentation" areas?

Alternatively, if anyone has a recommended alternative that they're using to solve this problem in the meantime until Sentry supports it, please share -- it would be greatly helpful for us and anyone else with this problem! :)

Thank you! David

alexgmin commented 2 years ago

This may be of help.

Recently we debugged an issue where some requests spent around 1s in missing instrumentation sections. The slowness was in AWS's SDK, boto3, but the reason it showed up as a missing instrumentation area was that the code was loaded dynamically with import_string from django.utils.module_loading. Adding manual instrumentation to the dynamically loaded code fixed it.
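Roughly, the fix looked like this (a sketch with hypothetical names, not our actual code; only import_string and start_span are real APIs):

    import sentry_sdk
    from django.utils.module_loading import import_string

    def call_with_span(dotted_path, *args, **kwargs):
        """Import a callable by its dotted path and run it inside its own span."""
        func = import_string(dotted_path)
        # The span description carries the dotted path so the gap is identifiable.
        with sentry_sdk.start_span(op="dynamic.call", description=dotted_path):
            return func(*args, **kwargs)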

Taskle commented 2 years ago

Thanks -- in our case it doesn't look like we use any of this. We're going to set up Datadog for now, which has line-by-line profiling out of the box without custom instrumentation, until Sentry can support this. In the meantime, if folks find solutions with Sentry that don't require combing through our app to add custom code for these "missing instrumentation" sections, feel free to share!

github-actions[bot] commented 2 years ago

This issue has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you label it Status: Backlog or Status: In Progress, I will leave it alone ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

drmrbrewer commented 2 years ago

I'm finding this with performance tracing for my node.js (with express) application (set up as per this guide):

[screenshot: Express performance waterfall showing a "missing instrumentation" span]

It's incredibly frustrating, when the only bit that isn't performing as it should is the only bit you know nothing about!

bharling commented 2 years ago

Same here, profiling Django Rest Framework specifically. I have stepped through my code in enormous detail now and added manual instrumentation anywhere I can along the way, but I'm still seeing massive spans with missing instrumentation. All the code / database interactions I can actually see are really performant, but the missing instrumentation spans are disastrously slow :(

danielkhan commented 2 years ago

@drmrbrewer - sorry for the late reply. I am currently looking through cases like what you showed here. Could you figure out what caused the gap here and what was in there? We want to avoid something like this in the future.

ipmb commented 2 years ago

Here's an example we see in a pretty vanilla Django application. Based on the context, I suspect it is a boto3 call to S3 doing a PutObject.

[screenshot: Django transaction waterfall with a long "missing instrumentation" gap]
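For context, manually wrapping that suspected call would look roughly like this (bucket, key, and body are placeholders):

    import boto3
    import sentry_sdk

    s3 = boto3.client("s3")

    # Make the suspected S3 upload visible as its own span.
    with sentry_sdk.start_span(op="http.client", description="S3 PutObject"):
        s3.put_object(Bucket="my-bucket", Key="path/to/object", Body=b"...")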

danielkhan commented 2 years ago

@ipmb thank you. To validate an assumption I have: If you suspect which call this could be, what are the reasons to not instrument it manually?

ipmb commented 2 years ago

If you suspect which call this could be, what are the reasons to not instrument it manually?

It defeats the utility of Sentry. When I follow the directions here, I want things to "just work". Not all the gaps are so obvious and it may take significant time to figure out what code even needs additional instrumentation. If I have to go hunting for the spots that aren't instrumented and add extra code to perform that instrumentation, Sentry is providing a lot less value to me.

An analogy would be if Sentry caught some errors in my app but not all of them, and I had to go hunting through the code to figure out where the missing errors were coming from and then add extra code to catch them. That would make it a lot less useful to me as a general error-tracking service.

Taskle commented 2 years ago

You articulated so eloquently what I failed to articulate above. I wholeheartedly agree!

sl0thentr0py commented 2 years ago

Since this is becoming a general product discussion, I have some thoughts I'd like to articulate as well; I'm mostly just thinking out loud here.

In general, there is a profiling vs tracing dichotomy at play here that has certain consequences.

Profiling implies recording every function call at a certain sample rate. This results in a very detailed understanding of what the end-user program does but is very noisy and lacks semantic understanding of the functions in play.

Tracing (as Sentry currently does it) is very selective in choosing what to instrument, namely, specific parts of the stack (http requests/db calls, etc) in a way that's immediately actionable, not noisy and has some semantic understanding of the code being instrumented (by way of span ops/descriptions). This semantic understanding is there because the SDK knows beforehand what it is instrumenting at any given point; it is an explicit choice to instrument a certain part of the program.

This pickiness obviously results in missing blocks wherever we don't know that we have to instrument something. One way to fix this is to identify and plug these holes in the SDK/framework integrations.

Another way (though probably a terrible idea overhead-wise) is to move slightly closer to the profiling world by collecting more contextual data on every function call and augmenting missing time windows with some of that metadata, so we can at least give pointers for the missing instrumentation. This would have to be done in a way that doesn't add massive overhead, so I'm not sure how feasible it is.
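To illustrate that second idea (very much a sketch, not SDK code): a background thread could periodically sample stacks, and the samples falling inside an instrumentation gap could later be used to annotate it.

    import sys
    import threading
    import time
    import traceback
    from collections import Counter

    def sample_stacks(duration=1.0, interval=0.01):
        """Sample the stacks of all other threads for `duration` seconds."""
        samples = Counter()
        deadline = time.monotonic() + duration
        while time.monotonic() < deadline:
            for thread_id, frame in sys._current_frames().items():
                if thread_id == threading.get_ident():
                    continue  # skip the sampler thread itself
                samples[tuple(traceback.format_stack(frame))] += 1
            time.sleep(interval)
        return samples

The hard part, as noted above, is keeping the overhead acceptable and matching samples to gaps after the fact.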

ipmb commented 2 years ago

What I've seen from another APM is tracing with a very small sample set of requests getting profiled. That lets me identify a problem and then deep-dive into the profile to see exactly what's going on.

In the specific example I posted above, I suspect that not all HTTP requests are getting traced. I don't need to know the specific boto3 function being used, but if I see an HTTP request to s3.amazonaws.com, that's enough to tell me what's going on.

sl0thentr0py commented 2 years ago

Thanks @ipmb, yes, I believe we might be missing HTTP requests if the underlying HTTP lib is not one of the standard ones; will have to check what boto uses.

antonpirker commented 1 year ago

Hi! We now have a profiling product, and it solves this problem. If you enable profiling [1], you will see a link to the profile in the "missing instrumentation" section of your performance waterfall. This will show you which functions are not instrumented. Hope this helps.

1: https://docs.sentry.io/platforms/python/profiling
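For reference, enabling it in the SDK looks roughly like this (the DSN is a placeholder and the option names follow the linked docs; they may differ in older SDK versions):

    import sentry_sdk

    sentry_sdk.init(
        dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
        traces_sample_rate=1.0,     # performance tracing
        profiles_sample_rate=1.0,   # profiling for traced transactions
    )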