getsentry / sentry

Investigate the performance transactions of a user, given an identifying tag #55706

Open aellett opened 1 year ago

aellett commented 1 year ago

Problem Statement

We occasionally receive a report that “Executive E saw odd behavior in the app” and the request is to understand why.

We’ve got an assortment of manually instrumented performance transactions that should capture some or all of the user’s journey. However, we find that we’re not able to reliably find any transactions associated with that user even though we have 1) a unique user identifier in the form of a tag set on the scope of that session and 2) the sampling rate set at 100%.

We're looking for a way to dig into a single user's performance transactions like this. It seems like this is the result of dynamic sampling, which works great for looking at data at scale, but doesn't seem to meet our needs for this type of investigation. Is there some other way to use the tool that would be better?
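To illustrate why a 100% client-side sampling rate might still leave gaps, here is a minimal sketch of two independent sampling stages. It is a toy model, not Sentry's actual pipeline: the function names, the `server_rate` parameter, and the idea that the second stage is a flat random rate are all assumptions made for illustration.

```python
import random


def client_keeps(sample_rate, rng):
    """Stage 1 (assumed): the SDK decides whether to send the transaction."""
    return rng.random() < sample_rate


def server_indexes(index_rate, rng):
    """Stage 2 (assumed): ingestion decides whether to index what it received."""
    return rng.random() < index_rate


def indexed_transactions(transactions, client_rate, server_rate, seed=0):
    """Return the transactions that survive both stages.

    A transaction is searchable only if it passes stage 1 AND stage 2, so a
    client rate of 1.0 guarantees it is sent, not that it is indexed.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [
        tx
        for tx in transactions
        if client_keeps(client_rate, rng) and server_indexes(server_rate, rng)
    ]
```

Under this model, a single user's handful of transactions can be dropped entirely at the second stage even though every one of them was sent.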

Solution Brainstorm

I'm wondering if dynamic sampling is what might be making things difficult for us here. The Sentry console is generally very quick to respond, and I imagine that the tradeoff for that is that fewer performance transactions are indexed. It probably wouldn't be very useful in the console, but maybe the API could have some endpoint or option that takes longer to return but samples less aggressively.

Product Area

Discover

getsantry[bot] commented 1 year ago

Assigning to @getsentry/support for routing ⏲️

getsantry[bot] commented 1 year ago

Routing to @getsentry/product-owners-performance for triage ⏲️

brentc commented 1 year ago

Thanks for the report. I'll share this with the team to discuss further.

ale-cota commented 1 year ago

Hi Ace, thanks a lot for this feedback and for describing your use case. With dynamic sampling, at the moment we try to index a good number of samples for each project and transaction, but it's indeed not possible to guarantee that a very specific transaction or trace from the past was sampled. I understand how that would come in handy in this scenario, but we don't have enough similar feedback around this yet to consider a different behaviour. We will keep collecting input on this topic and will likely revisit it in the future.

A different piece of functionality, currently in the early stages of development, is something like an "investigation/troubleshooting mode" that might partially address the same need. When no sample transactions are found as a result of a search in Discover or Performance, we could temporarily "bias" our sampling algorithm to collect any future samples that match the criteria for a certain amount of time. This would enable customers to temporarily focus on a specific problem and ensure that samples are gathered. Do you have any thoughts on this? Any input is greatly appreciated.

Thanks a lot!

aellett commented 1 year ago

Hi @ale-cota, thanks for following up. Not surprised to hear that, based on my (admittedly limited) insight into dynamic sampling. We've been trying to think of other ways to achieve this kind of thing, and haven't come up with anything yet. What sounded most promising to us was tagging (all) transactions of some users with something like Dont-Sample: true. It would be sort of lame on the client side to be checking hardcoded user identifiers (or even server-specified user IDs), but I think we could probably make that work if Sentry's ingestion process were able to recognize that tag and respond to it. Though I would guess that the potential for abuse of that sort of thing would cause problems.
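As a rough sketch of what honoring such a tag could look like on the ingestion side: nothing here is an existing Sentry feature. The `Dont-Sample` tag comes from the suggestion above, while the `user_id` tag name, the `allowed_users` allowlist (one possible way to limit abuse), and the `should_keep` function are hypothetical.

```python
def should_keep(tags, base_rate, roll, allowed_users=frozenset()):
    """Decide whether to keep a transaction (hypothetical ingestion logic).

    tags          -- the transaction's tags as a dict of strings
    base_rate     -- the normal sampling rate, between 0.0 and 1.0
    roll          -- a random number in [0, 1), passed in for testability
    allowed_users -- user IDs permitted to bypass sampling (assumed safeguard)
    """
    opted_out = tags.get("Dont-Sample") == "true"
    if opted_out and tags.get("user_id") in allowed_users:
        # The tag is only honored for allowlisted users, so a client cannot
        # switch off sampling for everyone just by setting it.
        return True
    return roll < base_rate
```

For example, `should_keep({"Dont-Sample": "true", "user_id": "exec-e"}, 0.1, 0.9, frozenset({"exec-e"}))` keeps the transaction, while the same call with an empty allowlist falls back to the normal rate and drops it.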

The other functionality that you mentioned is interesting, though for me it would apply in a slightly different case. Today I was working with one of our QEs to verify a transaction that I'd deployed to an RC recently. We wanted to make sure we could find the transactions from his build, but it seemed like they were getting sampled out (since no one else had the build). When we went down to a lower environment, we started seeing the transactions immediately. That's definitely a case where I would love to be able to do something to "bias towards User X" for the next 15 minutes or so. In that situation, we could definitely know that we are going to run the tests again, so something like that would work really well. I think I would probably be slightly wary if I knew that was happening automatically (e.g. based on my recent searches), but I don't know that I would be opposed to it if that's the best way to make it work. On the other hand, if it were a project configuration that needed to be set, that would be one more thing that we'd probably need to train on and protect access for.
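A time-boxed "bias towards User X" rule of the kind discussed above might be modeled roughly as follows. This is only a sketch under assumptions: `BiasRule`, `effective_sample_rate`, and the `user_id` tag name are made up for illustration and do not correspond to any Sentry API.

```python
from dataclasses import dataclass


@dataclass
class BiasRule:
    """A temporary rule that forces full sampling for one tag value."""

    tag_key: str
    tag_value: str
    expires_at: float  # Unix timestamp in seconds; the rule is inactive from then on

    def matches(self, tags, now):
        # The rule applies only before it expires and only to matching tags.
        return now < self.expires_at and tags.get(self.tag_key) == self.tag_value


def effective_sample_rate(tags, base_rate, rules, now):
    """Return 1.0 if any active rule matches, otherwise the base rate."""
    for rule in rules:
        if rule.matches(tags, now):
            return 1.0
    return base_rate


# Usage: bias towards one user for 15 minutes starting at time `start`.
start = 1_000.0
rule = BiasRule("user_id", "user-x", expires_at=start + 15 * 60)
rate_during = effective_sample_rate({"user_id": "user-x"}, 0.1, [rule], now=start + 60)
rate_after = effective_sample_rate({"user_id": "user-x"}, 0.1, [rule], now=start + 16 * 60)
```

Because the rule carries its own expiry, it stops applying on its own and nobody has to remember to turn it off, whether it was created automatically from a search or set explicitly as project configuration.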