getsentry / sentry

Developer-first error tracking and performance monitoring
https://sentry.io
Other
39.22k stars 4.21k forks source link

ref(transactions): add functionality to sample transactions in save_event_transaction #81077

Open JoshFerge opened 1 day ago

JoshFerge commented 1 day ago

first PR of https://github.com/getsentry/sentry/issues/81065

no behavior changes until option is turned on. need to change post_process as well before we can turn the option on. equivalent of https://github.com/getsentry/sentry/blob/b84d3d9238ab75822b076d16de1111c3035fe073/src/sentry/tasks/post_process.py#L614 in post_process

lynnagara commented 1 day ago

What are the implications of transactions sampling not happening exactly once?

Won't this potentially sample some messages in flight twice or not at all depending on the direction of option change

lynnagara commented 1 day ago

I think https://github.com/getsentry/sentry/pull/81079 could probably be part of this same PR, otherwise all this option does is double sample.

JoshFerge commented 1 day ago

I think https://github.com/getsentry/sentry/pull/81079 could probably be part of this same PR, otherwise all this option does is double sample.

@lynnagara the goal of the PRs is to be reviewed separately, not to be deployed separately. i would never turn the option on without the other one deployed

JoshFerge commented 1 day ago

What are the implications of transactions sampling not happening exactly once? Won't this potentially sample some messages in flight twice or not at all depending on the direction of option change

from reading about the transactions sampler, i believe it is a feature that helps with parameterizing high cardinality transaction names. i don't believe double sampling (wouldn't happen as i've implemented)

or not having samples (would be the case for a few moments for this deploy) would have a large effect on the product.

https://develop.sentry.dev/api-server/application-domains/transaction-clustering/#accidental-erasure-of-non-identifiers

JoshFerge commented 15 hours ago

We originally put the "record transaction" logic in post process to move it off the critical path. But if that task is going to disappear completely, this makes sense.

Yeah, that makes. Unfortunately at this point the load we're putting on RabbitMQ and potential instability it causes isn't worth having a whole other task for transactions. I think once we have more durable execution with task workers, it's certainly worth investigating breaking up again.

JoshFerge commented 14 hours ago

Looks good with #81079 so that we aren't double sampling.

merged that PR into this one so it's easier to read / can assure they go out together. (won't do anything until the option is modified)

JoshFerge commented 13 hours ago

rollout strategy: https://github.com/getsentry/sentry/issues/81065#issuecomment-2491767459

wedamija commented 11 hours ago

We also log https://github.com/getsentry/sentry/blob/d25bb2945c50ef50ccc5fbff57929e901b0b4fbf/src/sentry/tasks/post_process.py#L653-L659 as part of post process. I'm not sure if there are any datadog monitors or stat pages that will be affected by removing this. Might be worth adding this to the end of save event for transactions?

lynnagara commented 10 hours ago

We have to be very careful about how this impacts performance of save event. I think we should look into a couple of things first before making this change:

wedamija commented 10 hours ago

We have to be very careful about how this impacts performance of save event. I think we should look into a couple of things first before making this change:

  • Can we collect timings on transactions post processing is first before we move it into save event? I think that whether it is safe to do depends on the timing. If it's above, say, 10 milliseconds, i think we should maybe consider some other options?
  • Secondly, signals aren't allowed in the save event task, as it is too easy for additional work to be added there, and. Can we remove those?
  • The volume on this pipeline is expected to increase up to 4 fold in the coming months and we would need to understand what's needed to scale if save event transactions job gets slower

I agree we should time it to be safe, but I think it's unlikely that it really takes very long compared to https://github.com/getsentry/sentry/blob/9ee399c5365a1e44d98b92f617249eb9bd220385/src/sentry/event_manager.py#L2635 which we do in in save_transaction_events. This does a lot of relatively expensive analysis on all the details of the transaction.

JoshFerge commented 9 hours ago

we spent some time looking at the performance of the signals / transactions clustering on current telemetry in post_process, and both looked to be sub 10ms average, with no large outliers. this PR includes timing telemetry so we can monitor perf in S4S to confirm before any wider rollout takes place.

also confirmed that transaction clustering doesn't look at anything in clickhouse, it only uses redis.

i've split the option used to roll this out into two, so we can roll out by turning on transactions.do_post_process_in_save, confirming looks good (in this time we'll also do the same work in post_process, but doing things double here are fine)

after we confirm looks good, we can flip on transactions.dont_do_transactions_logic_in_post_process, which will stop doing the work in post_process.

codecov[bot] commented 8 hours ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

:white_check_mark: All tests successful. No failed tests found.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #81077 +/- ## ========================================== + Coverage 78.48% 80.36% +1.87% ========================================== Files 7215 7217 +2 Lines 319812 319896 +84 Branches 44045 20741 -23304 ========================================== + Hits 251012 257078 +6066 - Misses 62411 62426 +15 + Partials 6389 392 -5997 ```
lynnagara commented 8 hours ago

Currently there are 3 things that happen in post processing:

I think instead of splitting these under different options you can combine this under a single one transactions.do_post_process_in_save and only checking that option in save_event_transaction and not in post processing at all. I believe this works without any additional checks in post process because if you do everything in including deletion here then post process will never run anyway. The relevant line that prevents this from being re- post-processed is here: https://github.com/getsentry/sentry/blob/2e30c63b2e42fe2e6e11fa099bad827ade862668/src/sentry/tasks/post_process.py#L528.

I think there are a few benefits of this approach:

What do you think?

JoshFerge commented 6 hours ago

Currently there are 3 things that happen in post processing:

  • transactions clustering
  • signal + onboarding tasks
  • delete from rc-processing

I think instead of splitting these under different options you can combine this under a single one transactions.do_post_process_in_save and only checking that option in save_event_transaction and not in post processing at all. I believe this works without any additional checks in post process because if you do everything in including deletion here then post process will never run anyway. The relevant line that prevents this from being re- post-processed is here:

https://github.com/getsentry/sentry/blob/2e30c63b2e42fe2e6e11fa099bad827ade862668/src/sentry/tasks/post_process.py#L528

. I think there are a few benefits of this approach:

  • Consistency. You get exactly once processing without much work. You do not have a minute or two when this option gets flipped where the events that are in the middle of the pipeline may skip this processing
  • Less complexity
  • All combinations of options are valid. There is only one option and turning it up and down is harmless. It's easy for anyone (like SRE) to flip back in an incident. The other way requires more knowledge of the code to ensure an invalid combination isn't picked.

What do you think?

i've gone ahead and implemented this -- it's much cleaner. thank you for the suggestion. closing the follow up PR.

JoshFerge commented 5 hours ago

updated rollout plan: https://github.com/getsentry/sentry/issues/81065#issuecomment-2491767459