Open JoshFerge opened 1 day ago
What are the implications of transactions sampling not happening exactly once?
Won't this potentially sample some messages in flight twice or not at all depending on the direction of option change
I think https://github.com/getsentry/sentry/pull/81079 could probably be part of this same PR, otherwise all this option does is double sample.
I think https://github.com/getsentry/sentry/pull/81079 could probably be part of this same PR, otherwise all this option does is double sample.
@lynnagara the goal of the PRs is to be reviewed separately, not to be deployed separately. i would never turn the option on without the other one deployed
What are the implications of transactions sampling not happening exactly once? Won't this potentially sample some messages in flight twice or not at all depending on the direction of option change
from reading about the transactions sampler, i believe it is a feature that helps with parameterizing high cardinality transaction names. i don't believe double sampling (wouldn't happen as i've implemented)
or not having samples (would be the case for a few moments for this deploy) would have a large effect on the product.
We originally put the "record transaction" logic in post process to move it off the critical path. But if that task is going to disappear completely, this makes sense.
Yeah, that makes. Unfortunately at this point the load we're putting on RabbitMQ and potential instability it causes isn't worth having a whole other task for transactions. I think once we have more durable execution with task workers, it's certainly worth investigating breaking up again.
Looks good with #81079 so that we aren't double sampling.
merged that PR into this one so it's easier to read / can assure they go out together. (won't do anything until the option is modified)
rollout strategy: https://github.com/getsentry/sentry/issues/81065#issuecomment-2491767459
We also log https://github.com/getsentry/sentry/blob/d25bb2945c50ef50ccc5fbff57929e901b0b4fbf/src/sentry/tasks/post_process.py#L653-L659 as part of post process. I'm not sure if there are any datadog monitors or stat pages that will be affected by removing this. Might be worth adding this to the end of save event for transactions?
We have to be very careful about how this impacts performance of save event. I think we should look into a couple of things first before making this change:
We have to be very careful about how this impacts performance of save event. I think we should look into a couple of things first before making this change:
- Can we collect timings on transactions post processing is first before we move it into save event? I think that whether it is safe to do depends on the timing. If it's above, say, 10 milliseconds, i think we should maybe consider some other options?
- Secondly, signals aren't allowed in the save event task, as it is too easy for additional work to be added there, and. Can we remove those?
- The volume on this pipeline is expected to increase up to 4 fold in the coming months and we would need to understand what's needed to scale if save event transactions job gets slower
I agree we should time it to be safe, but I think it's unlikely that it really takes very long compared to https://github.com/getsentry/sentry/blob/9ee399c5365a1e44d98b92f617249eb9bd220385/src/sentry/event_manager.py#L2635 which we do in in save_transaction_events
. This does a lot of relatively expensive analysis on all the details of the transaction.
we spent some time looking at the performance of the signals / transactions clustering on current telemetry in post_process, and both looked to be sub 10ms average, with no large outliers. this PR includes timing telemetry so we can monitor perf in S4S to confirm before any wider rollout takes place.
also confirmed that transaction clustering doesn't look at anything in clickhouse, it only uses redis.
i've split the option used to roll this out into two, so we can roll out by turning on transactions.do_post_process_in_save
, confirming looks good (in this time we'll also do the same work in post_process, but doing things double here are fine)
after we confirm looks good, we can flip on transactions.dont_do_transactions_logic_in_post_process
, which will stop doing the work in post_process.
All modified and coverable lines are covered by tests :white_check_mark:
:white_check_mark: All tests successful. No failed tests found.
Currently there are 3 things that happen in post processing:
I think instead of splitting these under different options you can combine this under a single one transactions.do_post_process_in_save
and only checking that option in save_event_transaction and not in post processing at all. I believe this works without any additional checks in post process because if you do everything in including deletion here then post process will never run anyway. The relevant line that prevents this from being re- post-processed is here: https://github.com/getsentry/sentry/blob/2e30c63b2e42fe2e6e11fa099bad827ade862668/src/sentry/tasks/post_process.py#L528.
I think there are a few benefits of this approach:
What do you think?
Currently there are 3 things that happen in post processing:
- transactions clustering
- signal + onboarding tasks
- delete from rc-processing
I think instead of splitting these under different options you can combine this under a single one
transactions.do_post_process_in_save
and only checking that option in save_event_transaction and not in post processing at all. I believe this works without any additional checks in post process because if you do everything in including deletion here then post process will never run anyway. The relevant line that prevents this from being re- post-processed is here:. I think there are a few benefits of this approach:
- Consistency. You get exactly once processing without much work. You do not have a minute or two when this option gets flipped where the events that are in the middle of the pipeline may skip this processing
- Less complexity
- All combinations of options are valid. There is only one option and turning it up and down is harmless. It's easy for anyone (like SRE) to flip back in an incident. The other way requires more knowledge of the code to ensure an invalid combination isn't picked.
What do you think?
i've gone ahead and implemented this -- it's much cleaner. thank you for the suggestion. closing the follow up PR.
updated rollout plan: https://github.com/getsentry/sentry/issues/81065#issuecomment-2491767459
first PR of https://github.com/getsentry/sentry/issues/81065
no behavior changes until option is turned on. need to change post_process as well before we can turn the option on. equivalent of https://github.com/getsentry/sentry/blob/b84d3d9238ab75822b076d16de1111c3035fe073/src/sentry/tasks/post_process.py#L614 in post_process