getsentry / sentry

Developer-first error tracking and performance monitoring
https://sentry.io
Other
38.67k stars 4.14k forks source link

Transaction Duration Metrics don't match Measurement Metrics #68747

Open kylannjohnson opened 5 months ago

kylannjohnson commented 5 months ago

Environment

SaaS (https://sentry.io/)

Steps to Reproduce

Consider two transactions, one un-scoped and the other scoped. (Let's call them Transaction A and B). Transaction A's duration is added as a measurement to Transaction B because we'd like to just track the duration of this operation instead of using a transaction on it.

The metrics don't seem to match and my question is whether they should or not.

In the dashboard below the left panels are showing various functions on transaction.duration of Transaction A. The lowest panel is just a list of the duration. The panels on the right are showing the same functions but applied to a measurement sent with Transaction B (the scoped one).

Expected Result

I expect there's some interpolation going on in both cases, but I would expect the same results, given the same, small set of data. Should a percentile calculation done on a measurement be the same as a duration of the same amount? All durations are known and seem to both be in the data set.

If there is some variance, what is the accuracy of it? e.g. +/- 1%

Actual Result

Screenshot 2024-04-08 at 10 21 21 AM

Product Area

Dashboards

Link

No response

DSN

No response

Version

No response

┆Issue is synchronized with this Jira Improvement by Unito

getsantry[bot] commented 5 months ago

Assigning to @getsentry/support for routing ⏲️

getsantry[bot] commented 5 months ago

Routing to @getsentry/product-owners-performance for triage ⏲️

k-fish commented 5 months ago

Hey there, thanks for writing in, I think I understand what you're asking, but before we get into how the percentiles etc. work can you possibly show the events durations and the measurement in discover (or if you forgo using aggregates in dashboards and add event-id, the table should work) and list out each event? just to make sure the underlying numbers are the same?

eg. something like this (ignore that the measurement here is num_of_spans and replace it with the measurement you're using) Screenshot 2024-04-15 at 4 31 31 PM

Also do you have any filters on these? I see count() of 8 for measurements and transactions, so if that's correct it should be showing the same number of data points.

kylannjohnson commented 5 months ago

Sorry, but the durations were obscured by the excel sheet and they matched. They were listed in tables on the bottom with the only filter on the left being title:TransactionA and the right being title:TransactionB. Then the columns were for transaction.duration and measurement.foo respectively

kylannjohnson commented 5 months ago

I was able to zero in on the same dataset. I did add a dashboard filter so that all tiles use the same release (the dev release i was working on) image

k-fish commented 5 months ago

Hmm those look correct, you should be getting the same percentile for the same period, although there can be some small differences since we use an approximate quantile function so when there are millions (or billions) of data points it doesn't take a long time.

Can you possibly share the id for this dashboard (sentry.io/dashboard/<id here>/)? I can check on the backend and maybe figure out what's up.

kylannjohnson commented 4 months ago

Sorry for the delay. Let me make sure I can disclose that

kylannjohnson commented 4 months ago

Can you possibly share the id for this dashboard (sentry.io/dashboard/<id here>/)? I can check on the backend and maybe figure out what's up.

ID: 85476

getsantry[bot] commented 4 months ago

Routing to @getsentry/product-owners-ddm for triage ⏲️

getsantry[bot] commented 4 months ago

Routing to @getsentry/product-owners-performance for triage ⏲️

gggritso commented 4 months ago

@kylannjohnson thanks for providing the dashboard ID 👍🏻 We're taking a look, and will get back to you early next week–some of the folks working on this are away at the moment. Thank you for your patience!

wmak commented 4 months ago

Hmm this seems to be an issue with how we're storing this specific metrics data, i've reached out to the metrics team to help debug this. I'll let you know when I hear back from them.

wmak commented 4 months ago

Hey @kylannjohnson thanks for raising this you've helped us discover a pretty gnarly bug. This is our current understanding of what's happening

  1. On April 8th, login_to_vitals measurements were sent to Sentry with a mix of units, some milliseconds, others seconds
  2. Currently Sentry will always pick the first unit it sees and assume all the other values for a given measurement will use the same unit
  3. This means that resulting aggregations are inconsistent with the sample list you have (eg. the 7.74s item in your original screenshot, is actually 7.74ms, you can confirm this if you open the event json for that in the Trace View) image

What this means for you hopefully is

  1. Immediately - As long as you stick with a single unit, the data will be correct now
  2. Short Term - We're going to fix the way we display samples to take multiple units into account
  3. Long Term - We're going to introduce a way to handle multiple units more gracefully on storage so that when we can we'll convert the units to a consistent value
kylannjohnson commented 3 months ago

@wmak Is there any way to delete the current setting for this measurement (or any measurement)? In this case, I think we would be fine starting from scratch because we're trying to move off of transaction.duration to a measurement.

Is the selection of the first unit project-based? Meaning, if we had a "staging project" that we tested on before moving to a production project, would that avoid this?

wmak commented 2 months ago

👋 My assumption here is we can't but I've reached out to @matejminar who has more information regarding the storage of measurements who can confirm for you.

wmak commented 2 months ago

Alternatively, if you just want to get the immediate solution you can rename the measurement too which should work immediately.

As well yes the unit determination is based on the currently selected projects

matejminar commented 2 months ago

@wmak is right, we can't delete measurements. The best bet would be to use a different name if you'd like to start fresh.

slightly unrelated side note: we don't support deletion in custom metrics either; we just support "disabling" which blocks ingestion and hides the metric from UI, but unfortunately, that's not available with custom measurements https://docs.sentry.io/product/explore/metrics/metrics-settings/#emitted-metrics

kylannjohnson commented 2 months ago

Ok, we'll explore versioning our metric/measurement names. Looking forward to the fix, too.

edit: I just wanted to be sure I understand. If we send a measurement to Project1 and then to Project2, is that considered two different measurements? We do have a smaller project just for experimentation and non-production data.

wmak commented 2 months ago

I just wanted to be sure I understand. If we send a measurement to Project1 and then to Project2, is that considered two different measurements?

Not really, its more that per request we look at the selected projects and try our best to figure out what the user means when you request a measurement.

If both projects are selected, the unit from either project may be used.

getsantry[bot] commented 2 months ago

Routing to @getsentry/product-owners-ddm for triage ⏲️

kylannjohnson commented 2 months ago

Hey all, any updates here?

getsantry[bot] commented 1 month ago

Routing to @getsentry/product-owners-dashboards for triage ⏲️

ale-cota commented 1 month ago

Hi @kylannjohnson, Sorry for the delayed reply here. Have you managed to implement the short term fix we suggested to rename the measurements and start fresh with a single unit? Are you still experiencing issues in your dashboard?

We don’t have an ETA yet for the longterm fix to handle and merge multiple units on the backend. This is currently on our backlog, but if you're still having issues, we'll try to find a solution. Thanks!

kylannjohnson commented 1 month ago

Hey @ale-cota ,

Thanks for the response. Yes, we have decided to version our measurement names going forward. It seems like this will address our issue in our local tests on a separate dashboard.

Perhaps there's already a way to do this, but some kind of Measurement Summary could be a way to indicate what raw values came in to Sentry or what unit they are being assigned. I can see why the implementation is "sticky", but versioning a measurement name also isn't great. I'm sure there's a great solution to this product level issue, but we'll keep an eye out as we grow our usage of measurements in the near future.

ale-cota commented 1 month ago

Thanks for confirming that versioning for the names at least addresses the issue for now. We'll update you once we manage to schedule the investigation work for merging/summarizing multiple units.