cockroachdb / cockroach

CockroachDB - the open source, cloud-native distributed SQL database.
https://www.cockroachlabs.com

storage,tsdb: incorrect graphs for storage byte volumes #105923

Open jbowens opened 1 year ago

jbowens commented 1 year ago

I believe the DB Console Storage dashboard is presenting storage metrics incorrectly. It appears to downsample using averaging (maybe a consequence of these being modeled as gauges instead of counters within the Cockroach metric system).

From the Storage dashboard:

Screenshot 2023-06-30 at 2 53 31 PM

Using sum downsampling:

Screenshot 2023-06-30 at 2 53 40 PM

Related to #102729, #75523, #99922.

Jira issue: CRDB-29274

jbowens commented 1 year ago

I'm second-guessing myself here. I'll run an experiment again this week and write up something definitive.

jbowens commented 11 months ago

I suspect the unit should just be MiB/s, because these values are scaled to be per-second. Can someone from @cockroachdb/cluster-observability confirm?

zachlite commented 11 months ago

@jbowens, yeah quantity/second seems right. I'd expect area under the curve for some interval to give me a value in bytes. This should be the case for Compactions, Ingestions, and Time Series Bytes Written as well.

jbowens commented 11 months ago

> I'd expect area under the curve for some interval to give me a value in bytes.

Yeah, the ambiguous element right now is the unit of time. Previously my expectation was that if a point represented a 10-second span, its value represented the cumulative bytes flushed (or compacted, ingested, etc.) over that 10-second period. To sum the area under the curve, all you need to do is add up the points. But that does not appear to be what's shown.

Do you know where this value gets averaged to a per-second rate?

zachlite commented 11 months ago

Then it sounds like we shouldn't be requesting this timeseries data as an average rate, but should instead request it as a sum: https://github.com/cockroachdb/cockroach/blob/3eb6d6acd625871b7ce6409f199648132c734545/pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/storage.tsx#L209

https://github.com/cockroachdb/cockroach/blob/d6fb82dc5fd7f2657c5dee856076723fc716f8f0/pkg/ts/tspb/timeseries.proto#L88-L109

jbowens commented 11 months ago

Excluding the nonNegativeRate would produce a monotonically increasing value: the total bytes flushed since the process began. We want a rate; we just want the rate to have a known time interval. It's unclear to me what time interval we're seeing now, and whether it is scaled to per-second or not.

nicktrav commented 11 months ago

Question for obs-inf: what are the units?

ericharmeling commented 11 months ago

> We want a rate, we just want the rate to have some known time interval. It's unclear to me what time interval we're seeing now, and if it is scaled to per-second or not.

Apologies for the delay on this!

> It appears to downsample using averaging

Currently, the frontend constructs a query that uses the MAX downsampler if none is explicitly provided: https://github.com/cockroachdb/cockroach/blob/master/pkg/ui/workspaces/db-console/src/views/shared/containers/metricDataProvider/index.tsx#L56

I think we want to reconsider that default behavior: https://github.com/cockroachdb/cockroach/issues/106215.

> It's unclear to me what time interval we're seeing now, and if it is scaled to per-second or not.

For graph time intervals under 30 minutes, I believe we use a 10s resolution. And I think nonNegativeRate provides the rate of change over that resolution period, in terms of individual, downsampled observations: https://github.com/cockroachdb/cockroach/blob/master/pkg/ts/query.go#L425. So while it's not technically a recorded per-second rate of change, it can be expressed as "per second", as it's a first-order derivative between two observations over the sample period.

jbowens commented 10 months ago

Thanks @ericharmeling! Sorry, I'm still a bit mixed up here.

Say we use a 10s resolution, and we have a data point X1 at time T and a data point X2 at T+10s. With the nonNegativeRate I was expecting to see the point at T+10s to be X2-X1, the delta between the two observations.

Looking at the code you linked, it looks like we compute X2-X1, then we divide by the difference between the timestamps, scaled to the samplePeriod. Is the samplePeriod 1 second here, so that the rate of change is scaled to per-second?

ericharmeling commented 10 months ago

> Looking at the code you linked, it looks like we compute X2-X1, then we divide by the difference between the timestamps, scaled to the samplePeriod. Is the samplePeriod 1 second here, so that the rate of change is scaled to per-second?

Ahh, okay. I think I misinterpreted the calculation of rateOfChange: I was reading it as "X2-X1/samplePeriod", but that's incorrect. samplePeriod is 10s, which should be the same as the difference between times T and T+10. So the unit of change between the observations, as calculated by rateOfChange, is "per 10 seconds".

I believe the per-second scaling then happens on the output of rateOfChange here: https://github.com/cockroachdb/cockroach/blob/2ec2b61627cdadc8b269e85d6b4a64c669b6c140/pkg/ts/query.go#L709. This per-second rate is then returned by the TSDB to the UI, so the UI values are per-second.

jbowens commented 9 months ago

Thanks @ericharmeling! I'll update the DB console labels.