jbowens opened 1 year ago
I'm second-guessing myself here. I'll run an experiment again this week and write up something definitive.
I suspect the unit should just be MiB/s, because these values are scaled to be per-second. Can someone from @cockroachdb/cluster-observability confirm?
@jbowens, yeah quantity/second seems right. I'd expect area under the curve for some interval to give me a value in bytes. This should be the case for Compactions, Ingestions, and Time Series Bytes Written as well.
> I'd expect area under the curve for some interval to give me a value in bytes.
Yeah, the remaining ambiguity is the unit of time. Previously my expectation was that if a point represented a 10-second span, its value represented the cumulative bytes flushed (or compacted, ingested, etc.) over that 10-second period. To sum the area under the curve, all you need to do is add up the points. But that does not appear to be what's shown.
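To make the ambiguity concrete, here's a minimal sketch contrasting the two interpretations (function name and sample data are mine, not from the codebase): if each point is the raw byte total for its span, summing the points gives the area under the curve; if each point is a per-second rate, each must first be multiplied by the sample period.

```go
package main

import "fmt"

// areaUnderCurve is a hypothetical helper, not CockroachDB code.
// If perSecond is true, each point is treated as a bytes/sec rate and
// scaled by the sample period; otherwise each point is treated as the
// raw byte total for its interval.
func areaUnderCurve(points []float64, perSecond bool, samplePeriodSecs float64) float64 {
	var total float64
	for _, p := range points {
		if perSecond {
			total += p * samplePeriodSecs
		} else {
			total += p
		}
	}
	return total
}

func main() {
	// The same underlying data, expressed both ways for a 10s sample period.
	perInterval := []float64{120, 80, 200} // bytes per 10s point
	perSecond := []float64{12, 8, 20}      // bytes/sec per point

	fmt.Println(areaUnderCurve(perInterval, false, 10)) // 400
	fmt.Println(areaUnderCurve(perSecond, true, 10))    // 400
}
```

Either convention recovers the same total bytes; the dashboard label just needs to say which one the points use.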
Do you know where this value gets averaged to a per-second rate?
Then it sounds like we shouldn't be requesting this timeseries data as an average rate, but should instead request it as a sum: https://github.com/cockroachdb/cockroach/blob/3eb6d6acd625871b7ce6409f199648132c734545/pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/storage.tsx#L209
Excluding the `nonNegativeRate` would produce a monotonically increasing value of the total bytes flushed since the process began. We want a rate; we just want the rate to have some known time interval. It's unclear to me what time interval we're seeing now, and whether it is scaled to per-second or not.
Question for obs-inf: what are the units?
> We want a rate, we just want the rate to have some known time interval. It's unclear to me what time interval we're seeing now, and if it is scaled to per-second or not.
Apologies for the delay on this!
> It appears to downsample using averaging
Currently, the frontend constructs a query that uses the MAX downsampler if none is explicitly provided: https://github.com/cockroachdb/cockroach/blob/master/pkg/ui/workspaces/db-console/src/views/shared/containers/metricDataProvider/index.tsx#L56
I think we want to reconsider that default behavior: https://github.com/cockroachdb/cockroach/issues/106215.
> It's unclear to me what time interval we're seeing now, and if it is scaled to per-second or not.
For graph time intervals under 30 minutes, I believe we use a 10s resolution. And I think `nonNegativeRate` provides the rate of change over that resolution time, in terms of individual, downsampled observations: https://github.com/cockroachdb/cockroach/blob/master/pkg/ts/query.go#L425. So while it's technically not a recorded per-second rate of change, it can be expressed as per-second, as it's a first-order derivative between two observations over the sample period.
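A minimal sketch of that first-derivative behavior, assuming the clamp-at-zero semantics implied by the name (illustrative only, not the actual implementation in pkg/ts/query.go):

```go
package main

import "fmt"

// nonNegativeDelta is an illustrative sketch of the behavior described
// above: the difference between two successive downsampled observations,
// clamped at zero so counter resets don't produce negative rates.
// It is not the actual pkg/ts/query.go implementation.
func nonNegativeDelta(prev, cur float64) float64 {
	d := cur - prev
	if d < 0 {
		return 0 // e.g. the process restarted and the counter reset
	}
	return d
}

func main() {
	fmt.Println(nonNegativeDelta(100, 250)) // 150: bytes over one sample period
	fmt.Println(nonNegativeDelta(250, 40))  // 0: counter reset, clamped
}
```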
Thanks @ericharmeling! Sorry, I'm still a bit mixed up here.
Say we use a 10s resolution, and we have a data point X1 at time T and a data point X2 at T+10s. With the `nonNegativeRate`, I was expecting the point at T+10s to be X2-X1, the delta between the two observations.

Looking at the code you linked, it looks like we compute X2-X1, then divide by the difference between the timestamps, scaled to the `samplePeriod`. Is the `samplePeriod` 1 second here, so the rate of change is scaled to per-second?
> Looking at the code you linked, it looks like we compute X2-X1, then divide by the difference between the timestamps, scaled to the samplePeriod. Is the samplePeriod 1 second here, so the rate of change is scaled to per-second?
Ahh, okay. I think I misinterpreted the calculation of `rateOfChange`: I was reading it as "(X2-X1)/samplePeriod". But that's incorrect. `samplePeriod` is 10s, which should be the same as the difference between time T and T+10s. So the unit of change between the observations, as calculated by `rateOfChange`, should be "per 10 seconds". I believe the scaling by the sample period then happens on the output of `rateOfChange` here: https://github.com/cockroachdb/cockroach/blob/2ec2b61627cdadc8b269e85d6b4a64c669b6c140/pkg/ts/query.go#L709. This per-second rate is then returned by TSDB to the UI, so the UI values are per-second.
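Putting the two steps described above together as a sketch (names and units, including nanosecond timestamps, are my assumptions, not the actual query.go code):

```go
package main

import "fmt"

// perSecondRate sketches the two-step computation discussed above:
// first the delta between two observations is normalized to one sample
// period ("rateOfChange"), then that result is divided by the sample
// period in seconds to yield a per-second rate. This is an illustration,
// not the actual pkg/ts/query.go code.
func perSecondRate(x1, x2 float64, t1, t2, samplePeriodNanos int64) float64 {
	// Delta per sample period: (X2-X1) / ((T2-T1)/samplePeriod).
	rateOfChange := (x2 - x1) / (float64(t2-t1) / float64(samplePeriodNanos))
	// Scale "per sample period" down to "per second".
	return rateOfChange / (float64(samplePeriodNanos) / 1e9)
}

func main() {
	// 100 bytes flushed across one 10s sample period -> 10 bytes/sec.
	fmt.Println(perSecondRate(0, 100, 0, 10_000_000_000, 10_000_000_000))
}
```

With a 10s sample period, a delta of 100 bytes between adjacent points comes out as 10 bytes/sec, consistent with the conclusion that the UI values are per-second.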
Thanks @ericharmeling! I'll update the DB console labels.
I believe the DB Console Storage dashboard is presenting storage metrics incorrectly. It appears to downsample using averaging (maybe a consequence of these being modeled as gauges instead of counters within the Cockroach metric system).
From the Storage dashboard: *(screenshot omitted)*

Using sum downsampling: *(screenshot omitted)*
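The two captions above contrast average and sum downsampling. To illustrate why the choice matters for counter-style metrics, a hypothetical sketch (helpers and data are illustrative, not CockroachDB's tsdb code): averaging the 10s buckets into one coarser point under-reports the total bytes, while summing preserves the area under the curve.

```go
package main

import "fmt"

// Illustrative downsampling helpers; not CockroachDB's tsdb code.
// Each takes one bucket of fine-resolution points and collapses it
// into a single coarser point.
func downsampleAvg(bucket []float64) float64 {
	var s float64
	for _, v := range bucket {
		s += v
	}
	return s / float64(len(bucket))
}

func downsampleSum(bucket []float64) float64 {
	var s float64
	for _, v := range bucket {
		s += v
	}
	return s
}

func main() {
	// Three 10s points, each holding the bytes flushed in its interval.
	bucket := []float64{120, 80, 200}
	fmt.Println(downsampleAvg(bucket)) // ~133.3: loses the total
	fmt.Println(downsampleSum(bucket)) // 400: preserves total bytes
}
```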
Related to #102729, #75523, #99922.
Jira issue: CRDB-29274