metrics: make certain `sys` network/disk stats counters instead of gauges.

abarganier commented 1 year ago

Is your feature request related to a problem? Please describe. In the RuntimeStatsSampler, we gather various system metrics from the host on things like memory, disk, network, cpu, etc.

A few of these metrics, such as sys_host_disk_[read|write]_bytes and sys_host_net_[send|recv]_bytes, are cumulative monotonically increasing values. However, these metrics are exported as GAUGE metrics. This makes it difficult to calculate common functions like rate-of-change against the metrics, as reported by one customer struggling to do so in NewRelic.

Describe the solution you'd like Identify gauges in the RuntimeStatusSampler that would be better represented as a counter, and migrate the metrics over.

Note that our metric.Counter API only allows to .Inc(int64)/Count(), while the raw measurements are cumulative. Therefore, we will need to compute the delta since the last measurement when updating the metric values. This is already done in some areas for other purposes, so we can likely piggy-back off of that work.

Additional context Be sure to consider the impact on downstream metric systems of changing the metric types. To my knowledge, this would lead to a new & separate timeseries in systems like Prometheus, breaking metric continuity. But how will two metrics with the same name but different types fare in these systems? We should test the migration out by running a Prometheus instance scraping a CockroachDB node with the gauge metrics, and then restart the node with the metrics changed over to counters and see the impact when viewing the metric in Grafana.

Be sure to backport this to ~~v22.1+~~ v22.2+. The specific customer experiencing the issue in NewRelic is on v22.1.18, but plans on upgrading to v22.2 soon.

Jira issue: CRDB-27619

Epic CRDB-32137

AndrewSouthpaw commented 1 year ago

Thanks for turning this into a public issue, @abarganier! FWIW we're just about to switch onto v22.2 so the backport is less important at least to us.

abarganier commented 1 year ago

@AndrewSouthpaw thanks for the tip, and thank you for bringing this to our attention! We can update the backport strategy to v22.2+ in that case.

thtruo commented 1 year ago

@dhartunian mentioned https://github.com/cockroachdb/cockroach/issues/75523

jbowens commented 1 year ago

This is related to #104114. If we rework these metrics, I think we should consider instead exporting 1 disk metric per store, mapping the device to the store on start up.

cockroachdb / cockroach

metrics: make certain `sys` network/disk stats counters instead of gauges. #102729