Open abarganier opened 1 year ago
Thanks for turning this into a public issue, @abarganier! FWIW we're just about to switch onto v22.2 so the backport is less important at least to us.
@AndrewSouthpaw thanks for the tip, and thank you for bringing this to our attention! We can update the backport strategy to v22.2+ in that case.
@dhartunian mentioned https://github.com/cockroachdb/cockroach/issues/75523
This is related to #104114. If we rework these metrics, I think we should consider instead exporting 1 disk metric per store, mapping the device to the store on start up.
Is your feature request related to a problem? Please describe. In the RuntimeStatsSampler, we gather various system metrics from the host on things like memory, disk, network, cpu, etc.
A few of these metrics, such as
sys_host_disk_[read|write]_bytes
andsys_host_net_[send|recv]_bytes
, are cumulative monotonically increasing values. However, these metrics are exported asGAUGE
metrics. This makes it difficult to calculate common functions like rate-of-change against the metrics, as reported by one customer struggling to do so in NewRelic.Describe the solution you'd like Identify gauges in the RuntimeStatusSampler that would be better represented as a counter, and migrate the metrics over.
Note that our
metric.Counter
API only allows to.Inc(int64)
/Count()
, while the raw measurements are cumulative. Therefore, we will need to compute the delta since the last measurement when updating the metric values. This is already done in some areas for other purposes, so we can likely piggy-back off of that work.Additional context Be sure to consider the impact on downstream metric systems of changing the metric types. To my knowledge, this would lead to a new & separate timeseries in systems like Prometheus, breaking metric continuity. But how will two metrics with the same name but different types fare in these systems? We should test the migration out by running a Prometheus instance scraping a CockroachDB node with the
gauge
metrics, and then restart the node with the metrics changed over tocounter
s and see the impact when viewing the metric in Grafana.Be sure to backport this to
v22.1+v22.2+. The specific customer experiencing the issue in NewRelic is on v22.1.18, but plans on upgrading to v22.2 soon.Jira issue: CRDB-27619
Epic CRDB-32137