aws / amazon-vpc-cni-k8s

Networking plugin repository for pod networking in Kubernetes using Elastic Network Interfaces on AWS
Apache License 2.0
2.28k stars 741 forks source link

Counters reported as Gauges in Prometheus metrics #3031

Open danielgblanco opened 1 month ago

danielgblanco commented 1 month ago

What happened: Some of the Prometheus metrics exported by the VPC CNI plugin are defined with inaccurate metric types. For example:

https://github.com/aws/amazon-vpc-cni-k8s/blob/27ce1362636567592f006b987f3820c6b0fef55e/utils/prometheusmetrics/prometheusmetrics.go#L64

This metric (awscni_add_ip_req_count) is exported as a gauge but it has cumulative incremental values. In fact, it seems that it's used as a counter in:

https://github.com/aws/amazon-vpc-cni-k8s/blob/27ce1362636567592f006b987f3820c6b0fef55e/pkg/ipamd/rpc_handler.go#L70

It seems that awscni_del_ip_req_count is correctly exported as a counter.

I probably don't have enough context on this to make a judgement call. However, I think there are probably more Gauges that are operating as Counters.

Attach logs N/A

What you expected to happen: I'd expect metrics to follow the semantic conventions defined in https://prometheus.io/docs/concepts/metric_types/

How to reproduce it (as minimally and precisely as possible): Using Prometheus exporters.

Anything else we need to know?: This may not be a critical issues if systems use Prometheus as the backend. However, it becomes a problem when Prometheus metrics are transformed into other representations. For example, OpenTelemetry Collectors will read this as a Gauge and that gives the aggregation a different meaning (e.g. one can change temporality of counters from cumulative to delta or viceversa).

Environment:

orsenthil commented 1 month ago

@danielgblanco - could you verify this with the latest version of VPC CNI. This was fixed in the recent versions of CNI 1.18.3.

orsenthil commented 1 month ago

This is still present as Gauge in master

https://github.com/aws/amazon-vpc-cni-k8s/blob/master/utils/prometheusmetrics/prometheusmetrics.go#L64

We need to fix this to use as Counter.

danielgblanco commented 1 month ago

Sorry I've been on PTO, thanks for the follow up.

hbhasker commented 1 month ago

Looks like its not just the AddIPCnt metric that changed incorrectly to a gauge. I think this was done incorrectly in https://github.com/aws/amazon-vpc-cni-k8s/commit/2ac9e0ae1eed0af0c6a5db8e481fd73dc796feb3#diff-6c65a620b5206565cbd61b3390a33e02146dac52f62735909564c6b963127968L182