GoogleCloudPlatform / prometheus-engine

Google Cloud Managed Service for Prometheus libraries and manifests.
https://g.co/cloud/managedprometheus
Apache License 2.0

cadvisor-metrics.yaml endpoint causing ClusterNodeMonitoring pool name parsing failure #1097

Closed deekthesqueak closed 1 month ago

deekthesqueak commented 1 month ago

I enabled kubelet/cadvisor on my test GCP cluster (v1.29.6-gke1038001) with GCP managed Prometheus (0.12.0-gke.3). Once enabled, the following error repeats in the logs when viewing the pods with:

kubectl logs -f -ngmp-system -lapp.kubernetes.io/part-of=gmp

{"level":"error","ts":"2024-08-01T22:05:05Z","msg":"poll and update","error":"invalid ClusterNodeMonitoring scrape pool format \"ClusterNodeMonitoring/gmp-kubelet-cadvisor/metrics/cadvisor\"","stacktrace":"github.com/GoogleCloudPlatform/prometheus-engine/pkg/operator.(*targetStatusReconciler).Reconcile\n\t/app/pkg/operator/target_status.go:176\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}

The error message, invalid ClusterNodeMonitoring scrape pool format "ClusterNodeMonitoring/gmp-kubelet-cadvisor/metrics/cadvisor", can be traced to pkg/operator/endpoint_status_builder.go, line 166:

    case "ClusterNodeMonitoring":
        if len(split) != 3 {
            return scrapePool{}, fmt.Errorf("invalid ClusterNodeMonitoring scrape pool format %q", pool)
        }
        return getClusterScopedScrapePool(pool, split), nil

The gmp-kubelet-cadvisor ClusterNodeMonitoring in https://github.com/GoogleCloudPlatform/prometheus-engine/blob/v0.12.0/examples/cadvisor-metrics.yaml will always fail this check: its endpoint path (/metrics/cadvisor) contains two forward slashes, so the scrape pool name always has 4 slash-separated parts instead of the expected 3.

This doesn't happen with the similar https://github.com/GoogleCloudPlatform/prometheus-engine/blob/v0.12.0/examples/kubelet-metrics.yaml because its endpoint path contains only a single forward slash (/metrics).

pintohutch commented 1 month ago

Hey @deekthesqueak - thanks for raising this. This is indeed a bug. We'll work on a patch and link this to the PR.