Stackdriver / stackdriver-prometheus-sidecar

A sidecar for the Prometheus server that can send metrics to Stackdriver.
https://cloud.google.com/monitoring/kubernetes-engine/prometheus
Apache License 2.0

Add information to README about "too many labels" and possible solutions #278

Open · alxndr42 opened 3 years ago

alxndr42 commented 3 years ago

I've deployed the sidecar on a cluster with Istio/Knative, and some metrics aren't showing up in the Metrics Explorer. The sidecar log didn't reveal anything at first; only after turning on debug logging did I see a ton of "too many labels" errors. (This in itself seems like a bug, or at least worth mentioning in the README.) According to the error messages, the problematic metrics have almost 40 labels, thanks to Istio/Knative.

So I tried to configure Prometheus via metric_relabel_configs to drop some labels, but so far many labels cannot be removed without causing "target not found" errors in the sidecar log (and the metrics still not showing up). I'm not sure why e.g. security_istio_io_tlsMode would be a critical label to have.

Do you have a suggested approach for reducing the number of labels while keeping the sidecar happy?

alxndr42 commented 3 years ago

Since dropping labels in Prometheus seems to create new problems, I've added customizable label filtering on finalLabels to seriesCache.refresh(). It seems to work fine in my local build of the sidecar. Would you be interested in receiving a PR for this?

I am using the config file to define the filters; metrics matching the metric regex keep only the labels listed under allow. For example:

label_filters:
  - metric: "^istio_(request|response|tcp).*"
    allow:
      - app
      - destination_canonical_service
      - instance
      - job
      - kubernetes_namespace
      - kubernetes_pod_name
      - response_code
      - source_canonical_service
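
With this filter, a matching series keeps only the eight allowed labels, so it comes out roughly like this (the values here are illustrative, not from a real cluster):

istio_requests_total{app="test1-00001",destination_canonical_service="test1",instance="10.76.1.13:15020",job="kubernetes-pods",kubernetes_namespace="default",kubernetes_pod_name="test1-00001-deployment-844f655ddc-jsbl2",response_code="200",source_canonical_service="activator"}
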
salanfe commented 3 years ago

Hello! I'm facing the exact same issue!

Here are the steps to reproduce:

  1. Create a standard GKE cluster (I'm using 1.19.9-gke.1400) and follow the Istio setup guide for GKE.
  2. Install the Istio Operator and enable sidecar injection in the default namespace: kubectl label namespace default istio-injection=enabled
  3. Deploy the bookinfo app.
  4. Deploy Prometheus.
  5. Generate some dummy traffic on hostname:port/productpage.
  6. Check that the metrics are available in Prometheus, e.g. with istioctl dashboard prometheus, and look for a metric such as istio_requests_total.
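
For reference, steps 2-6 boil down to roughly the following commands (the manifest paths are from an Istio release's samples directory and GATEWAY_URL is set as in the bookinfo guide; both are assumptions about a standard setup):

$ kubectl label namespace default istio-injection=enabled
$ kubectl apply -f samples/bookinfo/platform/kube/bookinfo.yaml        # bookinfo app
$ kubectl apply -f samples/bookinfo/networking/bookinfo-gateway.yaml   # expose it via the ingress gateway
$ kubectl apply -f samples/addons/prometheus.yaml                      # prometheus
$ for i in $(seq 1 100); do curl -s -o /dev/null "http://${GATEWAY_URL}/productpage"; done   # dummy traffic
$ istioctl dashboard prometheus                                        # look for istio_requests_total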

So far so good: this is a standard Istio installation, with some dummy Istio sidecar metrics from the bookinfo services and a Prometheus instance. The Istio metrics are visible in Prometheus, so scraping is working as expected. The goal is then to deploy the stackdriver-prometheus-sidecar as documented in Using Prometheus.

A service account for the sidecar is created, and the service account key is exported as a Kubernetes secret, e.g.

$ gcloud iam service-accounts create prometheus-stackdriver --display-name prometheus-stackdriver-service-account

$ PROMETHEUS_STACKDRIVER_SA_EMAIL=$(gcloud iam service-accounts list --filter="displayName:prometheus-stackdriver-service-account" --format='value(email)')

$ gcloud projects add-iam-policy-binding ${PROJECT_ID} --role roles/monitoring.metricWriter --member serviceAccount:${PROMETHEUS_STACKDRIVER_SA_EMAIL}

$ gcloud iam service-accounts keys create prometheus-stackdriver-service-account.json --iam-account ${PROMETHEUS_STACKDRIVER_SA_EMAIL}

$ kubectl -n istio-system create secret generic prometheus-stackdriver-service-account --from-file=key.json=prometheus-stackdriver-service-account.json
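
(Equivalently, the secret can be written as a manifest; the key material below is a placeholder:)

# prometheus-stackdriver-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: prometheus-stackdriver-service-account
  namespace: istio-system
type: Opaque
data:
  key.json: <base64-encoded contents of prometheus-stackdriver-service-account.json>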

Then the following patch is applied to the Prometheus (Istio) deployment, to add the stackdriver-prometheus-sidecar:

# prometheus-patch.yaml
spec:
  template:
    spec:
      volumes:
        # the service account key created above
        - name: google-cloud-key
          secret:
            secretName: prometheus-stackdriver-service-account
      containers:
        - name: sidecar
          image: gcr.io/stackdriver-prometheus/stackdriver-prometheus-sidecar:0.8.2
          imagePullPolicy: Always
          args:
            # replace XXX with your project ID, cluster location and cluster name
            - "--stackdriver.project-id=XXX"
            - "--prometheus.wal-directory=/data/wal"
            - "--stackdriver.kubernetes.location=XXX"
            - "--stackdriver.kubernetes.cluster-name=XXX"
            - "--log.level=debug"
          ports:
            - name: sidecar
              containerPort: 9091
          env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /var/secrets/google/key.json
          volumeMounts:
            # storage-volume is the Prometheus deployment's existing data volume,
            # mounted so the sidecar can read the WAL under /data/wal
            - name: storage-volume
              mountPath: /data
            - name: google-cloud-key
              mountPath: /var/secrets/google

The patch is then applied with, e.g.:

$ kubectl -n istio-system patch deployment prometheus --type strategic --patch="$(cat prometheus-patch.yaml)" 

Now, looking at the logs of the stackdriver-prometheus-sidecar, I get, among others, these error messages:

level=debug ts=2021-06-02T19:16:29.026Z caller=series_cache.go:395 component="Prometheus reader" msg="too many labels" labels="{__name__=\"istio_request_duration_milliseconds_bucket\",app=\"productpage\",connection_security_policy=\"unknown\",destination_app=\"reviews\",destination_canonical_revision=\"v3\",destination_canonical_service=\"reviews\",destination_cluster=\"Kubernetes\",destination_principal=\"spiffe://cluster.local/ns/default/sa/bookinfo-reviews\",destination_service=\"reviews.default.svc.cluster.local\",destination_service_name=\"reviews\",destination_service_namespace=\"default\",destination_version=\"v3\",destination_workload=\"reviews-v3\",destination_workload_namespace=\"default\",instance=\"10.72.1.9:15020\",istio_io_rev=\"default\",job=\"kubernetes-pods\",kubernetes_namespace=\"default\",kubernetes_pod_name=\"productpage-v1-6b746f74dc-mgblb\",le=\"3600000\",pod_template_hash=\"6b746f74dc\",reporter=\"source\",request_protocol=\"http\",response_code=\"200\",response_flags=\"-\",security_istio_io_tlsMode=\"istio\",service_istio_io_canonical_name=\"productpage\",service_istio_io_canonical_revision=\"v1\",source_app=\"productpage\",source_canonical_revision=\"v1\",source_canonical_service=\"productpage\",source_cluster=\"Kubernetes\",source_principal=\"spiffe://cluster.local/ns/default/sa/bookinfo-productpage\",source_version=\"v1\",source_workload=\"productpage-v1\",source_workload_namespace=\"default\",version=\"v1\"}"

At this point, just like @7adietri, I'm looking for a lightweight way to transform those Istio metrics so that they can be pushed to Cloud Monitoring. Ideally, the solution should be easily maintainable. From the Quotas and limits page for custom metrics, it looks like the maximum number of labels is 10.

alxndr42 commented 3 years ago

@salanfe Check out PR https://github.com/Stackdriver/stackdriver-prometheus-sidecar/pull/283

We're currently deploying a custom build of the sidecar image, but I hope the PR will be accepted, so that we can switch back to the official image.

Naterd commented 3 years ago

In the past, as a GCP customer, I've had success getting a response by opening a support case and asking for a PR to be reviewed and merged. Since I'm currently running into this exact issue, I'm opening a support case now :)

igorpeshansky commented 3 years ago

@7adietri, could you please be more specific about the issues you ran into with metric_relabel_configs (what was your exact configuration, what were the exact errors in the sidecar log, and what environment were you running in)? Before we consider adding new functionality to the sidecar, it would help us to understand why the existing solution (metric relabeling on the Prometheus side) does not cover your use case. Thanks.

alxndr42 commented 3 years ago

@igorpeshansky OK, so this is a typical error message (only visible at debug level) when using the sidecar in a cluster with Knative/Istio:

level=debug ts=2021-07-29T11:13:36.704Z caller=series_cache.go:395 component="Prometheus reader" msg="too many labels" labels="{__name__=\"istio_requests_total\",app=\"test1-00001\",connection_security_policy=\"mutual_tls\",destination_app=\"test1-00001\",destination_canonical_revision=\"test1-00001\",destination_canonical_service=\"test1\",destination_principal=\"spiffe://cluster.local/ns/default/sa/default\",destination_service=\"test1-00001-private.default.svc.cluster.local\",destination_service_name=\"test1-00001-private\",destination_service_namespace=\"default\",destination_version=\"unknown\",destination_workload=\"test1-00001-deployment\",destination_workload_namespace=\"default\",instance=\"10.76.1.13:15020\",istio_io_rev=\"default\",job=\"kubernetes-pods\",kubernetes_namespace=\"default\",kubernetes_pod_name=\"test1-00001-deployment-844f655ddc-jsbl2\",pod_template_hash=\"844f655ddc\",reporter=\"destination\",request_protocol=\"http\",response_code=\"200\",response_flags=\"-\",security_istio_io_tlsMode=\"istio\",service_istio_io_canonical_name=\"test1\",service_istio_io_canonical_revision=\"test1-00001\",serving_knative_dev_configuration=\"test1\",serving_knative_dev_configurationGeneration=\"1\",serving_knative_dev_configurationUID=\"07775999-0632-4a06-9680-94e82c663bb8\",serving_knative_dev_revision=\"test1-00001\",serving_knative_dev_revisionUID=\"3cce0d59-b49a-4ed3-90e9-4c8881d14c75\",serving_knative_dev_service=\"test1\",serving_knative_dev_serviceUID=\"07a4b48c-4bce-4f8c-a230-bb2223de4ecc\",source_app=\"activator\",source_canonical_revision=\"latest\",source_canonical_service=\"activator\",source_principal=\"spiffe://cluster.local/ns/knative-serving/sa/controller\",source_version=\"unknown\",source_workload=\"activator\",source_workload_namespace=\"knative-serving\"}"

Definitely more than 10 labels. Let's start by removing all the source_ and destination_ labels. I add this to the scrape_configs in prometheus.yml:

      metric_relabel_configs:
      - regex: "^(source|destination)_.*"
        action: labeldrop

Looks good in the Prometheus UI:

istio_requests_total{app="test1-00001",connection_security_policy="mutual_tls",instance="10.76.1.13:15020",istio_io_rev="default",job="kubernetes-pods",kubernetes_namespace="default",kubernetes_pod_name="test1-00001-deployment-844f655ddc-jsbl2",pod_template_hash="844f655ddc",reporter="destination",request_protocol="http",response_code="200",response_flags="-",security_istio_io_tlsMode="istio",service_istio_io_canonical_name="test1",service_istio_io_canonical_revision="test1-00001",serving_knative_dev_configuration="test1",serving_knative_dev_configurationGeneration="1",serving_knative_dev_configurationUID="07775999-0632-4a06-9680-94e82c663bb8",serving_knative_dev_revision="test1-00001",serving_knative_dev_revisionUID="3cce0d59-b49a-4ed3-90e9-4c8881d14c75",serving_knative_dev_service="test1",serving_knative_dev_serviceUID="07a4b48c-4bce-4f8c-a230-bb2223de4ecc"}

But in the sidecar log, new error messages start appearing:

level=debug ts=2021-07-29T12:15:26.814Z caller=client.go:202 component=storage msg="Partial failure calling CreateTimeSeries" err="rpc error: code = InvalidArgument desc = Field timeSeries[17].points[0].distributionValue had an invalid value: Distribution |explicit_buckets.bounds| entry 1 has a value of 0.5 which is less than the value of entry 0 which is 0.5."
level=warn ts=2021-07-29T12:15:26.814Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = Field timeSeries[17].points[0].distributionValue had an invalid value: Distribution |explicit_buckets.bounds| entry 1 has a value of 0.5 which is less than the value of entry 0 which is 0.5."

So this already isn't working with the sidecar, and I'm not even close to 10 labels. (Presumably the labeldrop merges histogram series that differed only in the dropped labels, so the sidecar ends up with duplicate le buckets when it assembles the distribution.)

On the other hand, everything works perfectly fine when I switch to a sidecar image built from https://github.com/Stackdriver/stackdriver-prometheus-sidecar/pull/283 and use the following sidecar.yml:

    metric_label_filters:
      - metric: "^istio_(request|response|tcp).*"
        allow:
          - istio_canonical_name
          - istio_canonical_revision
          - reporter
          - request_protocol
          - response_code
          - response_flags

(The istio_ labels are mapped from service_istio_io_ in Prometheus, because the sidecar drops those.)
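
For completeness, that mapping is a standard labelmap rule in prometheus.yml; the exact regex below is illustrative:

      metric_relabel_configs:
      # copy service_istio_io_* label values to istio_* label names
      - regex: "service_istio_io_(.+)"
        action: labelmap
        replacement: "istio_$1"

(labelmap copies the values to the new label names; the original service_istio_io_ labels remain in Prometheus, but as noted, the sidecar drops those anyway.)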