alxndr42 opened 3 years ago
Since dropping labels in Prometheus seems to create new problems, I've added customizable label filtering on finalLabels to seriesCache.refresh(). It seems to work fine in my local build of the sidecar. Would you be interested in receiving a PR for this?
I am using the config file to define filters, i.e.:
label_filters:
  - metric: "^istio_(request|response|tcp).*"
    allow:
      - app
      - destination_canonical_service
      - instance
      - job
      - kubernetes_namespace
      - kubernetes_pod_name
      - response_code
      - source_canonical_service
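For illustration, roughly how the sidecar can be pointed at that file (a sketch only; the path is just an example, and it assumes the label_filters block lives in the configuration file passed via --config-file):

# hypothetical invocation; XXX and the config path are placeholders
stackdriver-prometheus-sidecar \
  --stackdriver.project-id=XXX \
  --prometheus.wal-directory=/data/wal \
  --config-file=/etc/sidecar/sidecar.yml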
Hello! I'm facing the exact same issue!
Here are the steps to reproduce:
1. Create a GKE cluster (1.19.9-gke.1400) and follow the Istio setup guide for GKE: kubectl label namespace default istio-injection=enabled
2. Open the Prometheus dashboard with istioctl dashboard prometheus, and look for the metric istio_requests_total (for example).
So far so good: this is a standard Istio installation, with some dummy istio-sidecar metrics from the bookinfo services and a Prometheus instance. The Istio metrics are visible in Prometheus, and scraping is working as expected. Then the goal is to deploy the stackdriver-prometheus-sidecar as documented in Using Prometheus.
A service account for the sidecar is created, and the service account key exported as a kubernetes secret, e.g.
$ gcloud iam service-accounts create prometheus-stackdriver --display-name prometheus-stackdriver-service-account
$ PROMETHEUS_STACKDRIVER_SA_EMAIL=$(gcloud iam service-accounts list --filter="displayName:prometheus-stackdriver-service-account" --format='value(email)')
$ gcloud projects add-iam-policy-binding ${PROJECT_ID} --role roles/monitoring.metricWriter --member serviceAccount:${PROMETHEUS_STACKDRIVER_SA_EMAIL}
$ gcloud iam service-accounts keys create prometheus-stackdriver-service-account.json --iam-account ${PROMETHEUS_STACKDRIVER_SA_EMAIL}
$ kubectl -n istio-system create secret generic prometheus-stackdriver-service-account --from-file=key.json=prometheus-stackdriver-service-account.json
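A couple of optional sanity checks before wiring up the sidecar (the gcloud flatten/filter pattern below is just one way to confirm the binding):

# confirm the monitoring.metricWriter role is bound to the service account
$ gcloud projects get-iam-policy ${PROJECT_ID} \
    --flatten="bindings[].members" \
    --filter="bindings.members:${PROMETHEUS_STACKDRIVER_SA_EMAIL}" \
    --format="value(bindings.role)"

# confirm the secret exists in the namespace the sidecar will run in
$ kubectl -n istio-system get secret prometheus-stackdriver-service-account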
Then, the following patch is applied to the prometheus (Istio) deployment to add the stackdriver-prometheus-sidecar:
# prometheus-patch.yaml
spec:
  template:
    spec:
      volumes:
        - name: google-cloud-key
          secret:
            secretName: prometheus-stackdriver-service-account
      containers:
        - name: sidecar
          image: gcr.io/stackdriver-prometheus/stackdriver-prometheus-sidecar:0.8.2
          imagePullPolicy: Always
          args:
            - "--stackdriver.project-id=XXX"
            - "--prometheus.wal-directory=/data/wal"
            - "--stackdriver.kubernetes.location=XXX"
            - "--stackdriver.kubernetes.cluster-name=XXX"
            - "--log.level=debug"
          ports:
            - name: sidecar
              containerPort: 9091
          env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /var/secrets/google/key.json
          volumeMounts:
            - name: storage-volume
              mountPath: /data
            - name: google-cloud-key
              mountPath: /var/secrets/google
with, e.g.:
$ kubectl -n istio-system patch deployment prometheus --type strategic --patch="$(cat prometheus-patch.yaml)"
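To confirm the patch actually added the container, and to tail only the sidecar's logs (the -c name matches the container name used in the patch above):

$ kubectl -n istio-system get deployment prometheus \
    -o jsonpath='{.spec.template.spec.containers[*].name}'
$ kubectl -n istio-system logs deployment/prometheus -c sidecar --follow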
Now, looking at the logs of the stackdriver-prometheus-sidecar, I get, among others, these error messages:
level=debug ts=2021-06-02T19:16:29.026Z caller=series_cache.go:395 component="Prometheus reader" msg="too many labels" labels="{__name__=\"istio_request_duration_milliseconds_bucket\",app=\"productpage\",connection_security_policy=\"unknown\",destination_app=\"reviews\",destination_canonical_revision=\"v3\",destination_canonical_service=\"reviews\",destination_cluster=\"Kubernetes\",destination_principal=\"spiffe://cluster.local/ns/default/sa/bookinfo-reviews\",destination_service=\"reviews.default.svc.cluster.local\",destination_service_name=\"reviews\",destination_service_namespace=\"default\",destination_version=\"v3\",destination_workload=\"reviews-v3\",destination_workload_namespace=\"default\",instance=\"10.72.1.9:15020\",istio_io_rev=\"default\",job=\"kubernetes-pods\",kubernetes_namespace=\"default\",kubernetes_pod_name=\"productpage-v1-6b746f74dc-mgblb\",le=\"3600000\",pod_template_hash=\"6b746f74dc\",reporter=\"source\",request_protocol=\"http\",response_code=\"200\",response_flags=\"-\",security_istio_io_tlsMode=\"istio\",service_istio_io_canonical_name=\"productpage\",service_istio_io_canonical_revision=\"v1\",source_app=\"productpage\",source_canonical_revision=\"v1\",source_canonical_service=\"productpage\",source_cluster=\"Kubernetes\",source_principal=\"spiffe://cluster.local/ns/default/sa/bookinfo-productpage\",source_version=\"v1\",source_workload=\"productpage-v1\",source_workload_namespace=\"default\",version=\"v1\"}"
At this point, just like @7adietri, I'm looking for a lightweight way to transform those Istio metrics so that they can be pushed to Cloud Monitoring. Ideally, the solution should be easily maintainable. From the Quotas and limits page for custom metrics, it looks like the maximum number of labels is 10.
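To see how far over that limit a given series is, the labels can be counted via the Prometheus HTTP API (a sketch; it assumes Prometheus is reachable on localhost:9090, e.g. through istioctl dashboard prometheus, and that jq is installed):

# count label names (including __name__) on the first matching series
$ curl -sg 'http://localhost:9090/api/v1/series?match[]=istio_requests_total' \
    | jq '.data[0] | keys | length'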
@salanfe Check out PR https://github.com/Stackdriver/stackdriver-prometheus-sidecar/pull/283
We're currently deploying a custom build of the sidecar image, but I hope the PR will be accepted, so that we can switch back to the official image.
As a GCP customer, I have had success in the past getting a response by opening a support case and asking for a PR to be reviewed and merged. I'm currently running into this exact issue, so I'm opening a support case now :)
@7adietri, could you please be more specific about the issues you ran into with metric_relabel_configs (what was your exact configuration, what were the exact errors in the sidecar log, what environment you were running in)? Before we consider adding new functionality to the sidecar, it would help us to understand why the existing solution (metric relabeling on the Prometheus side) does not cover your use case. Thanks.
@igorpeshansky Ok, so this is a typical error message (only visible at debug level) when using the sidecar in a cluster with Knative/Istio:
level=debug ts=2021-07-29T11:13:36.704Z caller=series_cache.go:395 component="Prometheus reader" msg="too many labels" labels="{__name__=\"istio_requests_total\",app=\"test1-00001\",connection_security_policy=\"mutual_tls\",destination_app=\"test1-00001\",destination_canonical_revision=\"test1-00001\",destination_canonical_service=\"test1\",destination_principal=\"spiffe://cluster.local/ns/default/sa/default\",destination_service=\"test1-00001-private.default.svc.cluster.local\",destination_service_name=\"test1-00001-private\",destination_service_namespace=\"default\",destination_version=\"unknown\",destination_workload=\"test1-00001-deployment\",destination_workload_namespace=\"default\",instance=\"10.76.1.13:15020\",istio_io_rev=\"default\",job=\"kubernetes-pods\",kubernetes_namespace=\"default\",kubernetes_pod_name=\"test1-00001-deployment-844f655ddc-jsbl2\",pod_template_hash=\"844f655ddc\",reporter=\"destination\",request_protocol=\"http\",response_code=\"200\",response_flags=\"-\",security_istio_io_tlsMode=\"istio\",service_istio_io_canonical_name=\"test1\",service_istio_io_canonical_revision=\"test1-00001\",serving_knative_dev_configuration=\"test1\",serving_knative_dev_configurationGeneration=\"1\",serving_knative_dev_configurationUID=\"07775999-0632-4a06-9680-94e82c663bb8\",serving_knative_dev_revision=\"test1-00001\",serving_knative_dev_revisionUID=\"3cce0d59-b49a-4ed3-90e9-4c8881d14c75\",serving_knative_dev_service=\"test1\",serving_knative_dev_serviceUID=\"07a4b48c-4bce-4f8c-a230-bb2223de4ecc\",source_app=\"activator\",source_canonical_revision=\"latest\",source_canonical_service=\"activator\",source_principal=\"spiffe://cluster.local/ns/knative-serving/sa/controller\",source_version=\"unknown\",source_workload=\"activator\",source_workload_namespace=\"knative-serving\"}"
Definitely more than 10 labels. Let's start with removing all the source_ and destination_ labels. I add this to the scrape_configs in prometheus.yml:
metric_relabel_configs:
  - regex: "^(source|destination)_.*"
    action: labeldrop
Looks good in the Prometheus UI:
istio_requests_total{app="test1-00001",connection_security_policy="mutual_tls",instance="10.76.1.13:15020",istio_io_rev="default",job="kubernetes-pods",kubernetes_namespace="default",kubernetes_pod_name="test1-00001-deployment-844f655ddc-jsbl2",pod_template_hash="844f655ddc",reporter="destination",request_protocol="http",response_code="200",response_flags="-",security_istio_io_tlsMode="istio",service_istio_io_canonical_name="test1",service_istio_io_canonical_revision="test1-00001",serving_knative_dev_configuration="test1",serving_knative_dev_configurationGeneration="1",serving_knative_dev_configurationUID="07775999-0632-4a06-9680-94e82c663bb8",serving_knative_dev_revision="test1-00001",serving_knative_dev_revisionUID="3cce0d59-b49a-4ed3-90e9-4c8881d14c75",serving_knative_dev_service="test1",serving_knative_dev_serviceUID="07a4b48c-4bce-4f8c-a230-bb2223de4ecc"}
But in the sidecar log, new error messages start appearing:
level=debug ts=2021-07-29T12:15:26.814Z caller=client.go:202 component=storage msg="Partial failure calling CreateTimeSeries" err="rpc error: code = InvalidArgument desc = Field timeSeries[17].points[0].distributionValue had an invalid value: Distribution |explicit_buckets.bounds| entry 1 has a value of 0.5 which is less than the value of entry 0 which is 0.5."
level=warn ts=2021-07-29T12:15:26.814Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = Field timeSeries[17].points[0].distributionValue had an invalid value: Distribution |explicit_buckets.bounds| entry 1 has a value of 0.5 which is less than the value of entry 0 which is 0.5."
So already this isn't working with the sidecar, and I'm not even close to 10 labels.
On the other hand, everything works perfectly fine when I switch to a sidecar image built from https://github.com/Stackdriver/stackdriver-prometheus-sidecar/pull/283 and use the following sidecar.yml:
metric_label_filters:
  - metric: "^istio_(request|response|tcp).*"
    allow:
      - istio_canonical_name
      - istio_canonical_revision
      - reporter
      - request_protocol
      - response_code
      - response_flags
(The istio_ labels are mapped from service_istio_io_ in Prometheus, because the sidecar drops those.)
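For reference, a sketch of how such a mapping can be done with standard Prometheus relabel actions (the exact rules in use may differ):

metric_relabel_configs:
  # copy service_istio_io_<x> into istio_<x> so the allow-list above can match them
  - action: labelmap
    regex: "service_istio_io_(.+)"
    replacement: "istio_$1"
  # then drop the originals, which the sidecar discards anyway
  - action: labeldrop
    regex: "service_istio_io_.*"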
I've deployed the sidecar on a cluster with Istio/Knative, and some metrics aren't showing up in the Metrics Explorer. The sidecar log didn't reveal anything at first; only after turning on debug logging did I see a ton of too many labels errors. (This in itself seems like a bug, or worthy of mentioning in the README.) According to the error messages, the problematic metrics have almost 40 labels, thanks to Istio/Knative.
So I tried to configure Prometheus via metric_relabel_configs to drop some labels, but so far many labels cannot be removed without causing target not found errors in the sidecar log (and metrics still not showing up). I'm not sure why e.g. security_istio_io_tlsMode is a critical label to have.
Do you have a suggested approach to reducing the number of labels while keeping the sidecar happy?