Thanks for the report. I filed an internal tracker issue and someone will try to help you soon. If you are willing to try something, could you see if using a metric name without any colons (:) works? It's just a hunch, but worth trying if you have the time.
Nope, replacing colons (:) with underscores (_) does not help.
I have a similar issue. I'm exporting a gauge metric and the metric itself is autocreated in Stackdriver with no problems. Labels are correct etc. But all the values are 0 in Stackdriver, even though prometheus clearly shows them varying as containers change status. I've even verified this via the api and getting the raw timeseries from stackdriver.
static_metadata:
  - metric: kube_pod_container_status_waiting_reason
    type: gauge
Other metrics that I've checked seem to work. Let me know if there's anything I can do to assist.
^ worth noting: this is not a recorded rule, just a metric collected from kube-state-metrics. I merely added the explicit static_metadata entry to see if it changed anything.
@oliver-neubauer-ecobee please file a new issue if you're not using recorded metrics. Also, the original issue is about metrics not being sent at all; metrics being sent with 0 values sounds like a different problem.
The same issue is present in v0.4.1. In the debug log I see a target not found message, and the dropped-samples stat is recorded with reason target_not_found.
Thanks for reporting, @soymo. Just a clarification question: do you see the target not found message on recorded metrics in v0.4.1?
Also, if possible, can you create a new issue describing the following setup:
Thanks!
I confirmed the bug is still happening in version 0.4.3, and I also see the target_not_found error after creating recording rules and adding static_metadata.
I reproduced it by setting recording rules as written in the first post:
- name: custom_group.rules
  rules:
    - expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[3m])) BY (instance, job)
      record: instance:node_cpu_custom:rate:sum
I observed that I can see the metric instance:node_cpu_custom:rate:sum in the Prometheus UI.
I also added the static_metadata section as provided in stackdriver-prometheus-sidecar's configuration:
static_metadata:
  - metric: instance:node_cpu_custom:rate:sum
    type: gauge
    help: node cpu ratio
I then turned on debug logging and observed that series_cache couldn't find a target for __name__ instance:node_cpu_custom:rate:sum (along with the job and instance labels).
Tomorrow I will spend some time troubleshooting the issue.
I can confirm that this is still happening with version 0.5.0 of the sidecar and Prometheus v2.10.0. Better yet, I believe I can tell you why:
The targets cache in targets/cache.go uses a concatenation of the job and instance labels as the key to the cache. But in many cases, recording rules will aggregate across jobs/instances and those labels will be lost. With debug logging on, you will get a misleading error message suggesting that the timeseries cannot be retrieved (https://github.com/Stackdriver/stackdriver-prometheus-sidecar/blob/master/retrieval/series_cache.go#L346-L355); what's actually failing is the cache lookup.
The Prometheus Labels struct exposes a Hash() method (https://github.com/prometheus/prometheus/blob/master/pkg/labels/labels.go#L121-L131) which would seem to be appropriate for exactly this purpose of serving as a cache key, but my attempt to replace targets.cacheKey() with:
func cacheKey(lset labels.Labels) string {
	return strconv.FormatUint(lset.Hash(), 10)
}
...alters enough of the caching behavior that targets/cache_test expects that I'm not sure how to validate whether it is, or could be made, safe.
@StevenYCChou I've spent a bit more time trying to debug this and have not gotten very far: if you've got any suggestions I'm all ears.
Hi @memory - sorry I didn't provide an update since my last comment. Last time, we suspected that the logic in targetMatch may not be correct.
Based on the comment on this function, it should only check a label's value if lset actually has that label set. The existing code checks lset.Get(tl.Name) != tl.Value directly, but lset.Get(tl.Name) returns an empty string if lset does not contain tl.Name.
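A minimal sketch of that idea (the signature and types here are illustrative, not the sidecar's real ones): a target label should only disqualify the series when lset actually defines that label with a different value.
// Sketch only: treat an absent label as a match instead of comparing it
// against the empty string that lset.Get returns for missing labels.
func targetMatch(lset, targetLabels labels.Labels) bool {
	for _, tl := range targetLabels {
		if v := lset.Get(tl.Name); v != "" && v != tl.Value {
			return false
		}
	}
	return true
}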
I didn't commit the change I was debugging, so I just created a branch in my personal repository to track the fix. I don't think it's ready to be sent as a PR because I haven't fully tested it yet. Can you try it out? If you want, you can build on top of it and send a PR to fix the bug you are facing.
Last thing - a metric name with ":" inside is not a valid Stackdriver metric type (see the naming rules), so you may still not see metrics if the recorded name is instance:node_cpu_custom:rate:sum, as described in https://github.com/Stackdriver/stackdriver-prometheus-sidecar/issues/104#issue-419958292.
Let us know what you think, @memory.
@StevenYCChou so, some progress:
- If I attempt to forward a recorded metric that aggregates away the job/instance labels, it still fails with the same cache lookup errors. :(
- If I include the job/instance labels (suboptimal, but one thing at a time), we get somewhere:
level=debug ts=2019-08-29T15:10:10.164Z caller=client.go:170 component=storage msg="Partial failure calling CreateTimeSeries" err="rpc error: code = InvalidArgument desc = Errors during metric descriptor creation: {(metric: external.googleapis.com/prometheus/dst_service:response_latency_ms:histogram_quantile95_rate300s, error: The metric type is invalid.), (metric: external.googleapis.com/prometheus/dst_service:response_latency_ms:histogram_quantile99_rate300s, error: The metric type is invalid.)}."
level=warn ts=2019-08-29T15:10:10.164Z caller=queue_manager.go:546 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = Errors during metric descriptor creation: {(metric: external.googleapis.com/prometheus/dst_service:response_latency_ms:histogram_quantile95_rate300s, error: The metric type is invalid.), (metric: external.googleapis.com/prometheus/dst_service:response_latency_ms:histogram_quantile99_rate300s, error: The metric type is invalid.)}."
The API documentation for metricDescriptors only notes that the type field is a string, and doesn't suggest that there are any length/content limitations on it?
Further testing: if I drastically shorten the name of the recorded metric and remove the colons:
- record: svc_response_latency_ms_q50_r30s
  expr: histogram_quantile(0.50, sum(rate(response_latency_ms_bucket{namespace="default", direction="outbound", dst_service=~".+"}[30s])) by (le, dst_service, instance, job))
...the errors go away, which is good, but now we're back to literally nothing happening on the Stackdriver side. No debug output at all from the sidecar, and external.googleapis.com/prometheus/svc_response_latency_ms_q50_r30s does not appear in the output of metricDescriptors.list.
Okay, I realized that the shortened names weren't being caught by my --include regex. With that fixed, we move on to the next error:
level=debug ts=2019-08-29T17:43:35.007Z caller=series_cache.go:406 component="Prometheus reader" msg="metadata not found" metric_name=svc_ltn_q99_r300s
level=debug ts=2019-08-29T17:43:35.007Z caller=series_cache.go:406 component="Prometheus reader" msg="metadata not found" metric_name=svc_ltn_q99_r300s
...which means we're dropping the whole series I believe, per https://github.com/Stackdriver/stackdriver-prometheus-sidecar/blob/master/retrieval/series_cache.go#L403-L407
...which leads me to the smoking gun in https://github.com/Stackdriver/stackdriver-prometheus-sidecar/blob/master/metadata/cache.go#L124-L133:
// The metric might also be produced by a recording rule, which by convention
// contain at least one `:` character. In that case we can generally assume that
// it is a gauge. We leave the help text empty.
if strings.Contains(metric, ":") {
	return &scrape.MetricMetadata{
		Metric: metric,
		Type:   textparse.MetricTypeGauge,
	}, nil
}
return nil, nil
So the metadata cache will only detect a recorded rule if there is a colon in the metric name, but colons aren't allowed in stackdriver custom metric names, so we're stuck: if we don't put colons in the name, the metadata cache returns nil, but if we do put colons in the name stackdriver will never accept them. 🤦‍♂
My best suggestion here is to check for metric names that begin with a string like recorded_ and document that?
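Something like this, as a hypothetical variant of the fallback quoted above (recorded_ is an arbitrary prefix I'm making up, not something the sidecar currently recognizes):
// Hypothetical extension of the metadata fallback in metadata/cache.go:
// also treat metrics carrying an agreed-upon prefix as recording-rule
// gauges, since Stackdriver rejects the ":" naming convention.
const recordedRulePrefix = "recorded_"

if strings.Contains(metric, ":") || strings.HasPrefix(metric, recordedRulePrefix) {
	return &scrape.MetricMetadata{
		Metric: metric,
		Type:   textparse.MetricTypeGauge,
	}, nil
}
return nil, nil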
Hi memory@,
Thanks for sharing all the troubleshooting information with us; it's really valuable, and I really appreciate your efforts!
First, some comments on job/instance labels:
- If I attempt to forward a recorded metric that aggregates away the job/instance labels, it still fails with the same cache lookup errors. :(
Labels job and instance are required for integration with recording rules. Here is a quote from Integration Issues [1]:
Recording rules that change or remove either the job or instance labels aren't supported.
The reason is that stackdriver-prometheus-sidecar uses these two well-known labels to construct a Stackdriver MonitoredResource for your Kubernetes object (this is also answered in Integration Issues [1]).
For the PR you created, I have briefly reviewed it, and I have thought about the following 2 higher-level questions for any recording rules with colons:
Unfortunately, I don’t have the bandwidth before the end of September to answer the 2 questions above. I think it requires coming up with reasonable options and analyzing the pros and cons of each.
For the issue you have here, I’d like to provide a direction to fix the code and a potential workaround for recording rules, if you don’t mind trying it out, though I couldn’t verify it myself before the end of September:
Let me know what you think, @memory.
[1] https://cloud.google.com/monitoring/kubernetes-engine/prometheus#prometheus_integration_issues
[2] https://prometheus.io/docs/concepts/data_model/
Okay, we're in agreement that forwarding metrics that lack instance/job labels isn't currently supported and would require a substantial rewrite of the sidecar to support. I'd gently suggest that, given Stackdriver's stratospheric per-metric cost, this is a limitation that should be addressed at some point: "just forward the raw metrics and do the aggregation on the Stackdriver side" is a strategy that in many cases has a large number of dollar signs attached to it. But this is obviously a larger issue than this one particular bug and isn't going to get solved in the short term.
For the nearer-term issues, I'll try to update my PR with the test cases you suggested and update the documentation with a suggested configuration for renaming and properly labelling recorded metrics.
@memory have you checked out the Counter Aggregator feature? It was added for your use case.
@jkohen I did, but unfortunately (a) the raw metrics in question here were histograms, not gauges, and (b) I needed to group them by the labels that I intended to preserve (ie: for every deployment, aggregate the 50/95/99th histogram_quantile for request latency across all of the deployment's pods without having to manually specify each deployment name in the sidecar config); there did not seem to be a way to do either of those things with the counter aggregator although if I missed it I'd be happy to hear it.
@memory it works for counters; I don't think it'll give you the right result for gauges and histograms. It should in principle work for histograms, with some additional work. We need to prioritize that with other work we have on our plates. Once we're ready, we'll make sure it covers your use case. Thanks for the feedback and sorry we don't have a better answer right now.
@memory Just curious - based on the recording rules you provided, are you trying to get quantile X from a Prometheus Histogram?
The Prometheus Histogram type will be transformed into a Stackdriver Distribution time series, and you can aggregate by the Xth percentile and group by your labels as well.
If you have any other use case beyond the observation I have above, please let me know too.
@StevenYCChou yeah, I'm using recording rules to get a quantile from the prometheus histogram and forwarding it to stackdriver as a simple metric, e.g.:
- record: recorded_svc_latency_ms_p50_30s
  expr: histogram_quantile(0.50, sum(rate(response_latency_ms_bucket{namespace="default", direction="outbound", dst_service=~".+"}[30s])) by (le, dst_service, instance, job))
Unfortunately, simply forwarding the histogram directly isn't possible for us and is how we got into this situation in the first place. The source histogram has more labels than the current stackdriver limit, so we were attempting to use recording rules to drop/aggregate across labels that we didn't need in stackdriver.
Not sure if this is related, but we also have some metrics that are not properly forwarded to Stackdriver. While metrics collected from the services directly are properly propagated, metrics collected from the Pushgateway do not seem to work.
When setting the log level to debug, we also get the target not found message for exactly the metrics that are missing in Stackdriver.
level=debug ts=2019-12-18T16:57:43.193Z caller=series_cache.go:354 component="Prometheus reader" msg="target not found" labels="{__name__=\"core_assortment_articles_total_count\",job=\"core-assortment\"}"
Versions in use are sidecar 0.7.0 and Prometheus v2.13.1. The metric type is Gauge.
I am not sure if this fits into the same issue discussed here, if that is not the case let me know and I will create a separate issue.
Edit: I just converted the service to be scraped by Prometheus directly instead of using the Pushgateway, and the metrics were properly forwarded, so there is some issue when using the Pushgateway together with stackdriver-prometheus-sidecar.
I am also getting the same issue. I have a recording rule as below:
- record: free_memory_percent
  expr: 100 - 100 * (node_memory_MemFree_bytes / node_memory_MemTotal_bytes)
but I am unable to see it in Stackdriver; somehow the Stackdriver sidecar is not sending this metric. The recorded rule is visible on the Prometheus metrics dashboard.
Please help me here.
It's been about 4 months since the last update to this issue, and nearly a year since it was first opened. I'm currently tottering along with the workaround from https://github.com/Stackdriver/stackdriver-prometheus-sidecar/pull/159 but it would be nice to have some indication of whether and when this is likely to be fixed upstream.
Gentle ping, one quarter later. Obviously the pandemic has derailed many a timetable; I hope everyone involved here is healthy and safe and keeping it that way should be everyone's #1 priority.
Somewhere deep below that on the list of priorities: is this on anyone's radar? :)
I have the same problem when trying to send data to local Kubernetes.
Everything is complicated by the fact that I cannot find any similar complete examples in the documentation. Please help if someone has something similar.
We are not talking about custom metrics, but about Kubernetes metrics themselves.
global:
  scrape_interval: 10s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'k8s_metrics'
    scrape_interval: 10s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__="kube_deployment_labels"}'
    static_configs:
      - targets:
          - 'xxxxxx'
My docker-compose
version: '3.3'
services:
  volumes-provisioner:
    image: hasnat/volumes-provisioner
    environment:
      PROVISION_DIRECTORIES: "65534:65534:0755:/prometheus/data"
    volumes:
      - ./prom_data_k8s:/prometheus/data
  prometheus:
    image: prom/prometheus:v2.19.3
    restart: always
    depends_on:
      - volumes-provisioner
    ports:
      - 9080:9090
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.min-block-duration=15m
      - --storage.tsdb.retention=24h
    volumes:
      - ./prometheus-old-k8s.yml:/etc/prometheus/prometheus.yml:ro
      - ./prom_data_k8s:/prometheus/data
  sidecar:
    image: gcr.io/stackdriver-prometheus/stackdriver-prometheus-sidecar:0.8.0
    restart: always
    depends_on:
      - volumes-provisioner
      - prometheus
    command:
      - '--prometheus.wal-directory=/prometheus/data/wal'
      - '--prometheus.api-address=http://prometheus:9090/'
      - '--stackdriver.project-id=xxxxxxx'
      - '--stackdriver.generic.location=global'
      - '--stackdriver.generic.namespace="test-namespace"'
      - '--stackdriver.kubernetes.location=global'
      - '--stackdriver.kubernetes.cluster-name=oldk8s'
      - '--stackdriver.metrics-prefix=kubernetes.io/anthos'
      - '--log.level=debug'
    volumes:
      - ./prom_data_k8s:/prometheus/data
      - ./monitoring-metric-uploader.json:/etc/monitoring-metric-uploader.json:ro
    environment:
      - GOOGLE_APPLICATION_CREDENTIALS=/etc/monitoring-metric-uploader.json
level=debug ts=2020-08-05T09:51:45.245Z caller=series_cache.go:354 component="Prometheus reader" msg="target not found" labels="{__name__=\"kube_deployment_labels\",deployment=\"qrcode-service\",endpoint=\"https-main\",instance=\"10.244.66.89:8443\",job=\"kube-state-metrics\",label_app=\"qrcode\",label_chart=\"api-service-chart-0.1.0\",label_heritage=\"Tiller\",label_release=\"qrcode\",namespace=\"apis\",pod=\"kube-state-metrics-5456d85ddc-j86ld\",service=\"kube-state-metrics\"}"
@StevenYCChou it's approaching a year since the last substantial update on this issue. Is this expected to be addressed any time soon?
This has been a bit of a thorn in our side too; we'd love some idea of whether something like @memory's change could be merged upstream. :)
Hi @qingling128 and welcome to this extremely fun ticket! :) Let me know if I can offer any other details here.
Is there any news on this?
Coming back to this after a long absence, I think several of us (myself very much included) got stuck in a rabbit hole when there is a much simpler solution available for most cases.
With only the code change to targets/cache.go in #159 (and none of the other changes around a magic prefix for recorded metrics), and for at least the Prometheus recording rules that produce GAUGE and CUMULATIVE metrics, it is possible for the sidecar to forward them with only configuration changes.
For example a recording rule like:
groups:
  - name: service_aggregations
    rules:
      - record: job:response_total:sum
        expr: sum(response_total{namespace="default", direction="inbound"}) by (app, instance, job, classification, status_code)
Forwarding job:response_total:sum to Stackdriver works as long as you use metric_renames to rename the metric so it has no colons, and use static_metadata to (if necessary) fix up the type and value:
metric_renames:
  - from: job:response_total:sum
    to: recorded_job_response_total_cumulative
static_metadata:
  - metric: recorded_job_response_total_cumulative
    type: counter
    value_type: double
    help: aggregated response totals to each service by job and response code
I have not yet tested to see if this would work for prometheus histograms / stackdriver distributions.
It's a little unclear to me if this was something that would have worked all along and we were all too deep in the weeds to see it, or if something in between the 0.5.0 and 0.8.1 releases made this possible. In any case I think the main need at this point is the fix to targets/cache and some explicit documentation around the case of recorded metrics: https://github.com/Stackdriver/stackdriver-prometheus-sidecar/pull/266
(As a small followup: exporting recorded histograms does not seem to currently be possible. The current code assumes that all recorded metrics are gauges, and attempting to pin the value/value_type as counter/distribution produces an error that distribution is not a supported value type.)
With the merge of #266 I believe this bug can, at long last, be closed.
I have the following recording rule:
The rule is evaluated and recorded correctly:
I also have the following config file:
No metric appears in Stackdriver, and there are no errors in the stackdriver-prometheus-sidecar logs. This is just one example of a recorded metric; all other recorded metrics are also not sent to Stackdriver. Raw metrics are exported to Stackdriver as they should be.