Hi @andreasmh,
I just looked at another issue, discussed in https://github.com/GoogleCloudPlatform/prometheus-engine/issues/823#issuecomment-1949347619. Do you have any similar relabeling logic?
No, not similar to that. Most of the scrapes that error are tied to Istio in some way.
For example, our scrape config for the ingressgateways looks like this:
```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: monitoring-istio-data-plane-ingressgateway
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  endpoints:
  - port: http-envoy-prom
    scheme: http
    interval: 30s
    path: /stats/prometheus
  targetLabels:
    fromPod:
    - from: app
      to: app
    - from: istio
      to: istio
```
All the errors complain about only 1-3 failed points. If we had a duplicate scrape configuration, wouldn't we be seeing a lot more failed points?
Here is an example log with some relevant values (but most labels removed):
"caller":"export.go:940", "component":"gcm_exporter", "err":"rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Points must be written in order. One or more of the points specified had an older start time than the most recent point.: prometheus_target{cluster:main,location:europe-north1,namespace:istio-system,instance:istio-ingressgateway-577456d945-nrpk7:http-envoy-prom,job:monitoring-istio-data-plane-ingressgateway} timeSeries[0-199]: prometheus.googleapis.com/istio_requests_total/counter{pod:istio-ingressgateway-577456d945-nrpk7,app:istio-ingressgateway,response_code:401,source_app:istio-ingressgateway,container:istio-proxy}
error details: name = Unknown desc = total_point_count:200 success_point_count:199 errors:{status:{code:3} point_count:1}", "level":"error", "msg":"send batch", "size":200, "ts":"2024-02-19T07:42:03.550Z"}
It really depends. Do the errors consistently report the `istio_requests_total` metric in particular, or are there other metrics? Also, is it for these same labels (i.e. same `instance`, `job`, `response_code`, etc.)?
And is this consistently happening? Or only transiently (e.g. during pod restarts or K8s upgrades)?
Since you're using a `PodMonitoring` and no metric relabeling, I don't think there's anything fishy with the scrape configuration itself.
FYI, the `targetLabels.fromPod.to` fields are optional if you're not renaming the labels to different target labels.
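A minimal sketch of what that would look like, assuming the rest of the manifest above stays the same (omitted `to:` keys default to the `from:` label name):

```yaml
  # spec-level targetLabels, equivalent to the version in the config above
  targetLabels:
    fromPod:
    - from: app   # copied to the target label "app" by default
    - from: istio # copied to the target label "istio" by default
```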
We've seen some issues with istio-envoy before, but they have to do with the client/exporter, not with GMP.
IIRC, it's the exporter itself generating duplicate time series because `destination_workload` and `source_workload` are both set to `unknown` when traffic comes from outside the cluster. That doesn't look to be the exact case here, but it is likely something similar.
Not only do those six fields need to be unique; every time series needs to have a unique combination of label values. If the exporter emits several series with the same label values, that will almost certainly result in duplicate time series.
Unfortunately, there's no real way to diagnose this besides going to the relevant `/metrics` endpoint and looking for duplicate time series. Copy-pasting the labels from the error message and searching for them in the `/metrics` output will likely show you that some time series are duplicated because these label values are hardcoded.
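If the duplicates do turn out to come from Envoy series whose labels are hardcoded (e.g. both `source_workload` and `destination_workload` set to `unknown`, as in the Istio case described above), one possible workaround, shown here only as an untested sketch with assumed label names, is to drop those series at scrape time via `metricRelabeling` on the endpoint:

```yaml
  endpoints:
  - port: http-envoy-prom
    scheme: http
    interval: 30s
    path: /stats/prometheus
    metricRelabeling:
    # Assumption: the duplicate series come from traffic with both workload
    # labels set to "unknown"; drop them before they are written to GCM.
    - sourceLabels: [source_workload, destination_workload]
      regex: unknown;unknown   # sourceLabels are joined with ";" by default
      action: drop
```

Dropping data is of course a trade-off, so verifying the duplicates on the `/metrics` endpoint first is the safer starting point.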
Closing for now given Lee's comment.
We are getting the error:
I saw this already existed: https://github.com/GoogleCloudPlatform/prometheus-engine/issues/814, but it was closed, so I decided to create a new one.
We are seeing the same issue. We have gone through the troubleshooting guide and verified that:
Most, if not all, errors show:
Does this mean that 199 metric points were successfully added? And only 1 failed? Could there be some kind of "off-by-one" error in the collection/scraping?
If we successfully added 199 points and only failed on one (that had already been added previously?), should it really be an error log?
I tried looking at the metrics in question, and I'm not seeing any gaps.
Can I safely ignore these errors, or is it something I should be concerned about?
Collector version: v2.41.0-gmp.7-gke.0
Cluster version: 1.27.8-gke.1067004