GoogleCloudPlatform / prometheus-engine

Google Cloud Managed Service for Prometheus libraries and manifests.
https://g.co/cloud/managedprometheus
Apache License 2.0

One or more TimeSeries could not be written: Points must be written in order. #851

Closed. andreasmh closed this issue 8 months ago.

andreasmh commented 8 months ago

We are getting the error:

InvalidArgument desc = One or more TimeSeries could not be written: Points must be written in order. One or more of the points specified had an older start time than the most recent point.

I saw this was already reported in https://github.com/GoogleCloudPlatform/prometheus-engine/issues/814, but that issue was closed, so I decided to create a new one.

We are seeing the same issue. We have gone through the troubleshooting guide and verified the points it covers.

Most, if not all, errors show:

error details: name = Unknown  desc = total_point_count:200  success_point_count:199  errors:{status:{code:3}  point_count:1}

Does this mean that 199 metric points were successfully added? And only 1 failed? Could there be some kind of "off-by-one" error in the collection/scraping?

If we successfully added 199 points and only failed on one (which had already been added previously?), should it really be logged as an error?

I tried looking at the metrics in question, and I'm not seeing any gaps.

Can I safely ignore these errors, or is it something I should be concerned about?

Collector version: v2.41.0-gmp.7-gke.0
Cluster version: 1.27.8-gke.1067004

pintohutch commented 8 months ago

Hi @andreasmh,

I just looked at another issue where this was discussed: https://github.com/GoogleCloudPlatform/prometheus-engine/issues/823#issuecomment-1949347619. Do you have any similar relabeling logic?
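
For reference, metric relabeling lives under each PodMonitoring endpoint. A hypothetical rule of the kind that can cause this, sketched against the PodMonitoring relabeling fields and not taken from #823, would look something like:

apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: example-relabeling   # hypothetical name, for illustration only
spec:
  selector:
    matchLabels:
      app: example
  endpoints:
    - port: metrics
      interval: 30s
      metricRelabeling:
        # Overwriting a distinguishing label with a constant collapses
        # otherwise distinct series into one, which can produce
        # out-of-order write errors on the GMP side.
        - action: replace
          sourceLabels: [response_code]
          targetLabel: response_code
          replacement: all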

andreasmh commented 8 months ago

No, nothing similar to that. Most of the scrapes that error are tied to Istio in some way.

For example, our scrape config for the ingressgateways looks like this:

apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: monitoring-istio-data-plane-ingressgateway
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  endpoints:
    - port: http-envoy-prom
      scheme: http
      interval: 30s
      path: /stats/prometheus
  targetLabels:
    fromPod:
      - from: app
        to: app
      - from: istio
        to: istio

All the errors complain about only 1-3 failed points. If we had a duplicate scrape configuration, wouldn't we be seeing a lot more failed points?

Here is an example log with some relevant values (but most labels removed):

"caller":"export.go:940", "component":"gcm_exporter", "err":"rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Points must be written in order. One or more of the points specified had an older start time than the most recent point.: prometheus_target{cluster:main,location:europe-north1,namespace:istio-system,instance:istio-ingressgateway-577456d945-nrpk7:http-envoy-prom,job:monitoring-istio-data-plane-ingressgateway} timeSeries[0-199]: prometheus.googleapis.com/istio_requests_total/counter{pod:istio-ingressgateway-577456d945-nrpk7,app:istio-ingressgateway,response_code:401,source_app:istio-ingressgateway,container:istio-proxy}
error details: name = Unknown  desc = total_point_count:200  success_point_count:199  errors:{status:{code:3}  point_count:1}", "level":"error", "msg":"send batch", "size":200, "ts":"2024-02-19T07:42:03.550Z"}

pintohutch commented 8 months ago

It really depends. Do the errors consistently report the istio_requests_total metric in particular? Or are there other metrics? Also, is it for these same labels (i.e. same instance, job, response_code, etc)?

And is this consistently happening? Or only transiently (e.g. during pod restarts or K8s upgrades)?

Since you're using a PodMonitoring and no metric relabeling, I don't think there's anything fishy with the scrape configuration itself.

FYI the targetLabels.fromPod.to fields are optional if you're not renaming them to a different label.
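
For example, the targetLabels block above could be shortened to the following, which keeps the pod labels under their original names:

  targetLabels:
    fromPod:
      - from: app
      - from: istio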

lyanco commented 8 months ago

We've seen some issues with istio-envoy before, but they have to do with the client/exporter, not with GMP.

IIRC it has to do with the exporter itself generating duplicate time series because "destination_workload" and "source_workload" are both set to "unknown" when traffic comes from outside the cluster. That doesn't look to be the exact case here, but it is likely something similar.

Not only does the combination of those six target fields (the prometheus_target labels in the error above) need to be unique, every time series also needs a unique combination of metric label values. If the exporter emits multiple time series with the same label values, it will almost certainly result in duplicate time series.

Unfortunately, there's no real way to diagnose this besides going to the relevant /metrics endpoint and looking for duplicate time series. Copy-pasting the labels from the error message and searching for them in the /metrics output will likely show that some time series are duplicated because these label values are hardcoded.
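
As an illustration of what a duplicate would look like on the /metrics endpoint (the label names and values below are made up for the example), you are looking for two exposition lines that share an identical metric name and label set:

istio_requests_total{source_workload="unknown",destination_workload="unknown",response_code="401"} 17
istio_requests_total{source_workload="unknown",destination_workload="unknown",response_code="401"} 23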

pintohutch commented 8 months ago

Closing for now given Lee's comment.