Stackdriver / stackdriver-prometheus-sidecar

A sidecar for the Prometheus server that can send metrics to Stackdriver.
https://cloud.google.com/monitoring/kubernetes-engine/prometheus
Apache License 2.0

One or more points were written more frequently than the maximum sampling ... #224

Open dopey opened 4 years ago

dopey commented 4 years ago

I just set up Stackdriver with Prometheus in my Kubernetes cluster. I have debug-level logging turned on for the sidecar (it was very useful for debugging all the permission issues I ran into). Now that the flow is working and sending data to Stackdriver, I'm noticing a lot of debug-level messages in the logs that look like:

level=debug ts=2020-03-09T22:32:56.082Z caller=client.go:202 component=storage msg="Partial failure calling CreateTimeSeries" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: external.googleapis.com/prometheus/scrape_samples_post_metric_relabeling, Timestamps: {Youngest Existing: '2020/03/09-15:32:38.215', New: '2020/03/09-15:32:42.308'}}: timeSeries[0-31]"

It's not just that specific metric, but many, many others. Any guidance on why this might be happening and how I can configure things to fix it (assuming this is in fact a problem)?
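
If it helps narrow things down, I assume the write frequency is driven by the Prometheus scrape interval, since the sidecar forwards the scraped samples to Stackdriver. Something along the lines of the global section of prometheus.yml, as sketched below (the value is only an illustration, not my exact config):

```yaml
# prometheus.yml (illustrative only, not my actual config)
global:
  # Each scrape produces one sample per series, and the sidecar forwards
  # those samples to Stackdriver, so a short interval means frequent writes.
  scrape_interval: 15s
```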

jcortejoso commented 4 years ago

I am getting the same error, but it seems to happen when sending some series to an existing metric:

level=warn ts=2020-05-14T09:01:32.691Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: timeSeries[0-73]"

I do not know how to debug or determine which metrics are throwing these errors. It seems like the scrape_interval plays a major role in the issue: with a scrape_interval lower than 60s the error appears very frequently, with 60s the errors appear sometimes, and with a scrape_interval of 120s the error does not appear at all. The current quota limit for TimeSeries is 1 point per metric every 10 seconds, but it seems that in my project the limit is still the old quota of 1 point per metric every 60 seconds.
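
For now, the only mitigation I have found is raising the scrape interval; a minimal sketch of what I mean (assuming the interval is set in the global section and not overridden per job):

```yaml
# prometheus.yml - mitigation sketch (values reflect my observations above)
global:
  # below 60s: errors appear very frequently
  # 60s:       errors appear sometimes
  # 120s:      no errors so far
  scrape_interval: 120s
```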

I would appreciate any help/progress on this issue.

StevenYCChou commented 4 years ago

@jcortejoso - I think the metric descriptor may still show up in other lines of the debug logs; in the log you provided, the message may be truncated. I have noticed before that some lines in the debug logs include the metric name and some don't.

It may be that the logs are truncated, but it may also be a bug causing the metric name not to be logged.

See the debug log from dopey above, which has the actual metric name external.googleapis.com/prometheus/scrape_samples_post_metric_relabeling; the rest of the debug information is similar to your logs.