Stackdriver / stackdriver-prometheus-sidecar

A sidecar for the Prometheus server that can send metrics to Stackdriver.
https://cloud.google.com/monitoring/kubernetes-engine/prometheus
Apache License 2.0

Repeated failure messages, large GCP metrics billing #223

Open G-Goldstein opened 4 years ago

G-Goldstein commented 4 years ago

We've been running Prometheus for a few months now, but with only two metrics: about a dozen times a day we run a process and record in Prometheus the time it took, along with a final integer gauge of how many items it worked on. I've been using Prometheus in Kubernetes to record this information, with Grafana in front of it to view the data, so I know that part works.

On Wednesday I added this Stackdriver Prometheus Sidecar following the kube/patch.sh shell script. And it worked - I can now see my Gauge in Stackdriver, and the values are correct. But the sidecar has been spitting out the same log message repeatedly, sometimes multiple times a millisecond, ever since it started up. They all say:

"level=warn ts=2020-03-06T08:23:18.737Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: external.googleapis.com/prometheus/up, Timestamps: {Youngest Existing: '2020/03/06-00:23:13.168', New: '2020/03/06-00:23:16.394'}}: timeSeries[0-95]"

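To make the complaint in that message concrete, here is a small Python sketch that pulls the two timestamps out of the error text and computes how far apart the rejected point is from the existing one. The timestamp layout is inferred from the log line itself, and the helper name `gap_seconds` is made up for illustration; nothing here is part of the sidecar.

```python
import re
from datetime import datetime

# Timestamp pair copied from the error message above.
ERR = ("Timestamps: {Youngest Existing: '2020/03/06-00:23:13.168', "
       "New: '2020/03/06-00:23:16.394'}")

# Layout inferred from the log text (an assumption): YYYY/MM/DD-HH:MM:SS.mmm
FMT = "%Y/%m/%d-%H:%M:%S.%f"


def gap_seconds(message: str) -> float:
    """Return the seconds between the existing point and the rejected new one."""
    m = re.search(r"Youngest Existing: '([^']+)', New: '([^']+)'", message)
    if m is None:
        raise ValueError("no timestamp pair found")
    existing, new = (datetime.strptime(t, FMT) for t in m.groups())
    return (new - existing).total_seconds()


print(f"{gap_seconds(ERR):.3f} s between points")  # prints "3.226 s between points"
```

So in this instance the new point for `up` arrived roughly three seconds after the previous one, which is what the "written more frequently than the maximum sampling period" wording is objecting to.
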
What's more, we've been charged £285 for "Metric Volume" by GCP in the last two days, up from 0 before that, which is more than all of our other monthly costs combined and seems excessive for the two metrics I'm currently requesting. So I think it's trying to do too much. I do know that Prometheus on Kubernetes by default collects all the machine diagnostics - should I be disabling that somehow?

On the point of the metrics I'm after: although each should give a data point about a dozen times a day, I can't guarantee that they won't produce data more than once a minute, and staying under that rate seems to be a requirement for Stackdriver. I was hoping that this sidecar would do the aggregation that I can't easily do in my process. Am I using it wrong?
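
For reference, the kind of aggregation I have in mind could look roughly like the Python sketch below: collapse gauge samples so that at most one point per window is emitted, keeping the last value seen in each window. The 60-second window is only my reading of the limit, not a confirmed figure, and `downsample_last` is a hypothetical helper, not something the sidecar provides.

```python
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, gauge value)


def downsample_last(samples: Iterable[Sample], period_s: int = 60) -> List[Sample]:
    """Keep one point per window: the last value seen, stamped at the window end.

    period_s=60 is an assumed limit, not a documented one.
    """
    last_in_window: Dict[int, float] = {}
    for ts, value in sorted(samples):
        # Later samples in the same window overwrite earlier ones.
        last_in_window[int(ts // period_s)] = value
    # Stamping each point at the end of its window keeps emitted points
    # at least period_s apart.
    return [((k + 1) * period_s, v) for k, v in sorted(last_in_window.items())]


# Two samples 10 s apart collapse into one point; the third lands in the next window.
print(downsample_last([(0, 1), (10, 2), (65, 3)]))  # [(60, 2), (120, 3)]
```

Keeping the last value suits a gauge; for the duration metric a different reduction (such as the maximum or the mean per window) might be more appropriate.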

brunodomenici commented 4 years ago

I have a similar situation here...