criteo / graphite-remote-adapter

Fully featured graphite remote adapter for Prometheus
Apache License 2.0

Queue full #69

Open razumv opened 4 years ago

razumv commented 4 years ago

Hello, I have a problem with the queue in Prometheus + graphite-remote-adapter:

```
level=warn ts=2019-12-10T08:31:54.018127762Z caller=queue_manager.go:230 component=remote queue="0:http://***/write?graphite.default-prefix=kube_poly_ " msg="Remote storage queue full, discarding sample. Multiple subsequent messages of this kind may be suppressed."
```

The Prometheus and adapter configs are the defaults; only about 10% of the metrics from 70 machines get through.
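(That warning is emitted by Prometheus's remote-write queue manager when a shard's buffer is full. Purely as a sketch, and only as a mitigation for bursts since it does not make a slow downstream faster, the buffer can be enlarged via queue_config in the remote_write block; the URL and values below are placeholders:)

```yaml
remote_write:
  - url: "http://<adapter>:9201/write?graphite.default-prefix=kube_poly_"
    queue_config:
      capacity: 10000    # samples buffered per shard before Prometheus starts discarding (placeholder)
      max_shards: 100    # upper bound on how far Prometheus will reshard upwards (placeholder)
```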

InformatiQ commented 4 years ago

When graphite-remote-adapter is unable to send metrics to Graphite for any reason, its internal queue fills up and it stops accepting metrics. Prometheus will then start dropping samples so that its own queue doesn't fill up. You might want to add more instances of graphite-remote-adapter to support the load. What we do is run many graphite-remote-adapters behind an LB to make it easy to scale as needed.
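(With that setup, the Prometheus side only needs to point at the load balancer; a minimal sketch, where the LB hostname and prefix are placeholders:)

```yaml
remote_write:
  # All graphite-remote-adapter instances sit behind this LB address,
  # so scaling out only means adding adapter instances to the LB pool.
  - url: "http://graphite-remote-adapter-lb:9201/write?graphite.default-prefix=my_prefix."
```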

razumv commented 4 years ago

For example, right now:

```
remote_adapter_sent_batch_duration_seconds_sum{} 127918.57609338008
remote_adapter_sent_batch_duration_seconds_count{} 152966
remote_adapter_sent_samples_total{} 14458568
```

Does it turn out I only got 152k out of 14.4 million data points?
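(Note that the `_count` counter presumably counts batch send operations rather than individual samples, so the two numbers are not directly comparable; a rough PromQL sketch relating them, assuming the counters above behave as standard Prometheus counters:)

```
# Average number of samples in each batch the adapter sends
rate(remote_adapter_sent_samples_total[5m])
  / rate(remote_adapter_sent_batch_duration_seconds_count[5m])
```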

I now use one remote address for all my adapters (5 of them).

OK, this is my conf

Prometheus:

remote_write:

My prometheus log:

```
level=info ts=2019-12-12T06:53:29.923422835Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=1 to=17
level=info ts=2019-12-12T06:53:39.923304302Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=17 to=35
level=info ts=2019-12-12T06:53:49.923441611Z caller=queue_manager.go:343 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Currently resharding, skipping."
level=info ts=2019-12-12T06:53:59.923354973Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=35 to=98
level=info ts=2019-12-12T06:54:19.923450376Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=98 to=153
level=info ts=2019-12-12T06:54:29.92329724Z caller=queue_manager.go:343 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Currently resharding, skipping."
level=info ts=2019-12-12T06:54:39.923358133Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=153 to=274
level=info ts=2019-12-12T06:57:29.923270195Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=274 to=165
level=info ts=2019-12-12T07:00:19.923636921Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=165 to=106
level=info ts=2019-12-12T07:00:39.92329415Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=106 to=152
level=info ts=2019-12-12T07:03:19.923396969Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=152 to=98
```

This happens periodically in the prometheus log:

```
level=warn ts=2019-12-12T08:17:47.125876374Z caller=queue_manager.go:531 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Error sending samples to remote storage" count=100 err="context deadline exceeded"
level=warn ts=2019-12-12T08:17:47.3388834Z caller=queue_manager.go:531 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Error sending samples to remote storage" count=100 err="context deadline exceeded"
level=warn ts=2019-12-12T08:17:48.171610956Z caller=queue_manager.go:531 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Error sending samples to remote storage" count=100 err="context deadline exceeded"
level=warn ts=2019-12-12T08:17:48.181891084Z caller=queue_manager.go:531 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Error sending samples to remote storage" count=100 err="context deadline exceeded"
```

In the adapter log:

```
{"caller":"write.go:167","component":"web","err":"request context cancelled","level":"warn","msg":"Error sending samples to remote storage","num_samples":100,"storage":"graphite","ts":"2019-12-12T08:17:48.434Z"}
{"caller":"write.go:167","component":"web","err":"request context cancelled","level":"warn","msg":"Error sending samples to remote storage","num_samples":100,"storage":"graphite","ts":"2019-12-12T08:17:48.532Z"}
{"caller":"write.go:167","component":"web","err":"request context cancelled","level":"warn","msg":"Error sending samples to remote storage","num_samples":100,"storage":"graphite","ts":"2019-12-12T08:17:48.534Z"}
```
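("context deadline exceeded" on the Prometheus side means the write request did not finish within remote_timeout, and "request context cancelled" on the adapter side is the matching symptom. Purely as a sketch with placeholder values, the timeout and batching can be tuned in the remote_write block:)

```yaml
remote_write:
  - url: "http://GRA:9201/write?graphite.default-prefix=___"
    remote_timeout: 60s          # default is 30s; give slow sends more time (placeholder value)
    queue_config:
      max_shards: 200            # cap the resharding seen in the log above (placeholder value)
      max_samples_per_send: 500  # bigger batches, fewer requests (placeholder value)
```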

InformatiQ commented 4 years ago

Are you sure Graphite is not having any issues? It could be slow at ingesting the samples. What is the CPU/memory usage like on the GRA instances?

razumv commented 4 years ago

The adapter instances are not limited in resources right now. There are 5 of them; each consumes about 0.1 CPU cores and 700 MB of RAM. Graphite was deployed via docker-compose.

InformatiQ commented 4 years ago

What does the resource usage of Graphite look like? Any errors on the Graphite side?

razumv commented 4 years ago

Graphite is about 40% loaded; today we'll deploy it as a cluster and try writing to that. But could it be that the first batch of metrics doesn't have time to go out before the next one starts?
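(One way to check whether batches are piling up is to look at how long the adapter spends sending each one; a rough sketch using the counters quoted earlier, assuming they behave as standard Prometheus counters:)

```
# Average seconds spent per batch send; if this grows towards the
# Prometheus remote_timeout, sends start timing out and queues back up.
rate(remote_adapter_sent_batch_duration_seconds_sum[5m])
  / rate(remote_adapter_sent_batch_duration_seconds_count[5m])
```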