centerforaisafety / cerberus-cluster

HPC cluster code and configurations for running on OCI
Universal Permissive License v1.0

thanos timing receiver issue #199

Closed ghost closed 1 week ago

ghost commented 9 months ago

Thanos's eventually consistent write strategy is great and all, but our instance is not receiving all the data that the Prometheus node is pushing to it.

We need to figure out why this is happening and fix it, either within Thanos or by setting up Prometheus with more than one main instance scraping metrics from instances and containers.
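One possible way to confirm which samples are going missing is to pull the raw samples for the same series from both Prometheus and Thanos and diff the timestamps. The sketch below is only an illustration, not something taken from this repo: it assumes both services expose the standard Prometheus HTTP API (`/api/v1/query`), and the function names, URLs, and tolerance are hypothetical placeholders.

```python
import bisect
import json
import urllib.parse
import urllib.request


def fetch_raw_timestamps(base_url, selector, window, at_time):
    """Return sorted raw sample timestamps for the first series matching `selector`.

    An instant query with a range selector (e.g. up[1h]) returns the stored
    samples themselves rather than step-aligned, lookback-filled values.
    """
    params = urllib.parse.urlencode(
        {"query": f"{selector}[{window}]", "time": at_time}
    )
    url = f"{base_url}/api/v1/query?{params}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        body = json.load(resp)
    result = body["data"]["result"]
    if not result:
        return []
    return sorted(float(ts) for ts, _ in result[0]["values"])


def missing_samples(source_ts, replica_ts, tolerance=0.5):
    """Return timestamps from source_ts with no match in replica_ts within tolerance seconds."""
    replica_sorted = sorted(replica_ts)
    missing = []
    for ts in source_ts:
        # Index of the first replica timestamp >= ts - tolerance.
        i = bisect.bisect_left(replica_sorted, ts - tolerance)
        if i == len(replica_sorted) or replica_sorted[i] > ts + tolerance:
            missing.append(ts)
    return missing


# Example usage (hypothetical URLs, selector, and evaluation time; adjust to the real deployment):
# prom = fetch_raw_timestamps("http://prometheus.example:9090", 'up{job="node"}', "1h", 1700000000)
# thanos = fetch_raw_timestamps("http://thanos-query.example:10902", 'up{job="node"}', "1h", 1700000000)
# print(missing_samples(prom, thanos))
```

If the missing timestamps cluster around periods of high load, that would point at the receiver dropping or rejecting writes rather than at the Prometheus side.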

ghost commented 6 months ago

It looks like Thanos is starved for resources. We will need to change its resource requests so that it gets more when it bursts.