Query issues for "large" clusters

poblahblahblah commented 6 years ago

Hello,

We have a ~540-node Cassandra cluster in which each node exports ~1500 metrics. That adds up to over 800k time series in the cassandra_stats metric namespace. This causes a lot of issues when querying Prometheus, because the index gets hit so hard. Recording rules are definitely an option, but we don't always know in advance which aggregations will need one.
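As a rough sanity check on those figures, here is a minimal sketch of the arithmetic. The variable names are only illustrative, and both inputs are the approximate values quoted above, not measured counts.

```python
# Rough series-count estimate from the approximate figures quoted above.
nodes = 540               # approximate node count
metrics_per_node = 1500   # approximate metrics exported per node

# One time series per (node, metric) pair.
total_series = nodes * metrics_per_node
print(f"{total_series:,} series")  # prints "810,000 series"
```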

Is there a workaround for this in the current code base? If not, would you be open to exploring a change with us?

erebe commented 6 years ago

Hello,

Sadly, there is no magic bullet that I am aware of for scaling out Cassandra metrics. Before running Prometheus, we also had trouble fitting everything into the Graphite TSDB (~130 Cassandra nodes).

AFAIK, the Prometheus index is built on labels, not just on the namespace, so I can't think of much I could improve in the code. If you have an idea or think otherwise, feel free to say so; I am listening.

Here is what I can suggest:

Reduce the metrics you export to the strict minimum. The caveat is that when you want to drill down on an issue, you will need to get VisualVM up and running and connect to the nodes.
Create recording rules (these helped us a lot) to pre-aggregate high-level metrics. You will still hit slow queries when investigating per node.
Use Prometheus federation to split the scraping over multiple Prometheus instances and aggregate high-level metrics into a global one. We use this solution because it keeps high-level metrics fast on the global Prometheus, while still letting you drill down per node and get the time series by querying the local Prometheus. It is the best of the three options, but at the expense of more machines to run your Prometheus stack. We monitor ~550 machines (including other services) this way in our biggest datacenter.
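To illustrate the federation option, here is a minimal sketch of the request a global Prometheus would make against a local one. The `/federate` endpoint and its `match[]` parameter are part of Prometheus. The host name `prometheus-local-1` and the `cluster:` recording-rule prefix are assumptions made up for this example, as is the helper `federate_url`. In a real setup this request is normally configured as a scrape job on the global instance, not built by hand.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build a /federate URL that pulls only the series matching the selectors."""
    # Each selector becomes one repeated match[] query parameter.
    query = urlencode([("match[]", s) for s in selectors])
    return f"{base_url}/federate?{query}"


# Hypothetical local instance and recording-rule prefix (assumptions).
url = federate_url(
    "http://prometheus-local-1:9090",
    ['{__name__=~"cluster:.*"}'],  # only pre-aggregated, high-level series
)
print(url)
```

Restricting the selector to the pre-aggregated series is what keeps the global instance small. The per-node series stay on the local instances and are queried there when drilling down.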

If you think of other solutions, feel free to share them.

erebe commented 6 years ago

Feel free to re-open if needed.

criteo / cassandra_exporter

Query issues for "large" clusters #21