Cassandra: monitoring - Githubissues

jazzl0ver commented 6 years ago

Would like to open discussion on this topic. Some suggestions:

Add Jolokia jar to C* containers. It allows to easily fetch metrics by HTTP requests
Make it configurable (at C service creation and C update-service) to auto-create CloudWatch alarms for the most important metrics (like latency and free disk space).
Create a new service to run TICK (https://github.com/influxdata/sandbox) stack (in a single ECS task), which will retrieve metrics from C* (Telegraf), feed them into InfluxDB, provide a ready-to-use dashboards for various metrics (Chronograf).

Please, share your thoughts.

http://jolokia.org/agent/jvm.html https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics/ http://cassandra.apache.org/doc/latest/operating/metrics.html

JuniusLuo commented 6 years ago

The Cassandra metrics could be collected using nodetool, JConsole, or JMX. See Cassandra Monitoring. The datadog blog you posted also mentioned the same methods. Datadog would also collect the Cassandra metrics using one of these ways.

The initial plan is to follow the standard way (nodetool) to get Cassandra metrics, and send the metrics/alert to CloudWatch Metrics/Alarms. We will not add any additional library to the Cassandra container. We could have a general policy-based framework. The framework allows the customer to customize the policy, such as the metric collecting interval, the metrics to collect, etc. The framework will schedule the task(s) accordingly. The task(s) will use nodetool to connect to Cassandra nodes to get metrics and send to CloudWatch. The task will end after that, so the task only consumes resource when it is running. Each service could define its own monitoring task.

We will define the standard metrics/alarm APIs, and have one implementation for CloudWatch. In the future, we could easily add the implementation for Azure/GCP and other implementation, which may use TICK.

jazzl0ver commented 6 years ago

Thanks for the detailed explanation of your point! I agree that injecting a side library does not sound very well, but:

Jolokia library is a wrapper for JMX counters. It just adds an option to collect them thru simple HTTP requests. No local java needed to query them. Thus it requires much less of resources for the service which will query C* metrics. And it's opensourced.
nodetool is a java app, so it starts slowly and unable to query multiple metrics at once.
nodetool's output should be parsed before injecting metrics into CloudWatch. How are you going to deal with that? Also the output might change slightly in next C* releases, which will require extra work for the parser adaptation.
I'm not sure DataStax uses nodetool to collect metrics for opsCenter, since it installs a proprietary agent on every C* node and might query it to get metrics.

Regarding the monitoring task. Do you really think it's a good idea to continuously start and stop it? For example, for 1 minute metrics collection interval, it seems to me it might be started just a moment later after it was stopped. Just b/c a lot of metrics must be collected, processed and injected into CloudWatch.

JuniusLuo commented 6 years ago

Good point! There are many existing monitoring solutions. We will explore the existing solutions first, and leverage the open source solutions as much as possible. We will only consider to build our own solution when we could not find the suitable solution.

Jolokia is a good tool. It actually supports the proxy mode. We could test to see if the proxy mode works for Cassandra.

Telegraf is a good project. It supports to get metrics for many services, and could send the metrics to CloudWatch. It may be a better framework than CollectD. This blog has a good comparison.

JuniusLuo commented 6 years ago

For the monitoring task, it would be ok to keep the monitoring task running. The monitoring service would be a better name than task. Assume the framework such as Telegraf only has a small memory footprint.

It would not be a problem to keep the task short as well. Collecting the metrics of one Cassandra or other service node will be fast, unless something happens. For example, Cassandra itself is stuck at gc. The metrics data will be small. The processing and sending to like CloudWatch would be fast as well. The monitoring collection/handling would not take more than a few seconds. But it requires more work for the scheduling framework. We could start with the long run monitoring service first.

JuniusLuo commented 6 years ago

It turns out adding Jolokia into Cassandra container is the simplest way. Telegraf is also supported. Monitoring Cassandra, Redis and ZooKeeper are supported. You could create a Telegraf service for the Cassandra service and see the metrics on CloudWatch. Please take a look and share your comments/suggestions.

Note: currently Cassandra Keyspaces and tables are not monitored. The system keyspaces introduces more than 1000 metrics. Further enhancements will be added to monitor the user keyspaces.

jazzl0ver commented 6 years ago

That's a great news!! What is the upgrade path? If possible, I wouldn't like to re-create our cassandra services.

Please, add Telegraf service creation tutorial to the Wiki.

And how to restrict Telegraf's container memory?

JuniusLuo commented 6 years ago

yep, upgrade will be supported for service created in 0.9.4 and 0.9.3.

Telegraf itself does not restrict the memory. We could leverage the container max memory/cpu limits. You could set the max-memory and max-cpuunits when creating the Telegraf service. This will set the max memory and cpu for the container. If Telegraf exceeds the max memory, container will be killed.

jazzl0ver commented 6 years ago

Could you please share the options for setting max-memory for Telegraf service creation command?

JuniusLuo commented 6 years ago

"max-memory" and "max-cpuunits"

JuniusLuo commented 6 years ago

Looks CLI help does not include these options. Will add it.

jazzl0ver commented 6 years ago

Have you updated the manage server and cli? Looks like the cli is still the old one:

-rwxrwxr-x junius/junius 7648808 2018-03-14 04:32 firecamp-service-cli

JuniusLuo commented 6 years ago

CLI does include the "max-memory" option. Just the help, such as firecamp-service-cli -op=create-service --help does not show the "max-memory" option. You could still use it.

jazzl0ver commented 6 years ago

I'm sorry - I meant the telegraf service absence:

# ./firecamp-service-cli -region=us-east-1 -cluster=firecamp-prod -op=create-service -help
Usage: firecamp-service-cli -op=create-service
...
  -service-type string
        The catalog service type: mongodb|postgresql|cassandra|zookeeper|kafka|kafkamanager|redis|couchdb|consul|elasticsearch|kibana|logstash
...

JuniusLuo commented 6 years ago

oops, uploaded the latest cli.

jazzl0ver commented 6 years ago

Works very well, thank you!

It would be great to have some storage metrics, like free/total space available
It would also be great to have some summarized metrics per cluster (like Latency, for example). Is that possible?

JuniusLuo commented 6 years ago

Unfortunately, cassandra does not provide them. For #1, cassandra storage load metrics provides "Total disk space used (in bytes) for this node", but not provide the free space for the node. In the later release, we could integrate with CloudWatch to create an alarm when the used space reaches some threshold of the total data volume size. For #2, cassandra only provides the per node metrics. not aggregate all nodes. You could easily create the dashboard for the "cassandraClientRequest_Latency_Mean" for all nodes. This would be enough?

jazzl0ver commented 6 years ago

Yeah, that's a good solution, thank you! I'll create a separate issue for CloudWatch alarm on the used space.

jazzl0ver commented 6 years ago

Are you aware why some metrics are not available? For example, Streaming metrics (http://cassandra.apache.org/doc/latest/operating/metrics.html)

JuniusLuo commented 6 years ago

Yes, not all metrics are monitored, such as Streaming metrics, CQL metrics, DroppedMessage metrics, etc. If you think some metrics is important and want to add, please let us know. Thanks.

jazzl0ver commented 6 years ago

Well, I think the aggregation of metrics across all keyspaces and tables are good to have. Streaming and DroppedMessage metrics seem also important.

JuniusLuo commented 6 years ago

Sounds good. We could add these 3 metrics.

jazzl0ver commented 6 years ago

It would be great to have an option to update the list of currently fetched metrics according to one's requirements. For example, at the moment around 100 metrics per node are fetched from Cassandra while I need just a few.

JuniusLuo commented 6 years ago

There are lots of Cassandra metrics. How do you want to configure it?

I am not sure if this is really necessary. Collecting 100 metrics per node would not impact Cassandra, as metric data is very small. If you only care a few, you could easily filter them on CloudWatch. It would be better to collect the important metrics. When something goes wrong, we may get some hints from the metrics.

jazzl0ver commented 6 years ago

It's not about impacting C*, it's about money: custom metrics cost ($0.30 x 100 = $30 per node per month) and it's not very wise to pay for the things you don't really need. I thought we could have a file with the list of metrics (one per line) that could be uploaded to the Telegraf service thru a firecamp-manager-cli update call. It would replace the current metrics list with the new one. Another "get" call might return the current list of metrics.

JuniusLuo commented 6 years ago

I see. This makes sense. CloudWatch is not cheap.

JuniusLuo commented 6 years ago

The commit was in. You could put all the custom metrics in one file, and pass the file "-tel-metrics-file=pathtofile" when creating the service.

please pay attention to the data format in the metrics file. Each line includes one metric. Every metric should have the quotation marks and end with comma. The last metri should not end with comma. Example:

    "/org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency",
    "/org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency",
    "/org.apache.cassandra.metrics:type=Storage,name=Load"

JuniusLuo commented 6 years ago

We just published 0.9.5 release, which supports Telegraf. You could try the latest firecamp quickstart.

If you have a cassandra service in 0.9.4 release, you could follow the upgrade guide to upgrade the cluster. While, there is one limit that you will have to stop all services first before upgrade. The upgrade will take around 10 minutes. Upgrade will be further enhanced in the next release.

jazzl0ver commented 6 years ago

That's great! Thanks for the implementation as well as for the upgrade feature! If I'm running the "latest" release, what are my steps to upgrade correctly?

JuniusLuo commented 6 years ago

Upgrade is not supported for the "latest" release. There is no way to know what needs to be upgraded between commits of the latest release.

cloudstax commented 6 years ago

The custom metrics is supported for Cassandra. Close this issue.

cloudstax / firecamp

Cassandra: monitoring #39