Cassandra metrics can be collected using nodetool, JConsole, or JMX; see Cassandra Monitoring. The Datadog blog you posted mentions the same methods, and Datadog also collects Cassandra metrics in one of these ways.
The initial plan is to follow the standard way (nodetool) to get the Cassandra metrics and send the metrics/alerts to CloudWatch Metrics/Alarms. We will not add any additional library to the Cassandra container. We could have a general policy-based framework that allows the customer to customize the policy, such as the metrics collection interval, the metrics to collect, etc. The framework will schedule the task(s) accordingly. The task(s) will use nodetool to connect to the Cassandra nodes, get the metrics, and send them to CloudWatch. The task will exit after that, so it only consumes resources while it is running. Each service could define its own monitoring task.
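As a rough illustration only (the exact nodetool output format varies by Cassandra version, and the CloudWatch namespace and dimension below are just placeholders), such a task could be as simple as:

# collect the heap usage from nodetool and push it to CloudWatch
HEAP_MB=$(nodetool info | awk -F: '/^Heap Memory/ {split($2, a, "/"); gsub(/ /, "", a[1]); print a[1]}')
aws cloudwatch put-metric-data --namespace "FireCamp/Cassandra" \
  --metric-name HeapMemoryUsedMB --unit Megabytes --value "$HEAP_MB" \
  --dimensions Node="$(hostname)"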
We will define the standard metrics/alarm APIs, and have one implementation for CloudWatch. In the future, we could easily add the implementation for Azure/GCP and other implementation, which may use TICK.
Thanks for the detailed explanation of your point! I agree that injecting a side library does not sound great, but:
Regarding the monitoring task: do you really think it's a good idea to continuously start and stop it? For example, with a 1-minute metrics collection interval, it seems to me the task might be started just a moment after it was stopped, simply because a lot of metrics must be collected, processed, and injected into CloudWatch.
Good point! There are many existing monitoring solutions. We will explore the existing solutions first and leverage open-source solutions as much as possible. We will only consider building our own solution if we cannot find a suitable one.
Jolokia is a good tool. It actually supports the proxy mode. We could test whether the proxy mode works for Cassandra.
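For example, once the Jolokia JVM agent is attached, a Cassandra MBean can be read over plain HTTP (assuming the default agent port 8778; the actual port may be configured differently):

curl http://<cassandra-node>:8778/jolokia/read/org.apache.cassandra.metrics:type=Storage,name=Load/Count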
Telegraf is a good project. It supports collecting metrics from many services and can send them to CloudWatch. It may be a better framework than collectd. This blog has a good comparison.
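A minimal Telegraf configuration sketch for this setup could look like the following (the MBean, region, and namespace are examples here, not the actual generated config):

[[inputs.jolokia2_agent]]
  urls = ["http://localhost:8778/jolokia"]

  [[inputs.jolokia2_agent.metric]]
    name  = "cassandraStorage"
    mbean = "org.apache.cassandra.metrics:type=Storage,name=Load"

[[outputs.cloudwatch]]
  region    = "us-east-1"
  namespace = "FireCamp/Cassandra"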
For the monitoring task, it would be OK to keep it running; "monitoring service" would be a better name than "task". A framework such as Telegraf should have only a small memory footprint.
Keeping the task short-lived would not be a problem either. Collecting the metrics of one Cassandra (or other service) node will be fast, unless something is wrong, for example Cassandra itself is stuck in GC. The metrics data will be small, and processing it and sending it to CloudWatch will be fast as well. The whole collection/handling should not take more than a few seconds. But this requires more work in the scheduling framework, so we could start with the long-running monitoring service first.
It turns out that adding Jolokia to the Cassandra container is the simplest way. Telegraf is also supported. Monitoring Cassandra, Redis, and ZooKeeper is supported. You could create a Telegraf service for the Cassandra service and see the metrics on CloudWatch. Please take a look and share your comments/suggestions.
Note: currently Cassandra keyspaces and tables are not monitored; the system keyspaces alone introduce more than 1000 metrics. Further enhancements will be added to monitor the user keyspaces.
That's great news!! What is the upgrade path? If possible, I would not like to re-create our Cassandra services.
Please add a Telegraf service creation tutorial to the Wiki.
And how can Telegraf's container memory be restricted?
Yep, upgrade will be supported for services created in 0.9.4 and 0.9.3.
Telegraf itself does not restrict the memory, but we can leverage the container's max memory/CPU limits. You could set max-memory and max-cpuunits when creating the Telegraf service; this sets the max memory and CPU for the container. If Telegraf exceeds the max memory, the container will be killed.
Could you please share the options for setting max-memory in the Telegraf service creation command?
"max-memory" and "max-cpuunits"
Looks like the CLI help does not include these options. Will add it.
Have you updated the manage server and CLI? Looks like the CLI is still the old one:
-rwxrwxr-x junius/junius 7648808 2018-03-14 04:32 firecamp-service-cli
The CLI does include the "max-memory" option. It is just that the help, such as firecamp-service-cli -op=create-service --help, does not show it. You could still use it.
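For example (the -service-type=telegraf value and -service-name below are placeholders; check the CLI help for the exact flags and the memory unit):

./firecamp-service-cli -region=us-east-1 -cluster=firecamp-prod -op=create-service \
  -service-type=telegraf -service-name=mycas-tel \
  -max-memory=256 -max-cpuunits=256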
I'm sorry, I meant the absence of the telegraf service type:
# ./firecamp-service-cli -region=us-east-1 -cluster=firecamp-prod -op=create-service -help
Usage: firecamp-service-cli -op=create-service
...
-service-type string
The catalog service type: mongodb|postgresql|cassandra|zookeeper|kafka|kafkamanager|redis|couchdb|consul|elasticsearch|kibana|logstash
...
Oops, uploaded the latest CLI.
Works very well, thank you!
Unfortunately, Cassandra does not provide them. For #1, the Cassandra storage Load metric provides the "Total disk space used (in bytes) for this node", but not the free space on the node. In a later release, we could integrate with CloudWatch to create an alarm when the used space reaches some threshold of the total data volume size. For #2, Cassandra only provides per-node metrics, not an aggregate across all nodes. You could easily create a dashboard for "cassandraClientRequest_Latency_Mean" across all nodes. Would this be enough?
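If you want to set the alarm up yourself in the meantime, something like the following should work; note that the namespace, metric name, and dimension below are placeholders (use whatever names actually appear in your CloudWatch console), and the threshold is in bytes, chosen relative to your data volume size:

aws cloudwatch put-metric-alarm --alarm-name cassandra-node1-disk-used \
  --namespace "FireCamp/Cassandra" --metric-name cassandraStorage_Load \
  --dimensions Name=Node,Value=node1 \
  --statistic Maximum --period 300 --evaluation-periods 3 \
  --threshold 170000000000 --comparison-operator GreaterThanThreshold \
  --alarm-actions <your-sns-topic-arn>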
Yeah, that's a good solution, thank you! I'll create a separate issue for the CloudWatch alarm on the used space.
Do you know why some metrics are not available? For example, the Streaming metrics (http://cassandra.apache.org/doc/latest/operating/metrics.html).
Yes, not all metrics are monitored, such as the Streaming metrics, CQL metrics, DroppedMessage metrics, etc. If you think some metric is important and want it added, please let us know. Thanks.
Well, I think the aggregation of metrics across all keyspaces and tables is good to have. The Streaming and DroppedMessage metrics also seem important.
Sounds good. We could add these 3 metrics.
It would be great to have an option to update the list of currently fetched metrics according to one's requirements. For example, at the moment around 100 metrics per node are fetched from Cassandra, while I need just a few.
There are lots of Cassandra metrics. How do you want to configure it?
I am not sure this is really necessary. Collecting 100 metrics per node would not impact Cassandra, as the metric data is very small. If you only care about a few, you could easily filter them on CloudWatch. It would be better to collect the important metrics; when something goes wrong, we may get some hints from them.
It's not about impacting C*, it's about money: custom metrics cost ($0.30 x 100 = $30 per node per month), and it's not very wise to pay for things you don't really need. I thought we could have a file with the list of metrics (one per line) that could be uploaded to the Telegraf service through a firecamp-manager-cli update call. It would replace the current metrics list with the new one. Another "get" call might return the current list of metrics.
I see. This makes sense. CloudWatch is not cheap.
The commit is in. You could put all the custom metrics in one file and pass it with "-tel-metrics-file=pathtofile" when creating the service.
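For example (the -service-type=telegraf value is a placeholder; keep your usual create-service flags):

./firecamp-service-cli -region=us-east-1 -cluster=firecamp-prod -op=create-service \
  -service-type=telegraf -tel-metrics-file=/tmp/cassandra-metrics.txt ...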
Please pay attention to the data format in the metrics file. Each line includes one metric. Every metric should be enclosed in quotation marks and end with a comma; the last metric should not end with a comma. Example:
"/org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency",
"/org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency",
"/org.apache.cassandra.metrics:type=Storage,name=Load"
We just published the 0.9.5 release, which supports Telegraf. You could try the latest firecamp quickstart.
If you have a Cassandra service from the 0.9.4 release, you could follow the upgrade guide to upgrade the cluster. However, there is one limitation: you will have to stop all services before the upgrade. The upgrade will take around 10 minutes. Upgrade will be further enhanced in the next release.
That's great! Thanks for the implementation as well as for the upgrade feature! If I'm running the "latest" release, what are my steps to upgrade correctly?
Upgrade is not supported for the "latest" release. There is no way to know what needs to be upgraded between commits of the latest release.
Custom metrics are supported for Cassandra. Closing this issue.
Would like to open a discussion on this topic. Some suggestions:
Please share your thoughts.
http://jolokia.org/agent/jvm.html
https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics/
http://cassandra.apache.org/doc/latest/operating/metrics.html