Closed: tzach closed this issue 10 years ago
I searched some blogs on the subject and found a few informative ones.
Below is an extract from one of them. Some of it is already covered above, but other parts may be interesting as well.
ReadStage, MutationStage, GossipStage tasks
With these metrics we can measure the activity on each server by counting the number of operations. The three types are read, write and "gossip" (inter-node communication, http://www.datastax.com/docs/1.1/cluster_architecture/gossip). We will gather the total CompletedTasks, which shows how many operations per minute are being executed; the ActiveTasks, which shows how many concurrent tasks are running on each node; and the PendingTasks, which shows the length of the "pending" queue. With this data we can see a lot of things: for instance, if the number of PendingTasks grows consistently, our node may be receiving more queries than it can handle, or maybe we ran out of disk space and writes to the commitlog (http://wiki.apache.org/cassandra/ArchitectureCommitLog) are failing and piling up (either way, if this metric keeps growing, something is wrong). If the load on our server grows but CompletedTasks increases at the same time, this may be "normal". We can find these values at:
http://$host:8081/mbean?objectname=org.apache.cassandra.request%3Atype%3DReadStage
http://$host:8081/mbean?objectname=org.apache.cassandra.request%3Atype%3DMutationStage
http://$host:8081/mbean?objectname=org.apache.cassandra.internal%3Atype%3DGossipStage
So I think Completed and Pending are the more interesting ones, and we can split them up by operation type, which may also be interesting. They can be drawn as a stacked area chart over time to show the total number of operations. Please note that according to the description above the granularity is one minute (need to check in the DataStax documentation).
ActiveTasks is interesting as a number, so if we already have graphs like Eldan suggested, with numbers on the side, we can use that.
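The mx4j URLs quoted above are JMX object names with the `:` and `=` characters percent-encoded. A minimal sketch of building them follows; the helper names `mbean_url` and `stage_urls` are made up for illustration, and port 8081 is taken from the URLs in the blog extract.

```python
from urllib.parse import quote

MX4J_PORT = 8081  # port used in the URLs quoted in the blog extract

# JMX object names as they appear (percent-encoded) in the blog's URLs.
STAGE_MBEANS = {
    "read": "org.apache.cassandra.request:type=ReadStage",
    "write": "org.apache.cassandra.request:type=MutationStage",
    "gossip": "org.apache.cassandra.internal:type=GossipStage",
}


def mbean_url(host, object_name, port=MX4J_PORT):
    """Build an mx4j-style URL for one JMX object name (hypothetical helper)."""
    # safe="" forces ':' '=' and ',' to be percent-encoded, as in the blog's URLs.
    encoded = quote(object_name, safe="")
    return "http://{}:{}/mbean?objectname={}".format(host, port, encoded)


def stage_urls(host):
    """Return one URL per stage (read, write, gossip)."""
    return {name: mbean_url(host, obj) for name, obj in STAGE_MBEANS.items()}
```

Calling `stage_urls("node1")` gives one URL per operation type, which a poller could fetch on each sampling tick.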
Compaction tasks
Normally they are related to activity in the cluster: if there are lots of writes, there will usually be compactions. We will gather how many compactions are pending (PendingTasks) and completed (CompletedTasks), so we know how many there are and whether they are piling up. For instance, if we find a loaded server with a long compaction queue, we should think about lowering compaction priority (nodetool setcompactionthroughput 1); or, if we see the queue growing consistently, we should think about disabling thrift (nodetool disablethrift) to stop receiving new queries and giving maximum priority to compactions, to get rid of them as soon as possible (nodetool setcompactionthroughput 999). These metrics will also help us know when a repair, scrub/rebuild, upgradesstables, etc. has ended (although there is now a progress indicator for repairs, since v1.1.9 and 1.2.2). In any case, if these values are usually non-zero, we have reason to worry. The link:
http://$host:8081/mbean?objectname=org.apache.cassandra.db%3Atype%3DCompactionManager
As listed above, we can also graph this over time for pending/completed (check granularity).
Latency
Here we will get the latency of operations. We want this value to be as low as possible, and if it grows without reason we should find out why. We have 3 latency types, one per operation: Range (RecentRangeLatencyMicros), Read (RecentReadLatencyMicros) and Write (RecentWriteLatencyMicros).
http://$host:8081/mbean?objectname=org.apache.cassandra.db%3Atype%3DStorageProxy
A latency graph for each operation type
Heap and non-heap memory usage
Here we will find how much memory is available to Java and how much of it is in use. We will get HeapMemoryUsage and NonHeapMemoryUsage.
http://$host:8081/mbean?objectname=java.lang%3Atype%3DMemory
We may have that already from jvm info - but we may want to replicate this into the cassandra page
Number of GarbageCollections
Here we will gather garbage collections (http://en.wikipedia.org/wiki/Garbage_collection_%28computer_science%29) in the system. This is related to the former metric (Java heap), because each garbage collection frees some memory. It helps us detect when the Java process is garbage collecting too often and ends up spending more time doing so than on its main task (reading and writing data!). We should check the GC frequency (ConcurrentMarkSweep). If it runs too often, we may need to give the Java process more memory. In any case, we want this value to be as low as possible.
http://$host:8081/mbean?objectname=java.lang%3Atype%3DGarbageCollector%2Cname%3DConcurrentMarkSweep
Outside JMX there are also interesting things
We may have that already from jvm info - but we may want to replicate this into the cassandra page
Number of connections
We want to know how many concurrent connections Cassandra is serving. This way, if Cassandra load increases, we can correlate it with an increase in users. If the number of users in our application doesn't grow but Cassandra connections do, something is wrong (the queries are slower, for instance). If the number of Cassandra connections increases and so does the number of users in our application, then this is "normal" and we should improve Cassandra (assigning more resources, or tuning the configuration) to fix it. This is a very interesting metric. It could be better, though. It would be great if we could see which transactions are active in Cassandra (as MySQL's show processlist does, http://dev.mysql.com/doc/refman/5.1/en/show-processlist.html) so we could see whether there is any badly constructed query or any that can be improved. But given Cassandra's architecture, this doesn't seem feasible, so we will settle for the number of connections. I asked on the cassandra-users mailing list (http://mail-archives.apache.org/mod_mbox/cassandra-user/) whether there is any way to get this number (http://mail-archives.apache.org/mod_mbox/cassandra-user/201212.mbox/browser), and they answered that there is no such thing, but they found it interesting because it is frequently asked, so a developer ticket was created (https://issues.apache.org/jira/browse/CASSANDRA-5084). Some day it will be implemented, I hope, and we will get this value from JMX. Meanwhile the only way is netstat:
connections=$(netstat -tn | grep ESTABLISHED | awk '{print $4}' | grep 9160 | wc -l)
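The same count can be sketched in Python by parsing `netstat -tn` text. This is an illustration only: `count_established` is a made-up helper, it assumes the usual Linux column layout (local address in the fourth column, state in the sixth), and it matches the port exactly, which is slightly stricter than the `grep 9160` in the one-liner above.

```python
THRIFT_PORT = 9160  # port used in the netstat one-liner


def count_established(netstat_output, port=THRIFT_PORT):
    """Count ESTABLISHED connections whose local port equals `port`.

    `netstat_output` is the text printed by `netstat -tn` (assumed Linux layout:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State).
    """
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[5] != "ESTABLISHED":
            continue  # banner, header, or a connection in another state
        local_port = fields[3].rsplit(":", 1)[-1]
        if local_port == str(port):
            count += 1
    return count
```

The text could come from `subprocess.run(["netstat", "-tn"], capture_output=True, text=True).stdout` on a node where netstat is installed.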
Let's skip this one: it is not available via JMX.
To further squeeze Cassandra it's also interesting to analyze each ColumnFamily's data. This way we can see size, activity, cache success rate, secondary indexes, etc. But these are lots of queries to mx4j (about 21 for each ColumnFamily, about 2000 HTTP queries in my case!), and this information doesn't change so often, so I won't gather it at the moment. When I do, I'll gather it at 5-minute or 15-minute intervals to avoid overloading the server, so I'll put that in a separate script.
No info on how to do that.
There is also an image : http://www.tomas.cat/blog/sites/default/files/xgraphite-dash.png.pagespeed.ic.AMqjgIo17k.png
AppDynamics provides a plugin for cassandra: https://www.appdynamics.com/database/cassandra/
The interesting part here is per transaction breakdown - yet I suspect this is via their agent and not via Cassandra mbeans
ManageEngine also has one: http://www.manageengine.com/products/applications_manager/cassandra-monitoring.html
Aside from that, JVM information should be replicated into this page as well.
On Thu, Sep 4, 2014 at 10:57 AM, Tzach Livyatan notifications@github.com wrote:
Cassandra (C*) tab should be available only if C* is running. It will present C* related information in charts and text box.
Cluster Text info (mostly static)
- org.apache.cassandra.service.StorageService.Attributes.LiveNodes A set of the nodes which are visible and live, from the perspective of this node
- org.apache.cassandra.service.StorageService.Attributes.LoadMap A map of which nodes have what level of load (present as a table)
Compaction Manager text info (mostly static)
- org.apache.cassandra.db.CompactionManager.Attributes.MaximumCompactionThreshold The maximum number of SSTables in the compaction queue before compaction kicks off.
- org.apache.cassandra.db.CompactionManager.Attributes.MinimumCompactionThreshold The minimum number of SSTables in the compaction queue before compaction kicks off.
- org.apache.cassandra.db.CompactionManager.Attributes.PendingTasks The number of tasks waiting in the queue to be executed.
- org.apache.cassandra.service.StorageService.Attributes.Token A string describing the start of the range of keys this node is responsible for on the ring.
Charts
org.apache.cassandra.db.CompactionManager.Attributes.BytesCompacted
org.apache.cassandra.db.CompactionManager.Attributes.BytesTotalInProgress
DB charts
- org.apache.cassandra.db.CommitLog.Attributes.ActiveCount The number of tasks which are currently executing.
- org.apache.cassandra.db.CommitLog.Attributes.CompletedTasks The number of completed tasks.
source: http://wiki.apache.org/cassandra/JmxInterface
— Reply to this email directly or view it on GitHub https://github.com/cloudius-systems/osv-gui/issues/52.
@slivne, we both looked at the same blog post :) Much of your additions (compaction, read, write, gossip..) are already included. Can you please clean the long text to identify what metric you suggest to add?
For JVM info, we have #36 Please review and comment there.
These beans and the rest will be highly valuable to add as an application tab. Cheers!
I built and ran the Cassandra image, but for some reason I don't seem to have the following MBeans in the Jolokia API: org.apache.cassandra.interna/type=ReadStage org.apache.cassandra.interna/type=MutationStage
Also, the GossipStage returns the following JSON:
{
"CompletedTasks":0,
"PendingTasks":0,
"TotalBlockedTasks":0,
"ActiveCount":0,
"MaximumThreads":1,
"CoreThreads":1,
"CurrentlyBlockedTasks":0
}
Which information is relevant to the chart?
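One way to frame the question is to split the payload into gauges (plotted as-is) and counters (plotted as a delta between samples). The sketch below assumes that split; the grouping into `GAUGES` and `COUNTERS` is a reading of this thread and not something the MBean itself states, and `split_stage_metrics` is a made-up helper name.

```python
import json

# Assumed grouping: gauges are shown as absolute values,
# counters only ever grow and are charted as deltas.
GAUGES = ("ActiveCount", "PendingTasks")
COUNTERS = ("CompletedTasks",)


def split_stage_metrics(payload):
    """Parse a stage JSON payload and return (gauges, counters) dicts."""
    data = json.loads(payload)
    gauges = {key: data[key] for key in GAUGES}
    counters = {key: data[key] for key in COUNTERS}
    return gauges, counters


if __name__ == "__main__":
    sample = ('{"CompletedTasks":0,"PendingTasks":0,"TotalBlockedTasks":0,'
              '"ActiveCount":0,"MaximumThreads":1,"CoreThreads":1,'
              '"CurrentlyBlockedTasks":0}')
    print(split_stage_metrics(sample))
```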
Ok, here is my take to define what information we should extract and display
Cassandra (C*) tab should be available only if C* is running. It will present C* related information in charts and text box.
Cluster - Text info (mostly static) - I don't think this is useful at a node level - we can remove
Operation Completed chart - a single stacked area chart (piling up the values of all 3; the sum is the total of all operations)
reads, write, gossip
delta of org.apache.cassandra.interna/type=MutationStage / sum(org.apache.cassandra.concurrent.*.ROW-MUTATION-STAGE CompletedTasks)
wiki: http://wiki.apache.org/cassandra/JmxInterface#org.apache.cassandra.concurrent , http://wiki.apache.org/cassandra/JmxInterface#org.apache.cassandra.concurrent.ROW-MUTATION-STAGE , http://wiki.apache.org/cassandra/JmxInterface#org.apache.cassandra.concurrent.ROW-READ-STAGE
On the side of this chart we can add the active numbers - single number not delta
org.apache.cassandra.interna/type=MutationStage / sum(org.apache.cassandra.concurrent.*.ROW-MUTATION-STAGE ActiveTasks)
wiki: http://wiki.apache.org/cassandra/JmxInterface#org.apache.cassandra.concurrent , http://wiki.apache.org/cassandra/JmxInterface#org.apache.cassandra.concurrent.ROW-MUTATION-STAGE , http://wiki.apache.org/cassandra/JmxInterface#org.apache.cassandra.concurrent.ROW-READ-STAGE
Operation Pending chart - single chart - 3 lines
reads, write, gossip
delta of org.apache.cassandra.interna/type=ReadStage / sum(org.apache.cassandra.concurrent.*.ROW-READ-STAGE PendingTasks)
delta of org.apache.cassandra.interna/type=MutationStage / sum(org.apache.cassandra.concurrent.*.ROW-MUTATION-STAGE PendingTasks)
wiki: http://wiki.apache.org/cassandra/JmxInterface#org.apache.cassandra.concurrent , http://wiki.apache.org/cassandra/JmxInterface#org.apache.cassandra.concurrent.ROW-READ-STAGE , http://wiki.apache.org/cassandra/JmxInterface#org.apache.cassandra.concurrent.ROW-MUTATION-STAGE
Total Latency Chart Over Time - single chart - 3 lines
(I am not sure this relates to operations above if at all)
Avg Latency Chart Over Time - (delta)/(delta) - single chart - 3 lines
(I am not sure this relates to operations above if at all)
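The "(delta)/(delta)" idea for the average latency chart can be sketched as the change in accumulated latency divided by the change in operation count between two samples. The function below is an illustration under that reading; it assumes each sample is a pair of monotonically increasing totals (accumulated latency in microseconds, number of operations), and `average_latency_micros` is a made-up name, not an attribute of any MBean.

```python
def average_latency_micros(prev, cur):
    """Average latency per operation between two samples.

    Each sample is a tuple (total_latency_micros, operation_count),
    both assumed to be ever-increasing totals.
    Returns None when no operation completed in between.
    """
    prev_total, prev_count = prev
    cur_total, cur_count = cur
    operations = cur_count - prev_count
    if operations <= 0:
        return None  # nothing to average over (or the counter was reset)
    return (cur_total - prev_total) / operations
```

Returning `None` for an idle interval lets the chart leave a gap instead of plotting a misleading zero.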
Compaction Manager (two charts)
org.apache.cassandra.db.CompactionManager.Attributes.BytesTotalInProgress
DB (two charts)
JVM (one/two charts)
Two charts or one - copy paste from the JVM Tab
Heap / GC
OS (two charts)
Two charts: one for Disk IO and one for Networking IO
Disk IO (based on trace point counters) / Networking IO (based on trace point counters)
source:
I sent my take on the info. I did not find them either in the JMX link Tzach provided. I did find others that provide the information, but we may need to aggregate their values.
What would be the best way to put some load on Cassandra to see changes in the latency data?
> I have built and ran the Cassandra image but I don't seem to have the following MBeans in the Jolokia API for some reason: org.apache.cassandra.interna/type=ReadStage org.apache.cassandra.interna/type=MutationStage
Typo in my original post (now fixed). The 4 mbeans are
I used jconsole to connect to Cassandra (port 7199) and verify the above.
Should I show the ActiveCount/CompletedTasks with the GossipStage as well?
some progress:
I can not find the following MBeans either: org.apache.cassandra.db.CompactionManager.Attributes.BytesCompacted org.apache.cassandra.db.CompactionManager.Attributes.BytesTotalInProgress
EDIT: Also this one: org.apache.cassandra.db.CommitLog.Attributes.ActiveCount
> I can not find the following MBeans either: org.apache.cassandra.db.CompactionManager.Attributes.BytesCompacted org.apache.cassandra.db.CompactionManager.Attributes.BytesTotalInProgress
There seem to be differences between the C* project and the DataStax version. Here are the mbeans, found with JConsole
Thanks, I was able to find them now
> Should I show the ActiveCount/CompletedTasks with the GossipStage as well?
Yes. For the completed counters, you should present the derivative in the chart, and the absolute value in text format. There is no point in charting a monotonically increasing function.
Is TotalCompactionsCompleted also a counter?
Note, derivative shows the difference between two data points and our sampling rate is not constant so it might confuse users to think that the graph is displaying a "per time interval" (e.g. writes/s) data.
> Note, derivative shows the difference between two data points and our sampling rate is not constant so it might confuse users to think that the graph is displaying a "per time interval" (e.g. writes/s) data.
Good point. This is why we added timestamp to our API. I see two alternatives:
- Show the value, not the derivative, in text format
- Show the derivative, but calculate it over the last 10 iterations
On Thu, Sep 4, 2014 at 5:00 PM, Tzach Livyatan notifications@github.com wrote:
> Note, derivative shows the difference between two data points and our sampling rate is not constant so it might confuse users to think that the graph is displaying a "per time interval" (e.g. writes/s) data.
You should divide the difference of the two values by the interval's length.
If you do that, you will get the proper "derivative". If you remember from a calculus course, the derivative is the limit of the above ratio as the time interval approaches zero. Moreover, according to the mean value theorem (see http://en.wikipedia.org/wiki/Mean_value_theorem), when the interval has a finite (not approaching zero) size, the ratio is still equal to the derivative at some unknown point inside the interval.
> Good point. This is why we added timestamp to our API. I see two alternatives:
> - Show the value, not the derivative, in text format
> - Show the derivative, but calculate it over the last 10 iterations
Why not a third option, of showing \delta f / \delta t?
Nadav Har'El nyh@cloudius-systems.com
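Both the delta-over-time ratio and the "last 10 iterations" variant discussed here can be sketched with timestamped samples. The class below is an illustration, not the GUI's actual code: `RateTracker` and its method names are made up, timestamps are assumed to be in seconds, and values are assumed to come from an ever-increasing counter such as CompletedTasks.

```python
from collections import deque


class RateTracker:
    """Turn timestamped counter samples into per-second rates (hypothetical helper)."""

    def __init__(self, window=10):
        # window intervals need window + 1 samples
        self.samples = deque(maxlen=window + 1)

    def add(self, timestamp, value):
        """Record one sample: timestamp in seconds, counter value."""
        self.samples.append((timestamp, value))

    @staticmethod
    def _rate(older, newer):
        dt = newer[0] - older[0]
        if dt <= 0:
            return None  # identical or out-of-order timestamps
        return (newer[1] - older[1]) / dt

    def latest_rate(self):
        """Delta value / delta time over the two most recent samples."""
        if len(self.samples) < 2:
            return None
        return self._rate(self.samples[-2], self.samples[-1])

    def window_rate(self):
        """Delta value / delta time over the whole retained window."""
        if len(self.samples) < 2:
            return None
        return self._rate(self.samples[0], self.samples[-1])
```

Because each rate is divided by the measured interval, an uneven sampling rate changes only how smooth the curve looks, not its units: the chart can still be labelled as operations per second.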
> You should divide the difference of the two values by the interval's length.
Wouldn't that give us the "value per time interval" (writes/s)?
Should this issue be closed at this point?
Cassandra (C*) tab should be available only if C* is running. It will present C* related information in charts and text box.
Cluster - Text info (mostly static)
Operation charts
reads, write, gossip
Latency (charts)
Compaction Manager (charts)
DB (charts)
source: