cloudius-systems / osv-gui

OSv GUI

Application specific tab - Cassandra #52

Closed tzach closed 10 years ago

tzach commented 10 years ago

Cassandra (C*) tab should be available only if C* is running. It will present C*-related information in charts and text boxes.

Cluster - Text info (mostly static)

reads, write, gossip

source:

slivne commented 10 years ago

I have searched some blogs on the subject and there are some informative ones

http://www.tomas.cat/blog/en/monitoring-cassandra-relevant-data-should-be-watched-and-how-send-it-graphite

Extraction from this blog - some of it covered in the above - but others may be interesting as well

  1. ReadStage, MutationStage, GossipStage tasks

    With these metrics we can measure the activity in each server by counting the number of operations. The three types are read, write and "gossip" (inter-node communication, http://www.datastax.com/docs/1.1/cluster_architecture/gossip). We will gather the total CompletedTasks, which shows how many operations per minute are being executed; the ActiveTasks, which shows how many concurrent tasks are in each node; and the PendingTasks, which shows the "pending" queue length. With this data we can see a lot of things: for instance, if the number of PendingTasks grows consistently, our node may be receiving more queries than it can handle, or maybe we ran out of disk space and, failing to write to the commitlog (http://wiki.apache.org/cassandra/ArchitectureCommitLog), they are piling up (either way, if this metric grows, something is wrong). If the load on our server grows but CompletedTasks increases at the same time, this may be "normal". We can find these values at:

    • http://$host:8081/mbean?objectname=org.apache.cassandra.request%3Atype%3DReadStage
    • http://$host:8081/mbean?objectname=org.apache.cassandra.request%3Atype%3DMutationStage
    • http://$host:8081/mbean?objectname=org.apache.cassandra.internal%3Atype%3DGossipStage

    So I think Completed and Pending are the more interesting ones, and we can split them by operation type, which may also be interesting. They can be drawn as a stacked area chart over time to give the total number of operations. Please note that, according to the description above, the granularity is one minute (need to check in the DataStax documentation).

    ActiveTasks is interesting as a number, so if we already have graphs like Eldan suggested, with numbers on the side, we can use that.

  2. Compaction tasks

    Normally they are related to activity in the cluster. If there are lots of writes, there will usually be compactions. We will gather how many compactions are pending (PendingTasks) and completed (CompletedTasks), so we know how many there are and whether they are piling up. For instance, if we find a loaded server with a long compaction queue, we should think about lowering compaction priority (nodetool setcompactionthroughput 1). If we see the queue grow consistently, we should think about disabling thrift (nodetool disablethrift) to stop receiving new queries, and giving max priority to compactions to get rid of them as soon as possible (nodetool setcompactionthroughput 999). These metrics will also help us know when a repair, scrub/rebuild, upgradesstables, etc. has ended (although there is now a progress indicator for repairs, since v1.1.9 and 1.2.2). In any case, if these values are usually not zero, we have something to worry about. The link: http://$host:8081/mbean?objectname=org.apache.cassandra.db%3Atype%3DCompactionManager

    As listed above, we can also graph pending/completed over time (check granularity)

  3. Latency

    Here we will get the latency of operations. We want this value to be as low as possible, and if it grows without reason we should find out why. We have 3 latency types, one for each operation: Range (RecentRangeLatencyMicros), Read (RecentReadLatencyMicros) and Write (RecentWriteLatencyMicros). http://$host:8081/mbean?objectname=org.apache.cassandra.db%3Atype%3DStorageProxy

    A latency graph for each operation type

  4. Heap and NonHeap memory usage

    Here we will find how much memory is available to Java, and how much of it is in use. We will get HeapMemoryUsage and NonHeapMemoryUsage. http://$host:8081/mbean?objectname=java.lang%3Atype%3DMemory

    We may have that already from jvm info - but we may want to replicate this into the cassandra page

  5. Number of GarbageCollections

    Here we will gather GarbageCollections (http://en.wikipedia.org/wiki/Garbage_collection_%28computer_science%29) in the system. This is related to the previous metric (Java heap), because each GarbageCollection frees some memory. It helps us spot when the java process is GarbageCollecting too often and ends up spending more time doing so than on its main task (reading and writing data!). We should check the GC frequency (ConcurrentMarkSweep). If collections are too frequent, we may need to give the java process more memory. In any case, we want this value to be as low as possible. http://$host:8081/mbean?objectname=java.lang%3Atype%3DGarbageCollector%2Cname%3DConcurrentMarkSweep

    Outside JMX there are also interesting things

We may have that already from jvm info - but we may want to replicate this into the cassandra page
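The mx4j URLs quoted in the list above all follow one pattern: the JMX object name is percent-encoded and passed as the `objectname` query parameter. A minimal Python sketch of that pattern follows. It assumes the mx4j HTTP adaptor on port 8081 as in the blog excerpt; the `MBEANS` table and the `mbean_url` helper are illustrative names, not part of osv-gui or Cassandra.

```python
from urllib.parse import quote

# Object names copied from the blog excerpt above; the dictionary keys are
# illustrative labels only.
MBEANS = {
    "read": "org.apache.cassandra.request:type=ReadStage",
    "write": "org.apache.cassandra.request:type=MutationStage",
    "gossip": "org.apache.cassandra.internal:type=GossipStage",
    "compaction": "org.apache.cassandra.db:type=CompactionManager",
    "latency": "org.apache.cassandra.db:type=StorageProxy",
    "memory": "java.lang:type=Memory",
    "gc": "java.lang:type=GarbageCollector,name=ConcurrentMarkSweep",
}


def mbean_url(host: str, object_name: str, port: int = 8081) -> str:
    """Build an mx4j-style URL with the object name percent-encoded."""
    # safe="" forces ':', '=' and ',' to be encoded as %3A, %3D and %2C.
    return f"http://{host}:{port}/mbean?objectname={quote(object_name, safe='')}"


if __name__ == "__main__":
    for label, name in MBEANS.items():
        print(label, mbean_url("localhost", name))
```

Keeping the object names in one table makes it easy to swap them if a given Cassandra build exposes different beans.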

  1. Number of connections

    We want to know how many concurrent connections Cassandra is serving. This way, if Cassandra load increases, we can correlate it with an increase in users. If the number of users in our application doesn't grow but Cassandra connections do, something is wrong (the queries are slower, for instance). If the number of Cassandra connections increases, and so does the number of users in our application, then this is "normal" and we should improve Cassandra (assigning more resources, or tuning the configuration) to fix it. This is a very interesting metric. It could be better, though. It would be great if we could see which transactions are active in Cassandra (as MySQL's show processlist does, http://dev.mysql.com/doc/refman/5.1/en/show-processlist.html), so we could see whether there are any badly constructed queries or any that can be improved. But given Cassandra's architecture, this doesn't seem feasible, so we will settle for the number of connections. I asked on the cassandra-user mailing list (http://mail-archives.apache.org/mod_mbox/cassandra-user/) whether there is any way to get this number (http://mail-archives.apache.org/mod_mbox/cassandra-user/201212.mbox/browser). They answered that there is no such thing, but they found it interesting because it is frequently asked, so a developer ticket was created: https://issues.apache.org/jira/browse/CASSANDRA-5084. Some day it will be implemented, I hope, and we will get this value from JMX. Meanwhile the only way is netstat: connections=$(netstat -tn|grep ESTABLISHED|awk '{print $4}'|grep 9160|wc -l)

    Let's skip this one: it is not available via JMX

  2. Data for each ColumnFamily

    To squeeze more out of Cassandra, it's also interesting to analyze each ColumnFamily's data. This way we can see size, activity, cache success rate, secondary indexes, etc. But this takes lots of queries to mx4j (about 21 for each ColumnFamily, about 2000 HTTP queries in my case!), and this information doesn't change very often, so I won't gather it at the moment. When I do, I'll fetch it at 5-minute or 15-minute intervals to avoid overloading the server, so I'll put that in a separate script.

    No info on how to do that
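For reference, the netstat pipeline quoted under "Number of connections" can be mirrored in a few lines of Python. This is only a sketch: it assumes Linux-style `netstat -tn` output (local address in the fourth column, state in the sixth) and the Thrift port 9160 from the blog. The `count_established` helper and the `SAMPLE` text are made up for illustration. Unlike the `grep 9160` in the shell version, it matches the port exactly.

```python
# Made-up sample of `netstat -tn` output, used only to exercise the parser.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.5:9160           10.0.0.9:51234          ESTABLISHED
tcp        0      0 10.0.0.5:9160           10.0.0.8:40022          TIME_WAIT
tcp        0      0 10.0.0.5:22             10.0.0.7:51000          ESTABLISHED
tcp6       0      0 ::1:9160                ::1:45000               ESTABLISHED
"""


def count_established(netstat_output: str, port: int = 9160) -> int:
    """Count ESTABLISHED connections whose local address ends in :port."""
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Proto Recv-Q Send-Q Local Foreign State
        if len(fields) < 6 or fields[5] != "ESTABLISHED":
            continue
        # rsplit keeps IPv6 addresses such as ::1:9160 working.
        if fields[3].rsplit(":", 1)[-1] == str(port):
            count += 1
    return count


if __name__ == "__main__":
    print(count_established(SAMPLE))
```

In a real collector the text would come from running netstat (for example via `subprocess.run`), which is left out here to keep the sketch self-contained.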

There is also an image : http://www.tomas.cat/blog/sites/default/files/xgraphite-dash.png.pagespeed.ic.AMqjgIo17k.png

AppDynamics provides a plugin for Cassandra: https://www.appdynamics.com/database/cassandra/

The interesting part here is the per-transaction breakdown, yet I suspect this is done via their agent and not via Cassandra mbeans

ManageEngine also has one: http://www.manageengine.com/products/applications_manager/cassandra-monitoring.html

Aside from JVM information that should be replicated into this page as well

On Thu, Sep 4, 2014 at 10:57 AM, Tzach Livyatan notifications@github.com wrote:

Cassandra (C*) tab should be available only if C* is running. It will present C*-related information in charts and text boxes.

Cluster - Text info (mostly static)

  • org.apache.cassandra.service.StorageService.Attributes.LiveNodes A set of the nodes which are visible and live, from the perspective of this node
  • org.apache.cassandra.service.StorageService.Attributes.LoadMap A map of which nodes have what level of load (present as a table)

Compaction Manager text info (mostly static)

  • org.apache.cassandra.db.CompactionManager.Attributes.MaximumCompactionThreshold The maximum number of SSTables in the compaction queue before compaction kicks off.
  • org.apache.cassandra.db.CompactionManager.Attributes.MinimumCompactionThreshold The minimum number of SSTables in the compaction queue before compaction kicks off.
  • org.apache.cassandra.db.CompactionManager.Attributes.PendingTasks The number of tasks waiting in the queue to be executed.
  • org.apache.cassandra.service.StorageService.Attributes.Token A string describing the start of the range of keys this node is responsible for on the ring.

Charts

  • org.apache.cassandra.db.CompactionManager.Attributes.BytesCompacted
  • org.apache.cassandra.db.CompactionManager.Attributes.BytesTotalInProgress

DB charts

  • org.apache.cassandra.db.CommitLog.Attributes.ActiveCount The number of tasks which are currently executing.
  • org.apache.cassandra.db.CommitLog.Attributes.CompletedTasks The number of completed tasks.

source: http://wiki.apache.org/cassandra/JmxInterface


tzach commented 10 years ago

@slivne, we both looked at the same blog post :) Many of your additions (compaction, read, write, gossip...) are already included. Can you please clean up the long text to identify which metrics you suggest adding?

For JVM info, we have #36 Please review and comment there.

dorlaor commented 10 years ago

These beans and the rest will be highly valuable to add as an application tab. Cheers!


dzautner commented 10 years ago

I have built and run the Cassandra image, but for some reason I don't seem to have the following MBeans in the Jolokia API: org.apache.cassandra.interna/type=ReadStage org.apache.cassandra.interna/type=MutationStage

Also, the GossipStage returns the following JSON:


{
 "CompletedTasks":0,
 "PendingTasks":0,
 "TotalBlockedTasks":0,
 "ActiveCount":0,
 "MaximumThreads":1,
 "CoreThreads":1,
 "CurrentlyBlockedTasks":0
}

which information is relevant to the chart?
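Going by the earlier comment that Completed and Pending are the most interesting values, with the active count shown as a plain number, one possible way to split this payload is sketched below. The split into `CHART_FIELDS` and `TEXT_FIELDS`, and the `select_fields` helper, are assumptions made for illustration, not something the thread settles at this point.

```python
import json

# The GossipStage payload quoted above.
PAYLOAD = """
{
 "CompletedTasks":0,
 "PendingTasks":0,
 "TotalBlockedTasks":0,
 "ActiveCount":0,
 "MaximumThreads":1,
 "CoreThreads":1,
 "CurrentlyBlockedTasks":0
}
"""

# Assumed split: counters worth plotting over time vs. values shown as text.
CHART_FIELDS = ("CompletedTasks", "PendingTasks")
TEXT_FIELDS = ("ActiveCount",)


def select_fields(raw: str) -> dict:
    """Pick the chart and text values out of one stage's attribute map."""
    stats = json.loads(raw)
    return {
        "chart": {key: stats[key] for key in CHART_FIELDS},
        "text": {key: stats[key] for key in TEXT_FIELDS},
    }


if __name__ == "__main__":
    print(select_fields(PAYLOAD))
```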

slivne commented 10 years ago

Ok, here is my take to define what information we should extract and display

Cassandra (C*) tab should be available only if C* is running. It will present C*-related information in charts and text boxes. Cluster - Text info (mostly static) - I don't think this is useful at a node level - we can remove it

Operation Completed chart - a single stacked area chart (piling up the values of all 3 - the sum is the total of all operations)

reads, write, gossip

On the side of this chart we can add the active numbers - single number not delta

Operation Pending chart - single chart - 3 lines

reads, write, gossip

Total Latency Chart Over Time - single chart - 3 lines

(I am not sure this relates to operations above if at all)

Avg Latency Chart Over Time - (delta)/(delta) - single chart - 3 lines

(I am not sure this relates to operations above if at all)
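The "(delta)/(delta)" average above can be read as the change in accumulated latency divided by the change in the number of operations between two samples. A minimal sketch follows, assuming each sample carries a cumulative latency total and a cumulative operation count. The parameter names are hypothetical and are not taken from the Cassandra MBeans.

```python
from typing import Optional


def average_latency(prev_total_micros: float, prev_ops: int,
                    cur_total_micros: float, cur_ops: int) -> Optional[float]:
    """Average latency per operation between two samples, in microseconds.

    Returns None when no operations completed in the interval (or the
    counters were reset), so the chart can leave a gap instead of plotting
    a division by zero.
    """
    delta_ops = cur_ops - prev_ops
    if delta_ops <= 0:
        return None
    return (cur_total_micros - prev_total_micros) / delta_ops


if __name__ == "__main__":
    # 600 microseconds accumulated over 3 operations
    print(average_latency(1000, 10, 1600, 13))
```

Because both deltas cover the same interval, this value does not depend on how evenly the samples are spaced.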

Compaction Manager (two charts)

DB (two charts)

JVM (one/two charts)

Two charts or one - copy paste from the JVM Tab

Heap / GC

OS (two charts)

Two charts, one for Disk IO and one for Networking IO

Disk IO (based on trace point counters) / Networking IO (based on trace point counters)

source:


slivne commented 10 years ago

I sent my take on the info. I did not find them either in the JMX link tzach provided. I did find others that provide the information, but we may need to aggregate their values.


dzautner commented 10 years ago

What would be the best way to put some load on Cassandra to see changes in the latency data?

tzach commented 10 years ago

I have built and ran the Cassandra image but I don't seem to have the following MBeans in the Jolokia API for some reason: org.apache.cassandra.interna/type=ReadStage org.apache.cassandra.interna/type=MutationStage

Typo in my original post (now fixed). The 4 mbeans are

I used jconsole to connect to Cassandra (port 7199) and verified the above.

dzautner commented 10 years ago

Should I show the ActiveCount/CompletedTasks with the GossipStage as well?

dzautner commented 10 years ago

Some progress: [screenshot: cassandratab]

dzautner commented 10 years ago

I can not find the following MBeans either: org.apache.cassandra.db.CompactionManager.Attributes.BytesCompacted org.apache.cassandra.db.CompactionManager.Attributes.BytesTotalInProgress

EDIT: Also this one: org.apache.cassandra.db.CommitLog.Attributes.ActiveCount

tzach commented 10 years ago

I can not find the following MBeans either: org.apache.cassandra.db.CompactionManager.Attributes.BytesCompacted org.apache.cassandra.db.CompactionManager.Attributes.BytesTotalInProgress

There seem to be differences between the C* project and the DataStax version. Here are the mbeans, found with JConsole

dzautner commented 10 years ago

Thanks, I was able to find them now

tzach commented 10 years ago

Should I show the ActiveCount/CompletedTasks with the GossipStage as well?

Yes. For the completed counters, you should present the derivative in the chart and the absolute value in text format. There is no point in charting a monotonically increasing function.

dzautner commented 10 years ago

Is TotalCompactionsCompleted also a counter?

dzautner commented 10 years ago

Note, derivative shows the difference between two data points and our sampling rate is not constant so it might confuse users to think that the graph is displaying a "per time interval" (e.g. writes/s) data.

tzach commented 10 years ago

Note, derivative shows the difference between two data points and our sampling rate is not constant so it might confuse users to think that the graph is displaying a "per time interval" (e.g. writes/s) data.

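Good point. This is why we added a timestamp to our API. I see two alternatives:

  • Show the value, not the derivative, in text format
  • Show the derivative, but calculate it over the last 10 iterations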

nyh commented 10 years ago

On Thu, Sep 4, 2014 at 5:00 PM, Tzach Livyatan notifications@github.com wrote:

Note, derivative shows the difference between two data points and our sampling rate is not constant so it might confuse users to think that the graph is displaying a "per time interval" (e.g. writes/s) data.

You should divide the difference of the two values by the interval's length.

If you do that, you will get the proper "derivative". If you remember from a calculus course, the derivative is actually the limit of the above ratio as the time interval approaches zero. Moreover, according to the mean value theorem (see http://en.wikipedia.org/wiki/Mean_value_theorem), when the interval has a finite (not approaching zero) size, the ratio is still equal to the derivative at some unknown point inside the interval.
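A minimal sketch of that division, assuming each sample is a (timestamp in seconds, counter value) pair. The `rates` helper and the sample layout are illustrative and are not the osv-gui API.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, cumulative counter value)


def rates(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Turn cumulative counter samples into (timestamp, per-second rate) points."""
    points = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            # Skip duplicate or out-of-order timestamps rather than divide by zero.
            continue
        points.append((t1, (v1 - v0) / dt))
    return points


if __name__ == "__main__":
    # Uneven sampling: the raw differences are 40 and 30, while the
    # per-second rates are 20 and 10.
    print(rates([(0.0, 100), (2.0, 140), (5.0, 170)]))
```

With uneven sampling the raw differences and the per-second rates tell different stories, which is the confusion raised earlier in the thread.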

Good point. This is why we added timestamp to our API. I see two alternatives:

  • Show the value, not the derivative, in text format
  • Show the derivative, but calculate it over the last 10 iterations

Why not a third option, of showing \delta f / \delta t?
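The quoted alternative of calculating over the last 10 iterations and the division by elapsed time can also be combined: keep a short window of timestamped samples and divide the change across the window by the time it spans. A sketch under that assumption follows. The `WindowedRate` class is hypothetical and the default of 10 intervals simply mirrors the number mentioned above.

```python
from collections import deque
from typing import Optional


class WindowedRate:
    """Per-second rate of a cumulative counter over the last `window` intervals."""

    def __init__(self, window: int = 10) -> None:
        # window + 1 samples are needed to span `window` intervals.
        self._samples = deque(maxlen=window + 1)

    def add(self, timestamp: float, value: float) -> Optional[float]:
        """Record a sample and return the rate across the current window."""
        self._samples.append((timestamp, value))
        oldest_time, oldest_value = self._samples[0]
        elapsed = timestamp - oldest_time
        if len(self._samples) < 2 or elapsed <= 0:
            return None  # not enough data yet
        return (value - oldest_value) / elapsed


if __name__ == "__main__":
    meter = WindowedRate(window=2)
    for t, v in [(0, 0), (1, 10), (3, 30), (4, 70)]:
        print(t, meter.add(t, v))
```

A longer window smooths out spikes at the cost of reacting more slowly to changes in load.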

Nadav Har'El nyh@cloudius-systems.com

dzautner commented 10 years ago

You should divide the difference of the two values by the interval's length.

Wouldn't that give us the "value per time interval" (writes/s)?

dzautner commented 10 years ago

Should this issue be closed at this point?