Apache Accumulo
https://accumulo.apache.org
Apache License 2.0
Consider adding critical thread metrics for monitoring #946

Open EdColeman opened 5 years ago

EdColeman commented 5 years ago

Exposing metrics for critical process threads could improve monitoring and provide additional insight for performance trending.

For example, certain threads in the master and tserver processes need to run periodically. If they do not, the process is likely unhealthy or having issues. Exposing the fact that these threads are running, progressing, or have successfully completed a task would improve monitoring capabilities. Additionally, if the run time were provided, it could be used to gauge relative health by trending performance over time and across upgrades.
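To make the idea concrete, here is a minimal sketch of a wrapper that records when a periodic task last completed and how long it took. `TrackedTask` and all of its methods are hypothetical names made up for illustration; this is not existing Accumulo code, and a real change would presumably publish these values through whatever metrics system the servers already use.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical wrapper for illustration only; not part of Accumulo. */
final class TrackedTask implements Runnable {
  private final Runnable delegate;
  // -1 means "no successful run yet"
  private final AtomicLong lastSuccessMillis = new AtomicLong(-1);
  private final AtomicLong lastDurationNanos = new AtomicLong(-1);
  private final AtomicLong failures = new AtomicLong();

  TrackedTask(Runnable delegate) {
    this.delegate = delegate;
  }

  @Override
  public void run() {
    long start = System.nanoTime();
    try {
      delegate.run();
      // Only updated when the delegate returns normally.
      lastDurationNanos.set(System.nanoTime() - start);
      lastSuccessMillis.set(System.currentTimeMillis());
    } catch (RuntimeException e) {
      failures.incrementAndGet();
      throw e;
    }
  }

  long lastSuccessMillis() {
    return lastSuccessMillis.get();
  }

  long lastDurationNanos() {
    return lastDurationNanos.get();
  }

  long failureCount() {
    return failures.get();
  }

  /** True if the task never succeeded or last succeeded more than maxAgeMillis ago. */
  boolean isStale(long nowMillis, long maxAgeMillis) {
    long last = lastSuccessMillis.get();
    return last < 0 || nowMillis - last > maxAgeMillis;
  }
}
```

A monitoring system could then alert on `isStale(...)` and graph `lastDurationNanos()` over time, which covers both the "is it running" and the "run-time" parts described above.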

This issue is to capture possible candidate threads / processes that would be beneficial to incorporate into metrics reporting.

keith-turner commented 5 years ago

The tablet server has a thread that continually checks whether tablets need compaction or splitting.

https://github.com/apache/accumulo/blob/c7a54c80b937fd606ebdd6b3672f837f65b6258f/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L2175

The master has three important threads that assign the root, metadata, and user tablets. If these threads are not running, tablets will not be assigned. These threads are all run by TabletGroupWatcher.

https://github.com/apache/accumulo/blob/050dec2003a786ea014c994a38e180a82b997c0d/server/master/src/main/java/org/apache/accumulo/master/TabletGroupWatcher.java#L136

There is currently code that watches compactions and logs a warning if a compaction has not read or written any data in a certain amount of time. In addition to logging a warning, it might be nice to increment a stuck compaction counter, and decrement it when the compaction becomes unstuck.

https://github.com/apache/accumulo/blob/2b9c9275ea5f992cfa2bd1a7e3f8994a41e69df3/server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/CompactionWatcher.java#L31
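A rough sketch of what such a stuck compaction counter could look like follows. The class `StuckCompactionGauge`, its methods, and the string id are all hypothetical and are not taken from CompactionWatcher. It tracks a set of ids rather than a bare counter, on the assumption that the watcher may report the same compaction as stuck more than once, so repeated reports must not inflate the count.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical gauge for illustration only; not part of Accumulo. */
final class StuckCompactionGauge {
  // Ids of compactions currently considered stuck. The id type is an assumption.
  private final Set<String> stuck = ConcurrentHashMap.newKeySet();

  /** Called where the watcher would log its "stuck" warning. Safe to call repeatedly. */
  void markStuck(String compactionId) {
    stuck.add(compactionId);
  }

  /** Called when the compaction makes progress again or finishes. */
  void markProgress(String compactionId) {
    stuck.remove(compactionId);
  }

  /** The value a metrics system would report. */
  int stuckCount() {
    return stuck.size();
  }
}
```

Because the value goes back down when a compaction recovers, it behaves like a gauge of currently stuck compactions rather than a running total of warnings.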

There is also code that looks for stuck tablet loads and logs a warning. It would also be nice to have a counter for stuck loads.

https://github.com/apache/accumulo/blob/b915947c0c22d9db717067b601c62829205d1505/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServerResourceManager.java#L437

EdColeman commented 4 years ago

PR https://github.com/apache/accumulo/pull/1379 added improved metadata consistency checking, which may be another candidate for improved reporting. I initially considered adding it to the gc metrics improvements, but decided that it needs a more comprehensive look and additional testing, which makes it better suited as a separate change.

dlmarion commented 2 years ago

#2524 added monitoring of critical background threads, throwing an Error if a critical background thread terminates abnormally. @EdColeman, do you think #2524 and the other merged PRs linked here are sufficient to close this issue?

EdColeman commented 2 years ago

This was originally proposed as metrics / monitoring at a level where operators and app developers could gain insight into overall health and trends. Having the threads throw exceptions is great, but this was directed more at monitoring and trending higher-level functions, things that could be using multiple threads. @keith-turner provided some concrete examples. Knowing that the expected threads in the TabletGroupWatcher are running, and possibly timing how long each run takes, would allow metrics-based alerting and trending.

This is speculation, and more a description of something desired than a concrete example that I know happens. But assume that the thread handling user tablet assignments gets stuck or dies. If the manager keeps running, that will eventually be noticed through secondary effects: maybe FATE operations for table creates hang and back up or fail, or splits start failing. Exposing that function as a reportable metric could allow intervention sooner. It could also be trended: if the thread starts taking longer and longer to run, one could look at what has changed and fix something before it falls over.
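As an illustration of the trending part only, the sketch below compares each run time against an exponentially weighted moving average of earlier runs and flags a run that is much slower than that baseline. `RunTimeTrend`, the smoothing weight, and the slowdown factor are all hypothetical choices for the example. In practice the raw run times would more likely be exported as a metric and the trending done in the monitoring system.

```java
/** Hypothetical trend tracker for illustration only; not part of Accumulo. */
final class RunTimeTrend {
  private final double alpha; // weight given to the newest sample, in (0, 1]
  private double average = Double.NaN; // NaN until the first sample arrives

  RunTimeTrend(double alpha) {
    if (alpha <= 0 || alpha > 1) {
      throw new IllegalArgumentException("alpha must be in (0, 1]");
    }
    this.alpha = alpha;
  }

  /**
   * Records one run time and returns true if it exceeds the average of
   * earlier runs by more than the given factor.
   */
  synchronized boolean record(double millis, double factor) {
    if (Double.isNaN(average)) {
      average = millis; // first sample becomes the baseline
      return false;
    }
    boolean slow = millis > average * factor;
    average = alpha * millis + (1 - alpha) * average;
    return slow;
  }

  synchronized double average() {
    return average;
  }
}
```

With `alpha = 0.5` and `factor = 2.0`, run times of 100 ms and 110 ms give a baseline of 105 ms, so a following 300 ms run would be flagged as slow.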

milleruntime commented 2 years ago

I thought a good place to do one of these was at the new consistency check thread. See https://github.com/apache/accumulo/pull/2583