Open EdColeman opened 5 years ago
The tablet server has a thread that continually checks whether tablets need compaction or splitting.
The master has three really important threads that assign root, metadata, and user tablets. If these threads are not running, then tablets will not be assigned. These threads are all run by TabletGroupWatcher.
There is currently code that watches compactions and logs a warning if a compaction has not read or written any data in a certain amount of time. In addition to logging a warning, it might be nice to increment a stuck compaction counter. It could decrement when the compaction becomes unstuck.
There is also code that looks for stuck tablet loads and logs a warning. It would also be nice to have a counter for stuck loads.
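The increment/decrement idea for stuck compactions and stuck loads could be sketched roughly as below. This is only an illustration under stated assumptions: `StuckOperationGauge`, `markStuck`, and `markUnstuck` are hypothetical names, not existing Accumulo classes or methods, and how the value would be published to the metrics system is left out. Tracking ids in a set rather than a bare counter means repeated warnings for the same operation are counted once.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical gauge of operations (compactions, tablet loads) currently considered stuck. */
final class StuckOperationGauge {
  // Ids of operations currently flagged as stuck. Using a set means a second
  // warning for the same operation does not inflate the count.
  private final Set<String> stuck = ConcurrentHashMap.newKeySet();

  /** Intended to be called at the point where the existing warning is logged. */
  void markStuck(String operationId) {
    stuck.add(operationId);
  }

  /** Intended to be called when the operation makes progress again or finishes. */
  void markUnstuck(String operationId) {
    stuck.remove(operationId);
  }

  /** Current number of stuck operations; this is the value a metric would report. */
  int getStuckCount() {
    return stuck.size();
  }
}
```

One instance per kind of operation (one for compactions, one for tablet loads) would keep the two counts separate.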
PR: https://github.com/apache/accumulo/pull/1379 - added improved metadata consistency checking - this may be another candidate for improved reporting. I initially considered adding it to the gc metrics improvements, but decided that a more comprehensive look and additional testing make it better suited as a separate change.
This was originally proposed as metrics / monitoring at a level such that operators and app developers could gain insight into overall health and trends. Having the threads throw exceptions is great, but this was directed more at allowing monitoring and trending of higher-level functions - things that could be using multiple threads. @keith-turner provided some concrete examples. Knowing that the expected threads in the TabletGroupWatcher are running, and possibly timing how long each run takes, would allow metrics alerting and trending.
This is speculation, and more a description of something desired than a concrete example that I know happens. But assume that the thread handling user tablet assignments gets stuck or dies. If the manager keeps running, that will eventually be noticed through secondary effects: maybe FATEs on table creates hang and back up or fail, or splits start failing. Exposing that function as a reportable metric could allow intervention sooner. Or it could be trended, and if the thread starts taking longer and longer to run, one could look at what has changed and fix something before it falls over.
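As a rough illustration of "is the thread running, and how long does each run take", a periodic loop could wrap each iteration in a small recorder like the one below. This is a sketch only: `PeriodicTaskHealth` and its methods are hypothetical names, not part of Accumulo, and wiring the values into a metrics registry is omitted. The clock is injected so that the behavior can be checked without real waiting.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical recorder for a periodic thread: run count, last duration, time since last run. */
final class PeriodicTaskHealth {
  private final LongSupplier clockNanos;
  private final AtomicLong runsCompleted = new AtomicLong();
  private final AtomicLong lastDurationNanos = new AtomicLong();
  private final AtomicLong lastCompletedNanos;

  PeriodicTaskHealth(LongSupplier clockNanos) {
    this.clockNanos = clockNanos;
    // Start the "time since last run" measurement at creation time.
    this.lastCompletedNanos = new AtomicLong(clockNanos.getAsLong());
  }

  /** Runs one iteration of the watched loop and records its timing on success. */
  void recordRun(Runnable iteration) {
    long start = clockNanos.getAsLong();
    iteration.run(); // if this throws, no completion is recorded
    long end = clockNanos.getAsLong();
    lastDurationNanos.set(end - start);
    lastCompletedNanos.set(end);
    runsCompleted.incrementAndGet();
  }

  /** Number of iterations that completed without throwing. */
  long getRunsCompleted() {
    return runsCompleted.get();
  }

  /** Duration of the most recent completed iteration; useful for trending. */
  long getLastDurationNanos() {
    return lastDurationNanos.get();
  }

  /** Time since the last completed iteration; a growing value suggests a stuck or dead thread. */
  long getNanosSinceLastRun() {
    return clockNanos.getAsLong() - lastCompletedNanos.get();
  }
}
```

In a real process the recorder would be created with `System::nanoTime`, and an alert could fire when `getNanosSinceLastRun()` exceeds some multiple of the expected period.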
I thought a good place to do one of these was at the new consistency check thread. See https://github.com/apache/accumulo/pull/2583
Exposing metrics for critical process threads could improve monitoring and provide additional insight for performance trending.
For example, certain threads in the master and tserver processes need to run periodically. If they do not, this indicates that the process is likely unhealthy or having issues. Exposing the fact that the threads are running, progressing, or have successfully completed a task would improve monitoring capabilities. Additionally, if the "run-time" were provided, it could be used to gauge relative health by trending the performance over time and across upgrades.
This issue is to capture possible candidate threads / processes that would be beneficial to incorporate into metrics reporting.