apache / accumulo

Apache Accumulo
https://accumulo.apache.org
Apache License 2.0

Create an admin command that performs a comprehensive check for problems #4687

Open keith-turner opened 4 weeks ago

keith-turner commented 4 weeks ago

Accumulo has at least two admin commands that check for some problems. The accumulo admin checkTablets command checks for a few problems related to tablets. The accumulo admin fate command can look for table locks that reference a nonexistent fate operation. There may be other admin commands that look for problems.

Having to know which admin command to run is problematic if one does not know what problems may exist. These existing checks could be consolidated under a new command, such as accumulo admin check, that checks as many things as possible in Accumulo. This single command could then be run to find any problems that may exist, and would be useful to run periodically.

The following is a list of things this new command could check. Some of these checks already exist in different places in the Accumulo code base.

This command could potentially find a lot of problems. Some of these problems may have automated fixes, and the command could optionally apply them. The user would need a way to control what is automatically fixed, since they may not want to automatically fix every problem found.
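As a rough sketch of the idea above, a consolidated command could run a list of named checks and apply a fix only when the user has opted in for that check. Every name here (AdminCheckSketch, Check, Problem, runAll, fixAllowed, and the check names) is hypothetical and does not exist in Accumulo; this only illustrates one possible shape, not a proposed implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch: none of these types exist in Accumulo.
public class AdminCheckSketch {

  /** A problem found by a check. fix is null when no automated fix exists. */
  static final class Problem {
    final String description;
    final Runnable fix;

    Problem(String description, Runnable fix) {
      this.description = description;
      this.fix = fix;
    }
  }

  /** A named check, e.g. a wrapper around existing checkTablets logic. */
  static final class Check {
    final String name;
    final Supplier<List<Problem>> body;

    Check(String name, Supplier<List<Problem>> body) {
      this.name = name;
      this.body = body;
    }
  }

  /**
   * Runs every check. A fix is applied only when the problem has one and the
   * user listed the check's name in fixAllowed. Returns the problems that
   * were found but not fixed.
   */
  static List<Problem> runAll(List<Check> checks, Set<String> fixAllowed) {
    List<Problem> unfixed = new ArrayList<>();
    for (Check check : checks) {
      for (Problem problem : check.body.get()) {
        if (problem.fix != null && fixAllowed.contains(check.name)) {
          problem.fix.run();
        } else {
          unfixed.add(problem);
        }
      }
    }
    return unfixed;
  }

  public static void main(String[] args) {
    // Made-up checks that stand in for real ones.
    Check locks = new Check("tableLocks", () -> List.of(
        new Problem("lock references a missing fate operation",
            () -> System.out.println("removing dangling lock"))));
    Check tablets = new Check("tablets", () -> List.of(
        new Problem("tablet problem with no automated fix", null)));

    // The user opted in to automatic fixes for tableLocks only.
    List<Problem> unfixed = runAll(List.of(locks, tablets), Set.of("tableLocks"));
    for (Problem p : unfixed) {
      System.out.println("unfixed: " + p.description);
    }
  }
}
```

Keeping the opt-in per check name means a problem is never fixed unless the user asked for that category of fix, and anything left over is still reported.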

keith-turner commented 4 weeks ago

Opened this issue after working on #4686 and trying to determine where that new functionality should go. The functionality added in #4686 could have gone under the existing admin checkTablets or admin fate commands.

cshannon commented 3 weeks ago

This looks like it would be a really good feature to add. It would also be nice to optionally configure the checks to run on a schedule and generate a report, without always having to run the command manually. I'm not sure how you would report problems other than in logs (maybe in the monitor?), but that could be useful. Of course, another option is that someone could just configure the command to run periodically with cron or something similar, so we may not need to build in the automated running portion.

I could also see this command expanding with a few flags to make it more powerful such as being able to define which things to check, whether to automatically fix things, and probably plenty of other options that we would think of.

EdColeman commented 3 weeks ago

If run on a schedule, and similar to FATE, there could be some simple metrics; maybe error count and run time would be enough. A non-zero error count could then be used to prompt or alert someone to investigate further. The run time could allow for trending over time as a proxy for a performance measurement of the subsystems that are used in the checks.

Metrics for timing subsystem checks could add more fidelity if there are some discrete subsystems being used (time to scan the fate table, time to scan the metadata table, ...).
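To make the metrics idea concrete, here is a minimal sketch that records an error count, a total run time, and a run time per check. The class, the method names, and the check names are all hypothetical. It uses plain System.nanoTime rather than whatever metrics system Accumulo would actually publish through, and it assumes each check simply returns the number of errors it found.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntSupplier;

// Hypothetical sketch: not Accumulo code and not tied to a metrics library.
public class CheckMetricsSketch {

  /** Totals for one run of all checks. */
  static final class Report {
    int errorCount;
    long totalNanos;
    // Insertion-ordered so the output follows the order the checks ran in.
    final Map<String, Long> nanosPerCheck = new LinkedHashMap<>();
  }

  /** Runs each check, assumed to return its error count, and times it. */
  static Report runTimed(Map<String, IntSupplier> checks) {
    Report report = new Report();
    for (Map.Entry<String, IntSupplier> entry : checks.entrySet()) {
      long start = System.nanoTime();
      report.errorCount += entry.getValue().getAsInt();
      long elapsed = System.nanoTime() - start;
      report.nanosPerCheck.put(entry.getKey(), elapsed);
      report.totalNanos += elapsed;
    }
    return report;
  }

  public static void main(String[] args) {
    Map<String, IntSupplier> checks = new LinkedHashMap<>();
    checks.put("scanFateTable", () -> 0);     // made-up check, finds nothing
    checks.put("scanMetadataTable", () -> 2); // made-up check, finds two errors

    Report report = runTimed(checks);
    System.out.println("errors=" + report.errorCount
        + " totalNanos=" + report.totalNanos);
    report.nanosPerCheck.forEach(
        (name, nanos) -> System.out.println(name + " nanos=" + nanos));
  }
}
```

A non-zero errorCount is the value an alert would key on, while the per-check timings are what could be trended over time.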

keith-turner commented 3 weeks ago

> I could also see this command expanding with a few flags to make it more powerful such as being able to define which things to check

I suspect that would be needed because some checks will take a bit of time to run. If someone wants to run a specific check then they may not want to wait on everything to run.
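One way this could look, as a sketch only: the command takes optional check names as arguments, runs everything when none are given, and rejects names it does not know. The check names and the select method below are made up for illustration and are not existing Accumulo options.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: check names and argument handling are assumptions.
public class CheckSelectionSketch {

  /** Made-up names for the available checks, in the order they would run. */
  static final List<String> ALL_CHECKS = List.of("tablets", "tableLocks", "metadata");

  /**
   * Returns the checks to run. No arguments means every check; otherwise only
   * the requested ones, kept in the order defined by ALL_CHECKS.
   */
  static List<String> select(String[] requested) {
    if (requested.length == 0) {
      return ALL_CHECKS;
    }
    Set<String> wanted = new LinkedHashSet<>(Arrays.asList(requested));
    for (String name : wanted) {
      if (!ALL_CHECKS.contains(name)) {
        throw new IllegalArgumentException("Unknown check: " + name);
      }
    }
    List<String> selected = new ArrayList<>();
    for (String name : ALL_CHECKS) {
      if (wanted.contains(name)) {
        selected.add(name);
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    // With no arguments every check is selected.
    System.out.println("would run: " + select(args));
  }
}
```

Failing fast on an unknown name avoids the case where a typo silently skips the one check the user wanted.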