apache / accumulo

Apache Accumulo
https://accumulo.apache.org
Apache License 2.0

Unhosted user tablets prevent balancing of metadata table #4515

Closed keith-turner closed 2 months ago

keith-turner commented 4 months ago

Describe the bug

When starting Accumulo and assigning lots of tablets, if the metadata table is not initially balanced then it will be prevented from balancing. This can leave the metadata tablets on far fewer tablet servers than possible, at a time when the system is busy assigning and loading user tablets, which generates a lot of load on the metadata table.

Overall this seems to be caused by the fact that the code related to balancing does not consider the different levels of Accumulo.

#4475 is related to this larger problem. The specific code that prevents balancing is here, but fixing this issue would be a larger change than just that code.

This problem is present in 2.1 and later.

Expected behavior

The manager can balance the metadata table independently of what is going on with user tables.
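To make the expected behavior concrete, here is a minimal sketch that contrasts a single global "any unhosted tablets?" gate with a per-level gate. It is a simplified model of the behavior described in this issue, not Accumulo's actual balancing code: the `Level` enum, `globalGate`, and `perLevelGate` are hypothetical names invented for illustration, and the assumption that a level may balance once it and the levels it depends on are fully hosted is one possible policy, not a confirmed design.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

class LevelAwareBalancing {

  // Hypothetical stand-in for Accumulo's data levels, ordered by dependency:
  // user tablets depend on the metadata table, which depends on the root table.
  enum Level { ROOT, METADATA, USER }

  // Global gate (models the reported problem): an unhosted tablet at any
  // level blocks balancing at every level.
  static Set<Level> globalGate(Map<Level, Integer> unhosted) {
    boolean anyUnhosted = unhosted.values().stream().anyMatch(n -> n > 0);
    return anyUnhosted ? EnumSet.noneOf(Level.class) : EnumSet.allOf(Level.class);
  }

  // Per-level gate (assumed policy): a level may balance once it and every
  // level it depends on have no unhosted tablets.
  static Set<Level> perLevelGate(Map<Level, Integer> unhosted) {
    Set<Level> balanceable = EnumSet.noneOf(Level.class);
    for (Level level : Level.values()) {
      if (unhosted.getOrDefault(level, 0) > 0) {
        break; // this level and the levels that depend on it must wait
      }
      balanceable.add(level);
    }
    return balanceable;
  }

  public static void main(String[] args) {
    // Startup scenario from this issue: system tables hosted, many user tablets pending.
    Map<Level, Integer> unhosted = new EnumMap<>(Level.class);
    unhosted.put(Level.ROOT, 0);
    unhosted.put(Level.METADATA, 0);
    unhosted.put(Level.USER, 5000);

    System.out.println("global gate:    " + globalGate(unhosted));   // prints []
    System.out.println("per-level gate: " + perLevelGate(unhosted)); // prints [ROOT, METADATA]
  }
}
```

With the global gate, the 5000 pending user tablets block balancing everywhere; with the per-level gate, the root and metadata levels remain eligible while user tablets are still being assigned.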

EdColeman commented 4 months ago

Is there any chance that you started the manager before the tservers (or at least most of them)? If you start the tservers first, they will sit there waiting for assignments. When the manager starts, it will assign the metadata table before user tables and usually seemed to get distributed as it was on shutdown.

If you start the manager first, then as the tservers start, the manager will immediately begin assignments as soon as it sees the first tserver. This usually ends up with the metadata and a large number of tablets assigned to one (or very few) tservers - the rebalancing then will take a long time before things get back to normal.

There is a property, MANAGER_STARTUP_TSERVER_AVAIL_MIN_COUNT, that makes the manager wait for N tservers before assignments start, which can mitigate this.

Balancing system tables separately would be a good feature, but there may be procedural things that can be done without code changes that help mitigate the issue from occurring.
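The startup-ordering effect described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the manager's real startup logic: `StartupAssignmentGate`, `mayAssign`, and `tserversAtFirstAssignment` are hypothetical names, and treating a minimum count of 0 as "no minimum configured" is an assumption of the sketch.

```java
class StartupAssignmentGate {

  // Hypothetical gate: assignments may start once at least minCount tservers
  // are live. A minCount of 0 or less is treated as "no minimum configured",
  // so a single live tserver is enough (assumption of this sketch).
  static boolean mayAssign(int liveTservers, int minCount) {
    return liveTservers >= Math.max(1, minCount);
  }

  // Simulates tservers registering one at a time after the manager is up and
  // returns how many are live when assignments first begin.
  static int tserversAtFirstAssignment(int totalTservers, int minCount) {
    for (int live = 1; live <= totalTservers; live++) {
      if (mayAssign(live, minCount)) {
        return live;
      }
    }
    return -1; // the threshold is never reached with this many tservers
  }

  public static void main(String[] args) {
    // No minimum: assignment starts when only one tserver is available.
    System.out.println(tserversAtFirstAssignment(10, 0)); // prints 1
    // Minimum of 8: assignment waits until 8 tservers can share the tablets.
    System.out.println(tserversAtFirstAssignment(10, 8)); // prints 8
  }
}
```

In this model, the first case corresponds to the manager piling tablets onto the first tserver it sees, and the second to waiting for N tservers so the initial assignment is spread out.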

keith-turner commented 4 months ago

> Is there any chance that you started the manager before the tservers (or at least most of them)?

Started all tablet servers first.