apache / accumulo

Apache Accumulo
https://accumulo.apache.org
Apache License 2.0
1.07k stars 445 forks source link

Investigate removing table locks for FATE ops #3823

Open keith-turner opened 1 year ago

keith-turner commented 1 year ago

With the changes to move to conditional mutations (start in #3251) to coordinate changes to tablets, we may be able to remove the tables locks in FATE. This would allow concurrent operations to run on different parts of a table. For example two merge operations could run on two different ranges of the same table. For merge the operations would need to detect overlapping claims on tablets and a have a mechanism to resolve that. Could have the operations with highest operation id relinquish its claim on a tablet and let the other operation claim it.

This would simplify one aspect of FATE operations as the conditional mutations adds complexity to another part of the FATE operation. Would still want to use FATE to drive operations across manager restarts.

keith-turner commented 1 year ago

3412 is related

keith-turner commented 1 year ago

3763 is a prerequisite for this change

keith-turner commented 1 year ago

The interaction of Merge with other operations would need the most attention if removing table locks. Currently the table locks prevents other operations from running concurrently with merge. The following is an initial pass of thinking through problems with table operations and merge.

User compactions

User compactions create two tablet metadata columns, one for running compactions and one for completed compactions. Without table locks, merge would need to handle these in some way. These columns can not be copied during a merge operation. Would conceptually want merge to wait on compaction or cancel them. A nice behavior would be if merge cancels running compactions and lets existing completed compaction markers be processed before starting to merge.

The above would make merge wait on completed compactions to be acknowledged without waiting on running compactions or allowing new compactions to start until after the merge is complete.

Bulk import

Bulk import adds loaded markers to tablets. During merge these loaded markers would need to be handled. Currently bulk import does the following.

  1. Checks that its load mapping aligns with actual split points in the table and if not throws a concurrent merge exception. The checks for merge that happened between file inspection to create the load mapping and acquiring the table lock.
  2. Starts loading files into tablets with the assumption that a concurrent merge will not happen because of the table lock. Write load columns to individual tablets.
  3. After all tablets are loaded, deletes the loaded mappings from each tablet.

The reason step 1 is done that files need to be imported into the exact tablets specified. However with the ability of files to now have ranges, it may be that if the load mapping does not align with the tables splits that we could add ranges to the bulk loaded files to handle this situation. The merge operation could also adjust these ranges on loaded mappings like it does for file mappings, this would allow bulk import to detect ranges that are not completely covered.

Adding ranges to the load mapping may also remove the need to ever throw concurrent merge exception in the bulk API, as concurrent merges could now be handled. So maybe this check could be modified or removed.

Split tablets

Split and merge both set operation ids on tablets, so this case is easy. They can wait on each via the operation id.

Merge

Concurrent merge operations that run on the same table with overlapping ranges could deadlock if acquiring operation ids is not done with care. If merge operations observer other merge operations in the metadata table, one of them could remove its operation ids to let the other proceed. Could be the merge operation with the highest fate tx id. Relinquishing opids would only need to be done when a merge operation is in the phase of acquiring operation ids on all of its tablets. Once a merge operation is past the acquisition phase, it can just complete and no longer needs to worry about relinquishing.

Clone

The clone code makes repeated passes over the source and destination metadata until convergence is confirmed. This clone code is currently split tolerant, would need to make this code also merge tolerant.

Delete table

If delete table acquires operation ids on a tablet, then it will mostly be fine with concurrent merges. However when a merge operation is in its acquisition phase, it should relinquish any operation ids it has obtained if it sees delete table operation ids.

Offline table

Currently taking a table offline would wait for any running merge to complete and after taking the table offline would prevent any merge from starting. Without table locks this would be hard to achieve without placing information in each tablet for an offline table. Maybe tablets could have a state of enabled or disabled (using terminology from #3860). This per tablet state could have the following properties.

Maybe the APIs for online/offline of a table goes away and turns conceptually into a concept of enabling and disabling ranges of a table.

Export table

Exporting a t table currently depends on the table being offline, so the conflict of offline with merge should consider export.

keith-turner commented 1 year ago

To address the problem outlined in the previous comment between merge and offlline table, could replace offline/online of table with per tablet states. Could have the following tablet states. Tablet states would generalize and replace the existing tablet operation id column. It would offer the same functionality that operation id currently offers, plus more with the concept of enabled and disabled.

Tablet state Description
NEW While a table is in the process of being created its tablets would have this state.
ENABLED A tablet in this state is available for read,write, compaction, split, merge, etc.
DISABLED A tablet in this state is only available for clone, delete, clone, and export table. All other operations should fail
SPLITTING A tablet in this state is currently splitting and is not available for other operations
MERGING A tablet in this state is currently merging and is not available for other operations
DELETING When a tablet is in this state, its table is in process of being deleted and its not available for any operations

Below is a table of valid tablet state transitions.

From to NEW to ENABLED to DISABLED to SPLITTING to MERGING to DELETING
NEW
ENABLED
DISABLED
SPLITTING
MERGING
DELETING

The table operations offline() and online() could possibly replaced with enable(RowRange range) and disable(RowRange range). The impl of offline() and online() could call enable(RowRange range) and disable(RowRange range) with infinite ranges.

keith-turner commented 1 year ago

Trying to work through the possibility of allowing bulk import and merge to run concurrently and it seems like it would be complex and not worth the effort. For example consider the situation where a bulk import wants to load file F1 into tablets T1, T2, and T3. Also concurrently a merge wants to merge tablets T1,T2, and T3.

Since the merge operation set an opid on T2, the bulk operation can not load into it. If the merge operation were to wait on the loaded markers on T1 and T2, then deadlock would happen. If the merge operation were to continue and set operation ids on T1,T3 and then proceed with the merge, then it would need to merge the loaded markers. If T1,T2,T3 were merged into tablet T4, then T4 would need load markers with ranges so that the bulk import code can reason about the missing T2 range that is needed to complete the bulk import.

Adding ranges to the load markers introduces a lot of complexity into the bulk import code when there is not a strong need for it. It would be easier to somehow prevent bulk import and merge from running concurrently. Both are quick metadata operations (unlike compactions) so there is not performance penalty to preventing them from running concurrently. Without table locks, a possible way to do this is to add a new metadata column that prevents merges from starting. Bulk import could then do the following.

If the merge block and merge opid are both observed, then one operation needs to back out its changes and wait to let the other operation start.

keith-turner commented 1 year ago

For concurrent merge and bulk import, thought of the following strategy that I like.

With the above strategy bulk import and merge will not run concurrently. Also bulk import does not have to take any extra step, only the merge code. So since bulk import will run much more frequently than merge, its good to not add extra steps to bulk import to handle a rare condition. One downside with this strategy is that its possible that merge could never run if there are continuous bulk imports.

keith-turner commented 1 year ago

Met with @dlmarion, @cshannon, @ddanielr , and @EdColeman to discuss this issue, below are notes from the meeting for thing I can remember. Please add more comments if there are things I missed. Meeting info here

With table locks the current behavior is that a merge operation will wait on a compaction or bulk import. Consensus was it would be good to maintain this behavior.

With the current table locks, a merge can prevent compactions and bulk imports from running even the ranges are disjoint. Like in the the following situation.

  1. Really long running user compaction is started on range (a,m]
  2. Merge is initiated on range (o,p], however it can not start because of the compaction. Places a write lock entry in the zookeeper table lock queue.
  3. A bulk import for range (x,z] is initiated, but it blocks when is sees there is a write lock queued for the above merge.

So in the situation above a bulk import is blocked by a merge that is blocked by a long running user compaction. Since these ranges are disjoint, if locking were moved to the tablet level all of the operations could proceed.

Adding a tablet state of disabled would be useful for manual metadata edits. We discussed how this aligns with the tablet hosting goal. If a tablet has a hosting goal of never, then it will not be hosted but in elasticity compactions, splits, bulk import, etc could still change the tablet metadata.

For exporting a table, potentially having a partially disabled tablet is kinda confusing. The export table operation could verify that all tablets are in the disabled state when it starts, but there is no way to prevent this from changing later. Discussed have another tablet state of readonly or snapshot (snapshot is a terminal read only state). So tablet export could require all the tablet to be in the snapshot state before it would start.

Discussed where the valid state transitions for tablet might exists in the code. This could exists in ample, where it checks the transitions before its made.

If write locks were removed and fate storage was moved to a zookeeper table, then bulk imports may have zero zookeeper writes which may help with scalability.