apple / foundationdb

FoundationDB - the open source, distributed, transactional key-value store
https://apple.github.io/foundationdb/
Apache License 2.0

Lock for key ranges instead of entire DB #3337

Open xumengpanda opened 4 years ago

xumengpanda commented 4 years ago

Can we lock a key range in a DB while still allowing clients to operate on the other key ranges?

Two scenarios can benefit from this:

1. If data corruption happens in a small key range, say [b, c), the cluster can still serve traffic in the rest of the key space while we restore the data in [b, c).

2. If someone wants to move data in a small key range, say [b, c), from a live cluster A to another live cluster B, we can lock only that key range while still allowing both clusters to serve traffic to the rest of the key space.

The implementation is similar to Issue https://github.com/apple/foundationdb/issues/1044, except that we don't have to distinguish read and write lock.
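The core mechanism both scenarios rely on is a set of locked half-open key ranges and a membership check against it. A minimal sketch in Python (illustrative data structure only, not FoundationDB code; all names are hypothetical):

```python
from bisect import bisect_right

class LockedRanges:
    """Sorted, non-overlapping set of locked half-open key ranges [begin, end)."""

    def __init__(self):
        self._ranges = []  # sorted list of (begin, end) tuples

    def lock(self, begin, end):
        # Absorb any existing range that overlaps or touches [begin, end),
        # then re-insert the merged result to keep the list non-overlapping.
        kept = []
        for b, e in self._ranges:
            if e < begin or end < b:      # disjoint: keep as-is
                kept.append((b, e))
            else:                         # overlapping/adjacent: merge
                begin, end = min(begin, b), max(end, e)
        kept.append((begin, end))
        self._ranges = sorted(kept)

    def is_locked(self, key):
        # Locate the last range starting at or before `key`; the key is
        # locked iff it falls before that range's (exclusive) end.
        begins = [b for b, _ in self._ranges]
        i = bisect_right(begins, key)
        return i > 0 and key < self._ranges[i - 1][1]
```

With [b, c) locked, `is_locked("bx")` is true while keys in [a, b) and [c, ∞) remain writable, which is exactly the partial-availability behavior described above.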

jzhou77 commented 4 years ago

We are debating two designs:

  1. Client + Proxy locking. All proxies keep the locked ranges in memory and in sync, so that for every transaction, if it reads or modifies keys in these locked ranges, the transaction is marked as failed. To reduce the work on the proxies, each client caches the set of locked ranges so that each client transaction can be checked locally without sending transaction data over. The cached ranges can be invalidated by a version (the mechanism is similar to the metadata version key) piggybacked on the GRV response. If the version is obsolete, the client needs to retrieve the locked ranges from the proxies.

    • The advantage of this design is its simplicity and low overhead on the cluster, since the checking is done at the client side.
    • The disadvantage is that all clients need to fetch the locked ranges back in the GRV response. We'll need to put a limit on the total number of allowed locked ranges, as well as on the frequency of changing them, to reduce the overhead of sending many large GRV responses back. Another drawback is that checking on the client side is potentially unsafe due to memory corruption, and a malicious client can always bypass the check, but the risk is low (as we write the client library).
    • In the case that a client caches a read version and reuses it for a later transaction: if the read version is less than the commit version of a later lock transaction, the client can proceed to commit. When a proxy receives the transaction, the proxy needs to verify that the client has the latest cache. In this case the client has a stale cache, so the proxy needs to tell the client to retry with the latest locked ranges.
  2. Resolver + StorageServer locking. The locked ranges are kept in all resolvers so that for every transaction, if its mutations change keys in these locked ranges, the transaction is marked as failed. For read-only transactions, locking in the resolvers alone is not enough: the locked ranges must also be present on all storage servers so that reads for keys within these ranges can be failed.

    • The advantage is that the checking is done on the cluster side, which is safe and requires no client-side changes.
    • The disadvantage is a more complex design. Since ranges can be moved among storage servers, the information on the locked ranges needs to be kept on all storage servers. We need to either keep a whole copy of the locked ranges on all storage servers, or update each storage server's locked ranges individually. Keeping a whole copy limits the total number of ranges (since the information is kept in storage servers' memory and is broadcast), and a broadcast mechanism is needed to synchronize the locked ranges from proxies to all resolvers and storage servers (maybe by adding all tags to the mutation). Updating each storage server's locked ranges is also very tricky, because the DataDistributor can split or merge shards across different storage teams, thus requiring the locks to be moved along with the moved ranges. Since the locks are probably stored in the system key space and the moved ranges are in the normal key space, MoveKeys must update these two places atomically.
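Design 1's cache-invalidation scheme can be sketched as follows: the client stores the locked ranges alongside the "lock version" at which it fetched them, every GRV response carries the cluster's current lock version, and a mismatch forces a refresh before the local check is trusted. This is an illustrative Python sketch; `fetch_locked_ranges` and all other names are hypothetical, not FoundationDB API:

```python
class LockCache:
    """Client-side cache of locked ranges, invalidated by a version number."""

    def __init__(self, fetch_locked_ranges):
        # fetch_locked_ranges() simulates asking a proxy; it returns
        # (lock_version, [(begin, end), ...]) with half-open ranges.
        self._fetch = fetch_locked_ranges
        self._version = -1
        self._ranges = []

    def on_grv_response(self, cluster_lock_version):
        # Called with the lock version piggybacked on each GRV response.
        # A stale version means the cached ranges can no longer be trusted.
        if cluster_lock_version != self._version:
            self._version, self._ranges = self._fetch()

    def may_touch(self, key):
        # True if a transaction may read/write `key` per the cached ranges.
        return not any(b <= key < e for b, e in self._ranges)
```

A client would call `on_grv_response` after each get-read-version request and then run `may_touch` locally for every key it reads or writes, which keeps the per-transaction check off the proxies as the design intends.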

The contract we want to maintain is: for a lock transaction T_L with a commit version C

jzhou77 commented 4 years ago

TODO lists after #3856 is merged: