DBTransaction with Pessimistic Locking & Optimistic Locking and Deadlock Detection

Specification

The DBTransaction currently does not have any locking integrated. It is only a read-committed isolation-level transaction.

On top of this, users have to know to use iterators and potentially multiple iterators to access the snapshot guarantee of leveldb, this is essential when iterating over one sublevel, and needing to access properties of another sublevel in a consistent way.

Right now users of this transaction is expected to use their own locks in order to prevent these additional phenomena:

Repeatable read
Phantom reads
Lost-updates

Most of this comes down to locking a particular key that is being used, thus blocking other "threads" from starting transactions on those keys.

Key locking is required in these circumstances:

Atomically reading multiple keys for a consistent "composite" read.
Atomically writing multiple keys for a consistent "composite" write.
Reading from a key, and then writing to a key a value that is derived from the read (like the counter problem)
Atomically creating a new key-value, such that all operations also creating the same new key value coalesce and accept that that it has been created (this requires all creators to lock on the same "new" key)

Users are therefore doing something like this:

class SomeDomain {
  async someMethod(tran?: DBtransaction) {
    if (tran == null) {
      await withF([
        lockBox.lock(key1, key2), 
        db.transaction()
      ], async ([, tran]) => {
        return this.someMethod(tran);
      });
    }
    /*...*/
  }
}

Notice how if the transaction is passed in to SomeDomain.someMethod, then it doesn't bother creating its own transaction, but also doesn't bother locking key1 and key2.

The problem with this pattern is that within a complex call graph, each higher-level call has to remember, or know what locks need to be locked before calling an transactional operation of SomeDomain.someMethod. As the hierarchy of the callgraph expands, this requirement to remember the locking context grows exponentially, and will make our programs too difficult and complex to debug.

There are 2 solutions to this:

Pessimistic Concurrency Control (PCC)
- uses locks
- requires a deadlock detector (otherwise you may introduce deadlocks)
- locks should be locked in the same-order, horizontally within a call, and vertically across a callgraph
- transactions can be tried automatically when deadlock is detected
Optimistic Concurrency Control (OCC)
- does not use locks
- requires snapshot guarantees
- also referred to as "snapshot isolation" or "software transactional memory"
- transactions may be retried when guarantee is not consistent, but this depends on the caller's discretion

The tradeoffs between the 2 approaches are summarised here: https://agirlamonggeeks.com/2017/02/23/optimistic-concurrency-vs-pessimistic-concurrency-short-comparison/

Big database software often combine these ideas together into their transaction system, and allow the user to configure their transactions for their application needs.

A quick and dirty solution for ourselves will follow more along how RocksDB implemented their transactions: https://www.sobyte.net/post/2022-01/rocksdb-tx/. And details here: https://github.com/MatrixAI/js-polykey/issues/294#issuecomment-1106081466.

Pessimistic Concurrency Control

I'm most familiar with pessimistic concurrency control, and we've currently designed many of our systems in PK to follow along. I'm curious whether OCC might be easier to apply to our PK programs, but we would need to have both transaction systems to test.

In terms of implementign PCC, we would need these things:

Integrate the LockBox into DBTransaction
The LockBox would need to be augmented to detect deadlocks and manage re-entrant locks
Re-entrant locking means that multiple calls to lock key1 within the same transaction will all succeed. This doesn't mean that key1 is a semaphore, just that if it's already locked in the transaction, then this is fine to proceed.
Deadlock detection by ensuring that all locking calls always have a timeout, when timedout, it must then check a lock metadata table for transactions that were holding the locks that this transaction needed, and throw an exception regarding this.
Lock upgrades between read and write locks should also be considered, this means that if earlier call read-locked key1, a subsequent call can write-lock key1 (but must take precedence over other blocked readers & writers), and subsequent calls to write-lock key1 will also succeed. Lock downgrades will not be allowed.
After receiving a deadlock exception, this should bubble up to the transaction creator (or the unit of atomic operation designated as the request handler of the application to automatically retry)
All deadlocks detected are a programmer bug, but retrying should enable users to continue work. Therefore we may not do automatic retries and expect users to report the deadlock bug, and retry on their own discretion

As for optimistic transactions, we would do something possibly alot simpler: https://github.com/MatrixAI/js-polykey/issues/294#issuecomment-1106072194

Now there is already existing code that relies on how the db transactions work, namely the EncryptedFS.

Any updates to the DBTransaction should be backwards compatible. So that EncryptedFS can continue functioning as normal using its own locking system.

Therefore pessimistic and optimistic must be an opt-in.

For pessimistic, this may just mean adding some additional methods to the DBTransaction that ends up locking certain keys.

Optimistic Concurrency Control

For optimistic, this can just be an additional option parameter to the db.transaction({ optimistic: true }) that makes it an optimistic transaction.

Because OCC transactions are meant to rely on the snapshot, this means every get call must read from the iterator. Because this can range over the entire DB, the get call must be done on the root of the DB.

But right now iterator also creates their own snapshot. It will be necessary that every iterator call is iterating from the same snapshot that was created at the beginning.

Right now this means users must start their iterators at the beginning of their transaction if they were to do that.

This might mean we need to change our "virtual iterator" in DBTransaction to seek to snapshot iterator and acquire the relevant value there. We would need to maintain separate cursors for each iterator, and ensure mutual exclusion on the snapshot iterator.

When using optimistic transactions, this means every transaction creates a snapshot. During low-concurrency states, this is not that bad, and I believe leveldb does some sort of COW. So it's not a full copy. During high-concurrency states, this means increased storage/memory usage for all the concurrent snapshots. It is very likely that transactional contexts are only created at the GRPC handler level, and quite likely we would have a low-concurrency state for majority of the time for each Polykey node.

Based on these ideas, it seems OCC should be less work to do then PCC.

Additional context

https://github.com/MatrixAI/js-polykey/issues/294 - this is the overall issue tackling how concurrency works in PK, note that even if OCC was supported by DBTransaction, the usage of locks and LockBox will still apply in other areas that may only be interacting with in-memory state
https://github.com/MatrixAI/js-encryptedfs/pull/63 - the recent massive integration of the new "read committed" DBTransaction in to EFS, and its usage of LockBox
https://www.sobyte.net/post/2022-01/rocksdb-tx/ - rocksdb implementation of pessimistic and optimistic concurrency
https://www.mianshigee.com/tutorial/rocksdb-en/b603e47dd8805bbf.md - further details on rocksdb implementation, their default transaction is pessimistic and auto-locks every key that gets written to
4 - original issue discussing how snapshot isolation should be brought into DB, but we didn't fully understand its requirements or implications
https://ongardie.net/blog/node-leveldb-transactions/ - original attempt at implementing snapshot-isolation (optimistic) transactions in leveldb
http://ithare.com/databases-101-acid-mvcc-vs-locks-transaction-isolation-levels-and-concurrency/
http://www.tiernok.com/posts/adding-index-for-a-key-value-store/

Tasks

[ ] - Implement Snapshot Isolation and integrating it into DBTransaction - this should be enough to enable PK integration
[ ] - Enable advisory locking via DBTransaction.lock() call
[ ] - Ensure that DBTransaction.lock calls takes care or sorting, timeouts and re-entrancy
[ ] - Implement deadlock detection

Upon reading this http://ithare.com/databases-101-acid-mvcc-vs-locks-transaction-isolation-levels-and-concurrency/ this has clarified a few ideas that has been confusing.

The isolation levels mentioned before Read Committed, Read Uncommitted, Repeatable Read, Serializable were ideas that were first developed with "lock-based DBs", that is DBs that has no concept of snapshots and just use locks to manage concurrency.
Databases later developed "MVCC" which makes use of snapshots, this enabled "snapshot isolation" and optimistic concurrency control techniques. However DBs often retained the terminology of the above isolation levels, and the different concurrency phenomena (phantom reads, dirty reads, lost updates.. etc).
Implementations of MVCC seem to prefer the optimistic concurrency control, but you can always "reintroduce locking" into your optimistic transactions by using SELECT FOR UPDATE or SELECT FOR SHARE. So those features have always existed.
MVCC is easier to program with, you don't need to manage locks unless you really need it, and that's why FOR UPDATE and FOR SHARE still exist.
LevelDB does actually in fact implement MVCC, how ever it was not exposed to the end user, except with iterators. You cannot directly access the snapshot in LevelDB atm without iterators.
We may be able to turn our DBTransaction into an optimistic transaction by default, but also provide locking methods in case locks are necessary for special cases (such as when high concurrent pressure exists). This shouldn't affect backwards compatibility for EncryptedFS... but it will require some testing.

Re-reading:

This seems to mean that snapshot isolation should be easy to do.

However a new development was serializable snapshot isolation. This prevents "write skew" anomalies that exists in snapshot isolation, that currently can only be addressed by introducing locking.

I don't fully understand how to implement serializable snapshot isolation SSI, so I'll leave that for later. It seems SI is sufficient for most cases, and is how most DBs are still defaulting to.

Note, that if we had the ability to just use rocksdb, this stuff would mostly all be available for free. While leveldb does support rocksdb as the backing database https://github.com/Level/level-rocksdb, it doesn't expose any of the rocksdb transaction features discussed in https://www.mianshigee.com/tutorial/rocksdb-en/b603e47dd8805bbf.md which is unfortunate! I'm thinking that it will take more time to work out the C++ code and integration into nodejs right now (although we will have to do this soon). So I'll stick to attempting to implement this in JS/TS.

Note that we currently maintain written-data in our transaction data path. We used to call this the "snapshot". But this technically incorrect term. The snapshot is supposed to be the point-in-time state of the DB when the transaction starts. Additionally our implementation then implements COW on top of this.

This means that our transaction data path is where we are "buffering" our writes, and they are overlaid on top of the underlying DB.

In snapshot isolation, rather than overlaying on top of the underlying DB, they overlay on top of a "snapshot" of the underlying DB that was first created when the transaction started. This is only accessible via the LevelDB.iterator method which we have wrapped as DB._iterator(). Note that DB.iterator starts at the data sublevel while DB._iterator starts at the true root level.

Because DB._iterator() is not asynchronous, this can be setup during the constructor. But we have flexibility here in case we need to do asynchronous operations, either in DB.transaction()'s ResourceAcquire, or DBTransaction.createTransaction.

For detecting a write conflict at the end, I was originally under the impression that we only need to check whether the key-set being written to has a different set of values compared to transaction's snapshot.

However most of the literature on snapshot isolation instead talks about timestamps. Meaning that even if another transaction updated the value to the same value, this would still cause a write conflict.

I'm not sure why it doesn't just check if the values are different, and but maybe there's some edgecase where checking for value difference would not be enough to ensure consistency.

I've asked a question about this here https://stackoverflow.com/questions/72084071/can-the-validation-phase-of-transactions-in-software-transactional-memory-stm.

My guess atm is that it's faster to compare timestamps than to compare for value equality. The value might be quite large, but timestamp comparison is much easier. Also I believe these timestamps will be logical timestamps, not real-time timestamps. These logical timestamps would have to be strictly monotonic! We can do this just with a sequence counter starting at 0.

I want to note that our write buffer is placed in the DB (disk) itself and is therefore unbounded by memory. However our write operations are buffered in-memory and not in the DB. You might wonder why not push all write operations also into the DB transaction data path? Well because at the end you still end up reading them all in-memory to perform the write batch, so there's no advantage with putting the operations onto disk. Therefore no matter what our transaction sizes are still bounded by memory.

This does impact the scalability of large atomic operations such as those in EFS: https://github.com/MatrixAI/js-encryptedfs/issues/53.

https://www.sobyte.net/post/2022-01/rocksdb-tx/ states that RocksDB does not support SSI (serializable snapshot isolation). It only supports SI which is what we are heading towards to.

During the validation phase, it says:

rocksdb does not implement SSI, only conflict tracking for write operations, the process will be simpler than badger’s SSI, only need to check one thing, that is, at commit time, each Key in the db is not updated after the sequence number of the start of the current transaction exists.

The dumbest way to do this is to read the db once for each Key, get its most recent sequence number, and compare it to the transaction’s sequence number. If the sequence number in the db is larger, then sorry, there is a transaction conflict.

rocksdb will definitely not do something like reading IO N times for each Key in a transaction. Here is an idea, the execution time of the transaction is not long, in the conflict check scenario, you do not need to find all the historical data of the Key to make a judgment, but only the most recent data. The most recent data is MemTable, so to do the recent conflict detection, it is enough to read the data of MemTable, and there is no need to execute IO.

So rocksdb doesn't just reads the timestamp for every key that has been written, it has some optimisation making use MemTable. We don't really have anything similar to that, unless except the fact that we buffer up write operations in-memory, so if we exposed our writes with some queryable interface, this may be possible.

If we keep it simple and just read the timestamp for each key, for a large transaction, that would be alot of reads. And because we can only access this via the iterator, this means seeking our transaction iterator and performing reads off that to get the timestamp metadata. However our iterator has a "read ahead cache" controlled by the highWaterMarkBytes option. See: https://github.com/Level/classic-level#about-high-water. This doesn't exist on leveldown atm. The fillCache option enables the read cache which can also help. But this performance tuning we can worry about later (specifically after #11 is done).

The implementation of this can be separated into 2 phases:

The integration of the iterator as a snapshot to be used instead of the underlying DB. This means the transaction data path is used as the overlay on top of the snapshot. This impacts the DBTransaction.get and DBTransaction.iterator implementation.
The implementation of the commit write-set conflict detection. This may involve using timestamps and we need to figure out how the last updated timestamp is maintained including whether last updated timestamps are set as soon as an update is made (updated on put or when transaction is committed).

For the first phase:

The transaction will have a snapshot: DBIterator property.
DBTransaction get has to use the tran iterator. It first has to read the value from its own data overlay, then read it from the iterator.
Iterator get has to be a wrapper. Wrapper uses lock to mutual exclude operations. It must copy the original key, seek to the get key, then await the next, compare the returned key, if matching the get key return the value, if not return undefined, seek back to the original key. Seeking back to the original key may not be necessary if derived iterators always maintain their own key.
Alternatively it can use a new method called protected snapshotGet that acquires the data from the transaction iterator, and can ensure mutual-exclusion with a lock.
Delete doesn't have to use the main iterator, it uses the tombstone as usual.
Put doesn't have to use the main iterator, it just uses the COW data overlay.
As for DBTransaction.iterator, right now it creates a virtual iterator that iterates over cow overlay and DB simultaneously. It mow must iterate over cow overlay and transaction iterator simultaneously. It may be enough to simply re-use the transaction iterator. However creation and destruction are noops for transaction iterator. However because 2 iterators or more may be created in the same transaction, it is necessary to keep reference to the seeked key. This is always used before a next() call is made.
Consider that 2 derived iterators gets used and in the middle it is also calling get() in the iterators. This means on each next iteration and each get call there is mutual exclusion on the transaction iterator. At each step the the tran iterator must be seeked to the relevant key and response checked. If all derived iterators maintain their seek key, then tran iterator doesnt need to have its seek key restored. Effectively, derived iterators end up calling snapshotGet on the tran iterator.
Actually it may be a bit more difficult than that. The next call advances the the cursor. So when we are calling next on the tran iterator, we want to know where it advanced to. Seek sets the cursor, but next advances beyond it.

For the second phase:

The DB would need the last updated counter. But does this always increment or is it a counter that resets whenever there no concurrent transactions? Or does it matter?

Based on the discussion here https://github.com/MatrixAI/js-db/commit/6107d9c3ac55767113034bcedd19b379a5181a1d#commitcomment-73269633, the lmdb-js project exposes transactions that is combination of snapshot isolation for reads, but PCC serialisable locking for writes. It does not however have OCC snapshot isolation atm because it doesn't do write-set conflict detection.

I've started using rocksdb directly now to implement SI.

Some notes about it.

Rocksdb doesn't have SSI support, but there are projects built on top of rocksdb that does provide SSI.
SSI solves the write skew problem, a work around for solving the write skew problem is called "promotion", this is basically the usage of GetForUpdate call which does something similar to SELECT ... FOR UPDATE and it even SELECT ... FOR SHARE, it even has an exclusive boolean parameter that is by default true that I haven't tried to see what it means yet.
I've tested this promotion in tests/rocksdb/rocksdbP.test.ts and it works quite nicely.
Rocksdb offers OCC or PCC. But you cannot use both at the same time. This is different from the idea proposed above, where it should be possible to have a default layer of OCC, and opt-in PCC with locking. Since we are using OCC by default, we're using OptimisticTransactionDB in rocksdb, and we will need to do opt-in PCC locking at the JS-level and not rely on any of the locking features in rocksdb. It is likely more performant to do this anyway and not have to incur the JS-C call overhead of dealing with locks as JS async locks is just promise ordering.
RocksDB's optimistic transactions exposes a very flexible manipulation of snapshots that I didn't consider above:
- Snapshots can be set at the beginning of the transaction, this only affects the Commit() call, it does not affect the Get calls
- Without setting a snapshot at the beginning, the Commit() succeeds only when keys have not been written to after the first read within the transaction, this is similar to how SQL databases have worked. When set at the beginning it's instead based on when the time when the transaction was started.
- We can make this simple by always setting a snapshot in the beginning, and always using that snapshot for all reads, thus enabling both consistent reads, repeatable reads, and consistent iteration in one fell swoop.
- Setting the snapshot later, perhaps upon the first operation (think of this as lazy snapshot setting) might be more efficient, this is because one might start a transaction, have lots of time pass, and then start to read and write keys.
- I'm not sure, but perhaps snapshot should only be set upon the first read operation. But this might be premature optimisation.
- To allow flexibility in figuring out what we want to do, let's just expose snapshots themselves to the end user in JS, so we can manipulate it in JS.

The only optimisation for creating a snapshot upon registering the first lazy operation is just reducing the number of needless transaction conflicts. So if both T1 and T2 started, and you know they will conflict if they ran at the same time, but if T1 was sufficiently delayed in real time in starting their read/write operations, then it's better to delay snapshot creation to when the first operation is done.

If we can expose snapshots as first class types into JS although opaque, then this should be wieldable in JS.

Turns out RocksDB even has an operation to do this exact lazy setting of snapshot. It's called SetSnapshotOnNextOperation().

I believe we should have this set by default. However I realised that the docs don't say that it also set before Get() is called.

And if I want consistent reads, I'd need to get access to that snapshot and then set that as the option when performing the read.

So although it seems useful, I'm not sure if it's actually useful, since I'm not sure if it would be called upon the next Get() and secondly, I'd want to have that snapshot apply to the Get() or GetForUpdate() call.

I'd like to also point out that there's a ClearSnapshot() call available on the transaction, but that's apparently not the same thing as releasing the snapshot from the DB.

That SetSnapshotOnNextOperation is not what we are looking for. It's designed for setting a snapshot for controlling the commit, but does not allow any interception of the getters to get consistent reads.

In my thinking now, transactions should by default be consistent reads and consistent writes. Optimistic transactions are always consistent writes. I can't think of situations where we don't want consistent reads, if you did, you'd just use db* operations instead.

So the optimisation of setting the snapshot upon the first operation has to be done in JS-land under DBTransaction.

MatrixAI / js-db