MatrixAI / js-db

Key-Value DB for TypeScript and JavaScript Applications
https://polykey.com
Apache License 2.0
5 stars 0 forks source link

Intermittent test failure for write skew `getForUpdate` #42

Closed CMCDragonkai closed 2 years ago

CMCDragonkai commented 2 years ago

Describe the bug

Found intermittent test failures:

        ✕ transactionGetForUpdate addresses write skew by promoting gets into same-value puts (97 ms)
        ✕ transactionMultiGetForUpdate addresses write skew by promoting gets into same-value puts (96 ms)

These might be not deterministic on slower machines.

  ● rocksdbP › database › transactions › transactionGetForUpdate addresses write skew by promoting gets into same-value puts
    expect(received).toBe(expected) // Object.is equality
    Expected: true
    Received: false
      290 |             );
      291 |           }),
    > 292 |         ).toBe(true);
          |           ^
      293 |       });
      294 |       test('transactionMultiGetForUpdate addresses write skew by promoting gets into same-value puts', async () => {
      295 |         // Snapshot isolation allows write skew anomalies to occur
      at Object.<anonymous> (tests/rocksdb/rocksdbP.test.ts:292:11)
  ● rocksdbP › database › transactions › transactionMultiGetForUpdate addresses write skew by promoting gets into same-value puts
    expect(received).toBe(expected) // Object.is equality
    Expected: true
    Received: false
      366 |             );
      367 |           }),
    > 368 |         ).toBe(true);
          |           ^
      369 |       });
      370 |       test('transactionIteratorInit iterates over overlay defaults to underlay', async () => {
      371 |         await rocksdbP.dbPut(db, 'K1', '100', {});
      at Object.<anonymous> (tests/rocksdb/rocksdbP.test.ts:368:11)

To Reproduce

  1. See: https://gitlab.com/MatrixAI/open-source/js-db/-/jobs/2720130224

Expected behavior

Tests should be deterministic.

CMCDragonkai commented 2 years ago

It looks that these 2 failures are exactly where I'm expecting there to be a transaction conflict.

        expect(
          results.some((result) => {
            return (
              result.status === 'rejected' &&
              result.reason.code === 'TRANSACTION_CONFLICT'
            );
          }),
        ).toBe(true);
CMCDragonkai commented 2 years ago

It could be that if it runs fast enough, that the 2 transactions end up working and no transaction conflict occurs.

I need to run this a lot and see if this is still an issue.

CMCDragonkai commented 2 years ago

I was able to replicate it too.

CMCDragonkai commented 2 years ago

In order to make sure these tests are deterministic, we need to ensure that there is a synchronisation barrier between t1 and t2. https://en.wikipedia.org/wiki/Barrier_(computer_science)

This means both functions must only proceed to run after both side's transactionGetForUpdate or transactionMultiGetForUpdate calls are done.

One way to create an async barrier is to use a semaphore. Right now our @matrixai/async-locks doesn't abstract over the semaphore that async-mutex provides.

Imagine something like:

const barrier = new Barrier(2);

const t1 = async () => {
  await barrier();
};

const t2 = async () => {
  await barrier();
};

I've used something like this before in the PK code to allow promises as way to plug or unplug something like in queues. But promises in this way is just another way of "signalling" to another place in code. I believe these are called conditions https://en.wikipedia.org/wiki/Monitor_(synchronization).

We may need to use more of this to make our concurrent tests more deterministic. However I'd like a more comprehensive framework for this sort of stuff like from fast-check in the future.

CMCDragonkai commented 2 years ago

It's difficult to adapt the semaphore to this. Mainly because a barrier is the opposite of a semaphore. The semaphore allows X number of concurrent runners, after which they must be blocking. What we are saying, is that we need X approvals before it is allowed to proceed, after which all requests can proceed. The X approvals in this case comes from 2 places in the code: t1 and t2.

The semaphore can't take negative numbers, but also probably doesn't behave properly.

As an alternative we can have a counter and a lock. The lock starts out as already locked. When the counter reaches a designated number, then the lock is released.

const l = new Lock();
const release = await l.lock();
const c = 0;
const level = 2;
function barrier() {
  c++;
  if (c == level) {
    release();
  } else {
    await l.waitForUnlock();
  }
}

const t1 = async () => {
  await barrier();
};

const t2 = async () => {
  await barrier();
};
CMCDragonkai commented 2 years ago

This should be solved by 9ad374fc56b4ae2df8ca51480d8dc625ad526417. It adds in the usage of Barrier to ensure concurrent execution order.