MatrixAI / js-db

Key-Value DB for TypeScript and JavaScript Applications
https://polykey.com
Apache License 2.0

Base128 encoding for keys #28

Closed emmacasolin closed 2 years ago

emmacasolin commented 2 years ago

Description

We need to allow all possible bytes to appear in our keys so that random IDs can be used as keys. This has been causing issues with parsing our key paths, since the separator character (0x00) could appear in the middle of a key. The solution is to base encode our keys to guarantee that the separator will not be present in the encoded version that goes into the DB. For now, we will stick with base 128 encoding, since it is more efficient than base 64 and we can allow non-printable characters. We may eventually upgrade to base 255 for additional efficiency.
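
A rough sketch of the parsing problem (the exact key layout here is illustrative, not necessarily the repo's actual format):

// Levels and keys are joined with a 0x00 separator
const sep = Buffer.from([0x00]);
const level = Buffer.from('users');
// A random ID used as a key can itself contain 0x00
const key = Buffer.from([0x01, 0x00, 0x02]);
const stored = Buffer.concat([sep, level, sep, key]);
// Scanning `stored` for 0x00 now finds a false separator inside the key,
// so the key path can't be recovered unambiguously; base 128 encoding the
// parts maps every byte into 0x01-0x80, so 0x00 only ever means "separator"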

Issues Fixed

Tasks

Final checklist


emmacasolin commented 2 years ago

I've added two utilities to src/utils:

// Create alphabet for encoding/decoding
const symbols: Array<number> = [];
// Start at 0x01 skipping 0x00
for (let i = 1; i < 129; i++) {
  symbols.push(i);
}
/**
 * Alphabet used for base encoding keys
 */
const alphabet = Buffer.from(symbols).toString('binary');

/**
 * Encode level or key part using base 128 encoding
 */
function encodePart(data: Buffer): string {
  // Start encoding
  const mask = (1 << 7) - 1;
  let out = '';
  let bits = 0; // Number of bits currently in the buffer
  let buffer = 0; // Bits waiting to be written out, MSB first
  for (let i = 0; i < data.length; ++i) {
    // Slurp data into the buffer:
    buffer = (buffer << 8) | data[i];
    bits += 8;
    // Write out as much as we can:
    while (bits > 7) {
      bits -= 7;
      out += alphabet[mask & (buffer >> bits)];
    }
  }
  // Partial character:
  if (bits) {
    out += alphabet[mask & (buffer << (7 - bits))];
  }
  return out;
}

/**
 * Decode level or string part from base 128 encoded string
 */
function decodePart(data: string): Buffer {
  const codes: Record<string, number> = {};
  for (let i = 0; i < alphabet.length; ++i) {
    codes[alphabet[i]] = i;
  }
  // Count the padding bytes:
  let end = data.length;
  while (data[end - 1] === '=') {
    --end;
  }
  // Allocate the output:
  const out = new Uint8Array(((end * 7) / 8) | 0);
  // Parse the data:
  let bits = 0; // Number of bits currently in the buffer
  let buffer = 0; // Bits waiting to be written out, MSB first
  let written = 0; // Next byte to write
  for (let i = 0; i < end; ++i) {
    // Read one character from the string:
    const value = codes[data[i]];
    if (value === undefined) {
      throw new SyntaxError(`Non-Base128 character`);
    }
    // Append the bits to the buffer:
    buffer = (buffer << 7) | value;
    bits += 7;
    // Write out some bits if the buffer has a byte's worth:
    if (bits >= 8) {
      bits -= 8;
      out[written++] = 0xff & (buffer >> bits);
    }
  }
  return Buffer.from(out);
}

They were essentially copied straight from https://github.com/multiformats/js-multiformats/blob/master/src/bases/base.js, but hard-coded to work with base 128 rather than keeping the flexibility present in the original code. I also removed padding from the encoding function since we have no need for it: https://github.com/MatrixAI/js-db/issues/27#issuecomment-1140442525.

The utilities appear to be working (encoding and decoding both succeed, and decoding returns the original data), but I still need to test for lexicographic order once encoded.
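
For instance, a quick round-trip check along these lines (illustrative):

// Key containing the 0x00 separator byte
const original = Buffer.from([0xde, 0xad, 0x00, 0xbe, 0xef]);
const encoded = encodePart(original);
// Every character of `encoded` comes from the 0x01-0x80 alphabet, never 0x00
const decoded = decodePart(encoded);
console.log(decoded.equals(original)); // true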

CMCDragonkai commented 2 years ago

Great! As for testing lexicographic order, we currently have 4 tests in DB.test.ts.

We should update these 2 tests with at least 100 random byte sequences.

  • lexicographic buffer iteration order
  • sublevel lexicographic iteration

Use nodeCrypto.randomBytes and give them a random number of bytes, like nodeCrypto.randomBytes(Math.ceil(Math.random() * 100)).

Generate a sequence of 100 of them. Then stick them into the DB as keys.

Then do an iteration, and you should expect the iteration order to be the same order as if you had sorted the original input buffers.

The tests already do this, but they're limited in the samples they test over.

As for the sublevel tests, it just needs to be done, but this time with random sequences for each level and a random number of levels.
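
A sketch of what the updated test might look like (assuming jest and the db instance set up in DB.test.ts; the put/iterator signatures mirror the snippets in this thread):

import nodeCrypto from 'crypto';

test('lexicographic iteration order over random keys', async () => {
  // `db` is the DB instance set up in the test file
  // Generate 100 unique random keys, each 1-100 bytes long
  // (deduplicated via hex so a repeated put can't shrink the expected set)
  const keyMap = new Map<string, Buffer>();
  while (keyMap.size < 100) {
    const k = nodeCrypto.randomBytes(Math.ceil(Math.random() * 100));
    keyMap.set(k.toString('hex'), k);
  }
  const keys = [...keyMap.values()];
  for (const k of keys) {
    await db.put([k], 'value');
  }
  // Collect the keys back out in iteration order
  const iterated: Array<Buffer> = [];
  for await (const [kP] of db.iterator({ values: false })) {
    iterated.push(kP[0] as Buffer);
  }
  // Iteration order should match the Buffer.compare order of the inputs
  const expected = [...keys].sort(Buffer.compare);
  expect(iterated).toStrictEqual(expected);
});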

CMCDragonkai commented 2 years ago

Also this is a more efficient way of generating the alphabet:

// Lexicographically ordered base 128 alphabet
const alphabet = Buffer.from(
  Array.from({ length: 128 }, (_, i) => {
    return i + 1;
  })
);

Note that I've left it as a buffer.

You should change your encodePart and decodePart to take and return Buffers. That way we don't need to go from buffer to string and back again.

function encodePart(part: Buffer): Buffer;

function decodePart(data: Buffer): Buffer;

Avoid serialisation/deserialisation/encoding/decoding roundtrips where possible.
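
For reference, a sketch of what the Buffer-returning variants could look like, adapted from the string versions above (the allocation sizes assume the same 8-bits-in, 7-bits-out packing; this isn't necessarily the final implementation):

/**
 * Encode level or key part using base 128 encoding
 * Returns a Buffer to avoid string round-trips
 */
function encodePart(part: Buffer): Buffer {
  const mask = (1 << 7) - 1;
  // Every 7 input bits produce one output byte
  const out = Buffer.allocUnsafe(Math.ceil((part.length * 8) / 7));
  let written = 0;
  let bits = 0; // Number of bits currently in the buffer
  let buffer = 0; // Bits waiting to be written out, MSB first
  for (let i = 0; i < part.length; ++i) {
    buffer = (buffer << 8) | part[i];
    bits += 8;
    while (bits > 7) {
      bits -= 7;
      out[written++] = alphabet[mask & (buffer >> bits)];
    }
  }
  // Partial character:
  if (bits) {
    out[written++] = alphabet[mask & (buffer << (7 - bits))];
  }
  return out;
}

/**
 * Decode level or key part from base 128 encoded Buffer
 */
function decodePart(data: Buffer): Buffer {
  const out = Buffer.allocUnsafe(((data.length * 7) / 8) | 0);
  let written = 0;
  let bits = 0; // Number of bits currently in the buffer
  let buffer = 0; // Bits waiting to be read out, MSB first
  for (let i = 0; i < data.length; ++i) {
    // Alphabet is the contiguous byte range 0x01-0x80, so index = byte - 1
    const value = data[i] - 1;
    if (value < 0 || value > 127) {
      throw new SyntaxError('Non-base128 byte');
    }
    buffer = (buffer << 7) | value;
    bits += 7;
    if (bits >= 8) {
      bits -= 8;
      out[written++] = 0xff & (buffer >> bits);
    }
  }
  return out;
}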

emmacasolin commented 2 years ago

Do we want db.dump to decode the keys? Considering it's mainly used for debugging, it might be helpful to see interpretable data; however, it would then no longer be dumping exactly what's in the db, which could add complexity/confusion as well. Maybe there should be an optional parameter controlling whether the dump is encoded or decoded?

CMCDragonkai commented 2 years ago

The dump should be decoding the keys. The usage of raw here indicates whether we return Buffers for both keys and values.

The dumped return type is currently:

Promise<Array<[string | Buffer, any]>>

So if we really want to be precise, we would want to do something like:

Promise<Array<[KeyPath, any]>>

Where raw controls whether the any is a Buffer or some value decoded from JSON, and KeyPath can be Array<string | Buffer>. I suppose raw true would give you Buffer, while raw false would give you string.
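
In TypeScript that could be sketched with overloads along these lines (the parameter list here is hypothetical; only the raw-dependent return types are the point):

type KeyPath = Array<string | Buffer>;

// raw false (or omitted): keys as strings, values decoded from JSON
declare function dump(raw?: false): Promise<Array<[KeyPath, any]>>;
// raw true: keys and values returned as Buffers
declare function dump(raw: true): Promise<Array<[KeyPath, Buffer]>>;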

CMCDragonkai commented 2 years ago

We can release this as 4.0.2 once it is ready. External API hasn't changed, and it's a bug fix.

emmacasolin commented 2 years ago

Setting the keyAsBuffer option to false for the iterator doesn't seem to make a difference to the expected output type. So, for example, this throws a type error since the compiler doesn't know whether kP is an array of strings or buffers.

const keysIterated: Array<string> = [];
for await (const [kP] of db.iterator({ keyAsBuffer: false, values: false })) {
  keysIterated.push(kP[0]); // error here
}

CMCDragonkai commented 2 years ago

Setting the keyAsBuffer option to false for the iterator doesn't seem to make a difference to the expected output type. So, for example, this throws a type error since the compiler doesn't know whether kP is an array of strings or buffers.

const keysIterated: Array<string> = [];
for await (const [kP] of db.iterator({ keyAsBuffer: false, values: false })) {
  keysIterated.push(kP[0]); // error here
}

That's expected. Use an as string type assertion.
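
For example, applied to the snippet above:

const keysIterated: Array<string> = [];
for await (const [kP] of db.iterator({ keyAsBuffer: false, values: false })) {
  keysIterated.push(kP[0] as string); // safe: keyAsBuffer false yields string parts
}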

emmacasolin commented 2 years ago

We should update these 2 tests with at least 100 random byte sequences.

  • lexicographic buffer iteration order
  • sublevel lexicographic iteration

Use nodeCrypto.randomBytes and give them a random number of bytes, like nodeCrypto.randomBytes(Math.ceil(Math.random() * 100)).

Generate a sequence of 100 of them. Then stick them into the DB as keys.

Then do an iteration, and you should expect the iteration order to be the same order as if you had sorted the original input buffers.

For the lexicographic buffer iteration order test, randomising the byte length doesn't affect the lexicographic ordering itself; however, at the end of the test we convert all of the (lexicographically sorted) buffers into big-endian numbers and expect them to be in numeric order. This was fine for buffers of equal byte length, but it doesn't hold for random byte lengths. (For example, Buffer.from([0x01, 0x00]) sorts lexicographically before Buffer.from([0x02]), yet as big-endian numbers 256 > 2.) Should this part just be removed from the test?

CMCDragonkai commented 2 years ago

The big integer conversion was only for sanity checking? So it can be removed now, as long as we know that leveldb iteration order matches Buffer.compare order.

emmacasolin commented 2 years ago

We should update these 2 tests with at least 100 random byte sequences.

  • lexicographic buffer iteration order
  • sublevel lexicographic iteration

Use nodeCrypto.randomBytes and give them a random number of bytes, like nodeCrypto.randomBytes(Math.ceil(Math.random() * 100)).

Generate a sequence of 100 of them. Then stick them into the DB as keys.

Then do an iteration, and you should expect the iteration order to be the same order as if you had sorted the original input buffers.

The tests already do this, but they're limited in the samples they test over.

As for the sublevel tests, it just needs to be done, but this time with random sequences for each level and a random number of levels.

Rather than modifying the sublevel lexicographic iteration test, I've added a new test called sublevel lexicographic buffer iteration (since the existing test tests iteration over a sublevel rather than over all levels). For the new test, I'm generating 100 random arrays of buffers, with each array having a random length of 0-10 levels, and each level having a random length of 0-10 bytes (so the total level paths are between 0 and 100 bytes).

In order to turn these paths into keys (so that they can be compared for lexicographic order) I'm not using the keyPathToKey utility, since that encodes the key/level parts. Instead, I'm reimplementing that logic in the test and just omitting the encoding.
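
A sketch of the generation and the unencoded key construction (the separator placement is my assumption of what keyPathToKey does, minus the encoding; the real logic lives in src/utils):

import nodeCrypto from 'crypto';

// 100 random key paths: 0-10 levels, each level 0-10 random bytes
const keyPaths: Array<Array<Buffer>> = Array.from({ length: 100 }, () =>
  Array.from({ length: Math.floor(Math.random() * 11) }, () =>
    nodeCrypto.randomBytes(Math.floor(Math.random() * 11)),
  ),
);

// Join a key path into a single key with 0x00 separators but no encoding
// (hypothetical reimplementation of keyPathToKey minus encodePart)
function keyPathToKeyUnencoded(keyPath: Array<Buffer>): Buffer {
  const sep = Buffer.from([0x00]);
  const levels = keyPath.slice(0, -1);
  const keyPart = keyPath[keyPath.length - 1] ?? Buffer.alloc(0);
  return Buffer.concat([
    ...levels.flatMap((level) => [sep, level, sep]),
    keyPart,
  ]);
}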

With that, this PR just needs to be squashed and rebased and it should be ready to merge into staging.

CMCDragonkai commented 2 years ago

In order to turn these paths into keys (so that they can be compared for lexicographic order) I'm not using the keyPathToKey utility, since that encodes the key/level parts. Instead, I'm reimplementing that logic in the test and just omitting the encoding.

In reimplementing the logic, do you mean you're just taking each buffer part of the keypath and concatenating them together into a buffer without the null byte separators and without the encoding? That's one way to do it.

Then when testing iteration order, you only need to iterate from the root data level, and see that the keys being iterated match the expected order of all the input buffers.

emmacasolin commented 2 years ago

In order to turn these paths into keys (so that they can be compared for lexicographic order) I'm not using the keyPathToKey utility, since that encodes the key/level parts. Instead, I'm reimplementing that logic in the test and just omitting the encoding.

In reimplementing the logic, do you mean you're just taking each buffer part of the keypath and concatenating them together into a buffer without the null byte separators and without the encoding? That's one way to do it.

Then when testing iteration order, you only need to iterate from the root data level, and see that the keys being iterated match the expected order of all the input buffers.

I'm adding the separators but not the encoding. This all happens after the keys have gone into the db and been iterated over: the key paths are put into the db as key paths, and the concatenation with separators is applied to both the original key paths and the returned iterated key paths before sorting the original ones and comparing the order. The separators are needed because they affect the lexicographic order, and the iterated keys come back in that order. See the sketch below.
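
Concretely, the comparison step might look like this (using the hypothetical keyPathToKeyUnencoded helper sketched earlier; keyPaths are the original inputs and iteratedKeyPaths are the key paths returned from db.iterator):

// Both sides get the same separator-concatenation treatment
const expected = keyPaths
  .map(keyPathToKeyUnencoded)
  .sort(Buffer.compare);
const actual = iteratedKeyPaths.map(keyPathToKeyUnencoded);
// The iterated keys should come back in lexicographic order of the joined keys
expect(actual).toStrictEqual(expected);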