MatrixAI / js-db

Key-Value DB for TypeScript and JavaScript Applications
https://polykey.com
Apache License 2.0

Base Encoding for Keys #27

Closed CMCDragonkai closed 2 years ago

CMCDragonkai commented 2 years ago

Specification

Our usage of the 0x00 null byte in our separators and escapes is causing ambiguity.

To resolve this, we can apply base encoding to our level parts and key part.

Base encoding would remove any possibility of 0x00 null byte in the output.

The simplest base encoding to apply would be base255, where we have 255 possible output symbols, none of which is the 0x00 byte.

By doing this, null bytes don't appear in the stored level and key parts, and thus no ambiguity during parsing is possible. In fact parsing becomes simpler since we know that null bytes indicate the beginning and end of a part.

To understand how to do this, suppose we create a base encoding called base128 (double of base64), and then evaluate the tradeoffs against base255.

The base encoding algorithms work like this. First we start with an alphabet.

In order to ensure lexicographic order, the alphabet must be in order. Base64's standard alphabet is naturally not in byte order. But we can ensure an ordered alphabet if we create our own encoding algorithm.

To create an 128-symbol alphabet:

const symbols: Array<number> = [];
// Start at 0x01 skipping 0x00
for (let i = 1; i < 129; i++) {
  symbols.push(i);
}
const alphabet = Buffer.from(symbols).toString('binary');
console.dir(alphabet);

Next we need to understand the encoding algorithm.

First we need to decide on the input bit-group size.

It is determined by 2**bitSize === numberOfSymbols.

2**7 === 128

So for a base128, the bit size would be 7.

This means we take an arbitrary byte sequence as input, and split into groups of bitsize 7.

Any left-over partial group has to be left-padded with 0 bits until it reaches a group of bitsize 7.

Then for each 7-group we map these to a symbol in our alphabet.
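
To make this concrete, here's a minimal sketch of the encoder just described (encodeBase128 is a hypothetical name; it assumes the alphabet built above and the left-padding described here — note that a later comment in this thread revises the padding direction to rfc4648-style right-padding):

function encodeBase128(input: Buffer, alphabet: string): Buffer {
  const out: Array<number> = [];
  let acc = 0;     // bit accumulator
  let accBits = 0; // number of bits currently held in the accumulator
  for (const byte of input) {
    acc = (acc << 8) | byte;
    accBits += 8;
    while (accBits >= 7) {
      accBits -= 7;
      // Take the top 7 bits as one group and map it into the alphabet
      out.push(alphabet.charCodeAt((acc >> accBits) & 0x7f));
      acc &= (1 << accBits) - 1; // drop the consumed bits
    }
  }
  if (accBits > 0) {
    // Left-padding with 0 bits leaves the partial group's value unchanged
    out.push(alphabet.charCodeAt(acc));
  }
  return Buffer.from(out);
}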

Here's a worked out example:

Input Byte -> Input Binary -> Input Split as Bit Group -> Output Mapped Decimal -> Output Hex
0x00 -> 0000 0000 -> 0000000 0000000 -> 1 1 -> 0x01 0x01
0x01 -> 0000 0001 -> 0000000 0000001 -> 1 2 -> 0x01 0x02
0x02 -> 0000 0010 -> 0000001 0000000 -> 2 1 -> 0x02 0x01
0x03 -> 0000 0011 -> 0000001 0000001 -> 2 2 -> 0x02 0x02
0x04 -> 0000 0100 -> 0000010 0000000 -> 3 1 -> 0x03 0x01
0x05 -> 0000 0101 -> 0000010 0000001 -> 3 2 -> 0x03 0x02
0x06 -> 0000 0110 -> 0000011 0000000 -> 4 1 -> 0x04 0x01
0x07 -> 0000 0111 -> 0000011 0000001 -> 4 2 -> 0x04 0x02
0x08 -> 0000 1000 -> 0000100 0000000 -> 5 1 -> 0x05 0x01
...
0xfe -> 1111 1110 -> 1111111 0000000 -> 128 1 -> 0x80 0x01
0xff -> 1111 1111 -> 1111111 0000001 -> 128 2 -> 0x80 0x02

What about 2 byte inputs?

0x00 0x00 -> 0000 0000 0000 0000 -> 0000000 0000000 0000000 -> 1 1 1   -> 0x01 0x01 0x01
0x00 0x01 -> 0000 0000 0000 0001 -> 0000000 0000000 0000001 -> 1 1 2   -> 0x01 0x01 0x02
...
0xfe 0x00 -> 1111 1110 0000 0000 -> 1111111 0000000 0000000 -> 128 1 1 -> 0x80 0x01 0x01
0xff 0x00 -> 1111 1111 0000 0000 -> 1111111 1000000 0000000 -> 128 65 1 -> 0x80 0x41 0x01

Notice that the input bytes are always at an 8-bit boundary, meaning 1 byte is 8 bits and 2 bytes is 16 bits. Those bits are split into 7-bit groups. So for example 0x01 is 00000001. That is split into 0000000 as the first group, and 1 as the second group. Because the second group is less than 7 bits in length, it is left-padded to 0000001. These 2 groups are then mapped into the alphabet. The alphabet starts at 0x01, thus we get 0x01 0x02.

You can work it out yourself and compare with the above mappings that I've worked out.
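
Or spot-check a couple of rows with the hypothetical encodeBase128 sketch above:

encodeBase128(Buffer.from([0x01]), alphabet);       // <Buffer 01 02>
encodeBase128(Buffer.from([0xff, 0x00]), alphabet); // <Buffer 80 41 01>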

Notice that the resulting bytes would be in lexicographic order.

Now consider if there are 2 input bytes. The same idea applies.

Now there's a pattern to this. The output length of bytes is:

outputByteLength = Math.ceil((inputByteLength * 8) / bitSize)

With base128 we get about ~15% extra bytes, which is pretty good compared to base64, which increases the byte length by about 33%.
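
As a quick sanity check of that overhead claim (the helper here just restates the formula above):

const outputByteLength = (inputByteLength: number, bitSize: number) =>
  Math.ceil((inputByteLength * 8) / bitSize);

outputByteLength(32, 7); // 37 bytes, ~15.6% overhead (base128)
outputByteLength(32, 6); // 43 bytes, ~34% overhead (unpadded base64)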

The algorithm for doing the bit-wise operations would be the same as how base64 does it. So just look for a JavaScript implementation of base64 and change the constants. There are also optimised ways of doing this very quickly.

Now what about base255? Suppose we increase the size of the alphabet; surely we get more efficient encodings that produce less overhead. Indeed there should be, but anything more than base128 ends up producing 2 output bytes for each input byte, unless we do it quite smartly. I haven't worked it out fully, but the same idea should apply if we were to use 15-bit groups that map into 2-byte outputs. However I suspect the algo may not be as efficient. Not sure.

Additional context

  1. https://github.com/zanaptak/BinaryToTextEncoding
  2. https://stackoverflow.com/a/23170407/582917

Tasks

  1. Look for an open-source base64 algo (preferably written in JS)
  2. The nodejs source code is good, buffer encoding already supports base64, so you can just look there, or https://github.com/beatgammit/base64-js/blob/master/index.js
  3. Reimplement it but swap the alphabet with the one created above and change the bit-group size from 6 to 7. Base64 normally uses a bit-group size of 6 (as 2**6 === 64)
  4. Then fuzz test with random keys to prove that lexicographic sorting works; remember that length matters too, e.g. how "aa" sorts relative to "z" (see the sketch below)
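
A minimal sketch of such a fuzz test, assuming the hypothetical encodeBase128 and alphabet from the sketches above:

import crypto from 'crypto';

for (let i = 0; i < 1000; i++) {
  const a = crypto.randomBytes(Math.floor(Math.random() * 100));
  const b = crypto.randomBytes(Math.floor(Math.random() * 100));
  // The encoding must preserve the relative order of any two random keys
  const order = Math.sign(Buffer.compare(a, b));
  const orderEncoded = Math.sign(
    Buffer.compare(encodeBase128(a, alphabet), encodeBase128(b, alphabet)),
  );
  if (order !== orderEncoded) {
    throw new Error('lexicographic order not preserved');
  }
}
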
CMCDragonkai commented 2 years ago

Note that we don't need padding in our base encoding, because we have null separators between the level parts and key parts. So ignore any part of the algo that has padding.

CMCDragonkai commented 2 years ago

Even forking this might work: https://github.com/multiformats/js-multiformats/blob/master/src/bases/base64.js

CMCDragonkai commented 2 years ago

Padding is unnecessary because of https://stackoverflow.com/questions/4080988/why-does-base64-encoding-require-padding-if-the-input-length-is-not-divisible-by#:~:text=Padding%20allows%20us%20to%20decode,measuring%20in%20three%20byte%20bundles.

CMCDragonkai commented 2 years ago

Furthermore if we were to use a full 255-symbol alphabet, we would use 15-bit groups resulting in a 2-byte output symbol space. This would mean mapping the 2^15 space into the 255 * 255 space (so technically it's "base 65025" encoding).

Why 15? Because from a bit-group size of 8 upwards, at minimum you would have to map each input group to 2 output bytes. This remains true up to 15; at 16 you would need 3 bytes to capture the full input space. So 15 is the largest bit-group size that still maps optimally to 2 output bytes.

I calculate that such an encoding would have 6.68 to 12.5 percent overhead (using the formula above, multiplied by 2), which is about half of the overhead of base128 per above. Such an encoding also needs to ensure that the 2-byte output symbols are ordered appropriately to ensure lexicographic ordering.
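
Checking those bounds directly (the helper name is just illustrative):

// Each 15-bit group maps to 2 output bytes
const outLen15 = (inputByteLength: number) =>
  Math.ceil((inputByteLength * 8) / 15) * 2;

outLen15(15); // 16 bytes, ~6.7% overhead
outLen15(16); // 18 bytes, 12.5% overhead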

Note that there are other kinds of base encodings. However I'm not sure if they work for arbitrary strings.


To clarify:

  1. Base255 is an alphabet of 255 symbols. It doesn't have the 0x00 byte.
  2. Base255 would require 2 bytes as output-space, because in order to represent all 256 input symbols, one symbol would have to map to a 2-byte output.
  3. In order to simplify parsing, you would choose to map all input symbols to 2-byte outputs.
  4. For example 0x00 -> 0x01 0x01 and 0xFF -> 0x02 0x01.
  5. Such a base255 would end up using a 8-bitsize group.
  6. Since you are using 2 bytes as the output space, you're not really making full use of the output space, it's inefficient utilisation.
  7. Therefore you should go to a 15-bit group instead, because it is the last bit size before 16, which would require 3 bytes as the output space.
  8. But once you go to a 15-bit group, you're really using a much larger alphabet; it's sort of a misnomer to call it base255, because your alphabet's symbols are actually 2-byte symbols. Thus it's more correct to call it baseN where N is the total number of possible symbols, which is N = 2**bitSize.
  9. Any baseN encoding that requires a 2-byte (or larger) output space needs to ensure that the output bytes are ordered in such a way that lexicographic order is maintained (see the sketch below).
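
For example, a hypothetical mapping of a 15-bit group value to an ordered 2-byte symbol that avoids 0x00 could look like:

// Map a group value in 0..32767 to a big-endian byte pair, skipping 0x00
function groupToSymbol(v: number): [number, number] {
  const hi = Math.floor(v / 255) + 1; // 0x01..0x81
  const lo = (v % 255) + 1;           // 0x01..0xff
  // Emitting hi before lo keeps byte-wise comparison aligned with v
  return [hi, lo];
}
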
emmacasolin commented 2 years ago

I've installed https://github.com/multiformats/js-multiformats/blob/master/src/bases/base64.js and modified the base64 algorithm to use base128. This was pretty straightforward since all you need to do is supply an alphabet and the bits per character. The encoding and decoding both seem to be working, I just need to test for lexicographic ordering.
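
Presumably the modification looks something like the following, following the shape of js-multiformats' base64.js (the rfc4648 helper is from that codebase; the base128 definition and the prefix here are assumptions):

import { rfc4648 } from './base.js';

// 128-symbol ordered alphabet: bytes 0x01..0x80 (skipping 0x00)
const alphabet = Buffer.from(
  Array.from({ length: 128 }, (_, i) => i + 1),
).toString('binary');

export const base128 = rfc4648({
  prefix: '1', // placeholder multibase prefix; unused if calling baseEncode directly
  name: 'base128',
  alphabet,
  bitsPerChar: 7,
});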

CMCDragonkai commented 2 years ago

You mean base64 right? That's what the base128 algo I wrote above is based on.

CMCDragonkai commented 2 years ago

We shouldn't need to use the multibase prefix; the base codecs still enable us to do a direct base encoding without the prefix.

Rather than installing the full js-multiformats package, extract/copy the underlying functions that do the base encoding and put them into our src/utils.ts. I can see that it uses code in bases.js.

CMCDragonkai commented 2 years ago

That way we aren't pulling in unnecessary packages.

CMCDragonkai commented 2 years ago

Let's stick with base128 right now as it is the quickest way to solve this problem; we can optimise the overhead to base65K in the future. 15 percent overhead is acceptable.

After you've extracted the relevant code and refactored it to be minimal we can proceed with updating the parseKey and keyPathToKey functions.

CMCDragonkai commented 2 years ago

Clarifying: Buffer.compare actually sorts aa before z, so longer lengths are not sorted later automatically.

See:

const arr = [
  Buffer.from([0x01]),
  Buffer.from([0x00, 0x00])
];
arr.sort(Buffer.compare);
console.log(arr);
// [ <Buffer 00 00>, <Buffer 01> ]

@emmacasolin the DB should be iterating the same way then.

CMCDragonkai commented 2 years ago

I just checked the rfc4648 algo and some base64 impls. My initial thinking that it was a left pad on the partial group was incorrect; it is in fact a right pad. That means the mapped bytes above are not quite correct. For example 0x01 doesn't map to 0x01 0x02, but instead to 0x01 0x41.
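
In terms of the hypothetical encodeBase128 sketch from earlier, the change is confined to the final partial group:

if (accBits > 0) {
  // rfc4648-style right-padding: shift the partial group's bits to the high end
  out.push(alphabet.charCodeAt((acc << (7 - accBits)) & 0x7f));
}
// e.g. 0x01 -> groups 0000000, 1 -> right-padded to 1000000 (64) -> 0x01 0x41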

CMCDragonkai commented 2 years ago

Need to check if this affects the sorting algo.

CMCDragonkai commented 2 years ago

Note that in order to get an equal chance of getting a 0, it's important to use Math.floor(Math.random() * 10).

Otherwise 0 has a minuscule chance of appearing.

So I'm changing the random creations to using Math.floor. It's a more appropriate randomInteger operation.

So I'm adding some test utilities to do this.
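
A minimal sketch of such a utility (randomInt is a hypothetical name):

// Uniform over [min, max): every value, including min, is equally likely
function randomInt(min: number, max: number): number {
  return Math.floor(Math.random() * (max - min)) + min;
}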

CMCDragonkai commented 2 years ago

I've added some tests that demonstrate that the current base128 works and does preserve lexicographic ordering. It tests 1000 different buffers of random length between 0 (inclusive) and 101 (exclusive).

Furthermore there's also a sanity check test for Buffer.compare behaviour, confirming the ordering rules described above.

This means whatever bugs lie in lexicographic ordering now can only be in the leveldb.

We must also confirm that leveldb ordering works the same as Buffer.compare rules.
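
A sketch of that confirmation, assuming a levelup-style db handle with binary key encoding (API details vary by version):

import crypto from 'crypto';

const keys: Array<Buffer> = [];
const seen = new Set<string>();
for (let i = 0; i < 1000; i++) {
  const key = crypto.randomBytes(1 + Math.floor(Math.random() * 100));
  if (seen.has(key.toString('hex'))) continue; // skip duplicate keys
  seen.add(key.toString('hex'));
  keys.push(key);
  await db.put(key, Buffer.alloc(0));
}
keys.sort(Buffer.compare);
let i = 0;
for await (const key of db.createKeyStream()) {
  // Iteration order should match the Buffer.compare-sorted order
  if (Buffer.compare(key, keys[i++]) !== 0) {
    throw new Error('leveldb ordering differs from Buffer.compare');
  }
}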

CMCDragonkai commented 2 years ago

I'll be adding a hotpatch for this after investigating the leveldb tests.

CMCDragonkai commented 2 years ago

We will stay with the rfc4648 algo, so right-padding is used for the bit groups, not left-padding, but it still works fine.