Note that we don't need padding in our base encoding, because we have null separators between the level parts and key parts. So ignore any part of the algo that has padding.
Even forking this might work: https://github.com/multiformats/js-multiformats/blob/master/src/bases/base64.js
Furthermore, if we were to use a full 255-symbol alphabet, we would use 15-bit groups resulting in a 2-byte output symbol space. This would mean mapping a 2^15 space to a 255 * 255 space (so technically it's "base 65025" encoding).
Why 15? Because at a bit group size of 8 you would already have to map each input group to 2 output bytes, since a single output byte can only hold 255 of the 256 possible values. This remains true up to 15; at 16 bits you would need 3 output bytes to capture the full input space (2^16 > 255 * 255). So 15 is the largest bit group size for which each group still maps to only 2 output bytes.
I calculate that such an encoding would have 6.68 to 12.5 percent overhead (using the formula above, with the output length multiplied by 2). That is about half of the overhead of base128 per above. Such an encoding also needs to ensure that the 2-byte outputs are sorted appropriately to preserve lexicographic ordering.
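As a rough check of that range, here is a throwaway calculation (overhead is a hypothetical helper, not something in the codebase):

```ts
// For base 65025, every 15-bit input group maps to 2 output bytes, so for
// n input bytes: outputBytes = ceil((n * 8) / 15) * 2.
const overhead = (n: number): number => {
  const outputBytes = Math.ceil((n * 8) / 15) * 2;
  return (outputBytes / n - 1) * 100;
};

console.log(overhead(15)); // ~6.67% (input length is a multiple of 15 bytes)
console.log(overhead(16)); // 12.5% (worst alignment for long inputs)
```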
Note that there are other kinds of base encodings. However I'm not sure if they work for arbitrary strings.
To clarify:

* The alphabet does not include the 0x00 byte.
* 0x00 -> 0x01 0x01 and 0xFF -> 0x80 0x02.
* N is the total number of possible symbols, which is N = 2**bitSize.
I've installed https://github.com/multiformats/js-multiformats/blob/master/src/bases/base64.js and modified the base64 algorithm to use base128. This was pretty straightforward since all you need to do is supply an alphabet and the bits per character. The encoding and decoding both seem to be working, I just need to test for lexicographic ordering.
You mean base64 right? That's what the base128 algo I wrote above is based on.
We shouldn't need to use multibase prefix, but the base codecs enable us to just do a direct base encoding with the prefix.
Rather than installing the full js-multiformats package, extract/copy the underlying functions that do the base encoding and put them into our src/utils.ts. I can see that it uses code in bases.js. That way we aren't pulling in unnecessary packages.
Let's stick with base128 right now as it is the quickest way to solve this problem; we can optimise the overhead to base65K in the future. 15 percent overhead is acceptable.
After you've extracted the relevant code and refactored it to be minimal, we can proceed with updating the parseKey and keyPathToKey functions.
Clarifying that Buffer.compare actually sorts aa before z. So longer lengths are not sorted later automatically.
See:
const arr = [
Buffer.from([0x01]),
Buffer.from([0x00, 0x00])
];
arr.sort(Buffer.compare);
console.log(arr);
// [ <Buffer 00 00>, <Buffer 01> ]
@emmacasolin the DB should be iterating the same way then.
I just checked the rfc4648 algo and some base64 impls. My initial thinking that it was a left pad on the partial group was incorrect, it is in fact a right-pad. That means the mapped bytes above are not quite correct. For example 0x01 doesn't map to 0x01 0x02, but instead to 0x01 0x41.
Need to check if this affects the sorting algo.
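To spell that example out, here's the bit arithmetic as a quick sketch (assuming the alphabet is the raw bytes starting at 0x01):

```ts
// Right-padded (rfc4648 style) mapping for a single 0x01 input byte.
const byte = 0x01;                   // 00000001
const group1 = byte >> 1;            // 0000000 -> alphabet[0]  -> 0x01
const group2 = (byte & 0b1) << 6;    // leftover bit 1, right-padded to 1000000 -> alphabet[64] -> 0x41
console.log(group1 + 1, group2 + 1); // 1 65, i.e. 0x01 0x41
```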
Note that in order to get an equal chance of getting a 0, it's important to use Math.floor(Math.random() * 10). Otherwise 0 has a minuscule chance of appearing. So I'm changing the random creations to use Math.floor. It's a more appropriate randomInteger operation.
So I'm adding some test utilities to do this.
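Something along these lines (a sketch; the name getRandomInt is hypothetical):

```ts
// Every integer in [min, max) is equally likely, including min itself.
function getRandomInt(min: number, max: number): number {
  return Math.floor(Math.random() * (max - min)) + min;
}

getRandomInt(0, 10); // 0 to 9 inclusive, each with probability 1/10
```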
I've added some tests that demonstrate that the current base128 works and does preserve lexicographic ordering. It does 1000 different buffers of random length between 0 (inclusive) and 101 (exclusive).
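The ordering property being tested is roughly this (a sketch with the test framework omitted; randomBuffer is a hypothetical helper and encodeBase128 stands in for whatever the extracted encoder ends up being called):

```ts
import * as crypto from 'crypto';

declare function encodeBase128(input: Buffer): Buffer; // stand-in for the extracted encoder

// Random buffer of length 0 (inclusive) to 101 (exclusive)
function randomBuffer(): Buffer {
  return crypto.randomBytes(Math.floor(Math.random() * 101));
}

const inputs = Array.from({ length: 1000 }, () => randomBuffer());
const sortedRaw = [...inputs].sort(Buffer.compare);
const sortedEncoded = [...inputs].sort((a, b) =>
  Buffer.compare(encodeBase128(a), encodeBase128(b)),
);
// If base128 preserves lexicographic ordering, both sort orders agree
console.log(sortedRaw.every((buf, i) => buf.equals(sortedEncoded[i])));
```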
Furthermore there's also a sanity check test for Buffer.compare behaviour, we can confirm that:

* [] the empty buffer comes first
* [0x00, 0x00] is earlier than [0x01] - byte by byte rule
* [0x00, 0x00] is earlier than [0x00, 0x00, 0x00] - length rule

This means whatever bugs lie in lexicographic ordering now can only be in the leveldb.
We must also confirm that leveldb ordering works the same as the Buffer.compare rules.
I'll be adding a hotpatch for this after investigating the leveldb tests.
We will stay with the rfc4648 algo, so right-padding is used for the bit groups, not left-padding, but it still works fine.
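For reference, a minimal sketch of what the base128 codec extracted into src/utils.ts could look like, assuming an alphabet of the raw bytes 0x01 through 0x80 and rfc4648-style right-padding of the final partial group (the function names are hypothetical):

```ts
const bitsPerChar = 7;
// 128 consecutive byte values starting at 0x01, so output never contains 0x00
const alphabet = Array.from({ length: 128 }, (_, i) => i + 1);

function encodeBase128(input: Buffer): Buffer {
  const output: number[] = [];
  let buffer = 0; // bit accumulator
  let bits = 0;   // number of bits currently in the accumulator
  for (const byte of input) {
    buffer = (buffer << 8) | byte;
    bits += 8;
    while (bits >= bitsPerChar) {
      bits -= bitsPerChar;
      output.push(alphabet[(buffer >> bits) & 0x7f]);
    }
  }
  if (bits > 0) {
    // Partial group: right-pad with zero bits (rfc4648 style)
    output.push(alphabet[(buffer << (bitsPerChar - bits)) & 0x7f]);
  }
  return Buffer.from(output);
}

function decodeBase128(input: Buffer): Buffer {
  const output: number[] = [];
  let buffer = 0;
  let bits = 0;
  for (const byte of input) {
    buffer = (buffer << bitsPerChar) | (byte - 1); // invert the alphabet mapping
    bits += bitsPerChar;
    if (bits >= 8) {
      bits -= 8;
      output.push((buffer >> bits) & 0xff);
    }
  }
  // Any leftover bits are the right-padding and get dropped
  return Buffer.from(output);
}
```

A real implementation would also validate that every encoded byte is within the alphabet and that the trailing padding bits are zero.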
Specification
Our usage of the 0x00 null byte in our separators and escapes is causing ambiguity. To resolve this, we can apply base encoding to our level parts and key part.

Base encoding would remove any possibility of a 0x00 null byte in the output. The simplest base encoding to apply would be base255, where we have 255 possible output symbols that do not include the 0x00 byte.

By doing this, null bytes don't appear in the stored level and key parts, and thus no ambiguity during parsing is possible. In fact parsing becomes simpler since we know that null bytes indicate the beginning and end of a part.
To understand how to do this, suppose we create a base algo called base128 (double of base64), and potentially evaluate the tradeoffs against base255.
The base encoding algorithms work like this. First we start with an alphabet.
In order to ensure lexicographic order, the alphabet must be in order. Base64 naturally does not have an ordered alphabet. But we can do this if we create our own encoding algorithm.
To create a 128-symbol alphabet:
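For example, one possible construction (using the raw byte values 0x01 through 0x80 keeps the symbols in order and avoids 0x00):

```ts
// 128 consecutive byte values starting at 0x01
const alphabet = Buffer.from(Array.from({ length: 128 }, (_, i) => i + 1));
// <Buffer 01 02 03 ... 7f 80>
```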
Next we need to understand the encoding algorithm.
First we need to decide what the input bit group size is.
It is determined by 2**bitSize === numberOfSymbols. So for base128, the bit size would be 7.
This means we take an arbitrary byte sequence as input, and split into groups of bitsize 7.
Any left-over partial group has to be left-padded with the bit 0 until it reaches a group of bitsize 7. Then each 7-bit group is mapped to a symbol in our alphabet.
Here's a worked out example:
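For single input bytes, the left-padding rule above gives mappings like these (input byte on the left, the two encoded bytes on the right):

```
0x00 -> 0x01 0x01
0x01 -> 0x01 0x02
0x02 -> 0x02 0x01
0x7F -> 0x40 0x02
0x80 -> 0x41 0x01
0xFF -> 0x80 0x02
```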
Notice that the input bytes are always at an 8-bit boundary, meaning 1 byte is 8 bits, 2 bytes is 16 bits. Those bits are split into 7-bit groups. So for example 0x01 is 00000001. That is split into 0000000 as the first group, and 1 as the second group. Because the second group is less than 7 bits in length, it is left-padded to 0000001. These 2 groups are now mapped into the alphabet. The alphabet starts at 0x01, thus we get 0x01 0x02. You can work it out yourself and compare with the above mappings that I've worked out.
Notice that the resulting bytes would be in lexicographic order.
Now consider if there are 2 input bytes. The same idea applies.
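For example, two input bytes split into three 7-bit groups (again using the left-padding rule described above):

```
0x01 0x01
= 00000001 00000001
-> 0000000 | 1000000 | 01 (left-padded to 0000001)
-> 0x01 0x41 0x02
```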
Now there's a pattern to this. The output length in bytes is:
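Something along these lines, with ceil being the ceiling function (substituting bitSize for 7 gives the general form):

```
outputLength = ceil(inputLength * 8 / 7)
```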
With base128 we get about ~15% extra bytes, which is pretty good compared to base64 which increases the byte length by 33%.

The algorithm for doing the bit-wise operations would be the same as how base64 does it. So just look for a JavaScript implementation of base64 and change the constants. There would be optimal ways of doing this very quickly.
Now what about base255? Suppose we increase the size of the alphabet; surely we get more efficient algorithms that produce less overhead. Indeed there should be, but anything more than base128 ends up producing 2 output bytes for every input byte unless we do it quite smart. I haven't worked it out fully, but the same idea should apply if we use 15-bit groups that map into 2-byte outputs. However I suspect the algo may not be as efficient. Not sure.
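One straightforward way to do that mapping (a sketch, not something worked out here) is to treat each 15-bit group value as two base-255 digits, each offset past 0x00; ordering is preserved because the high digit is compared first:

```ts
// Map a 15-bit group value v (0 .. 32767) to two bytes that avoid 0x00.
const encodeGroup = (v: number): [number, number] => [
  Math.floor(v / 255) + 1, // high base-255 digit, offset so 0x00 never appears
  (v % 255) + 1,           // low base-255 digit, offset so 0x00 never appears
];
```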
Additional context
Tasks
aa is "greater" than z. So length matters too.