I've added two utilities to `src/utils`:
```ts
// Create alphabet for encoding/decoding
const symbols: Array<number> = [];
// Start at 0x01 skipping 0x00
for (let i = 1; i < 129; i++) {
  symbols.push(i);
}

/**
 * Alphabet used for base encoding keys
 */
const alphabet = Buffer.from(symbols).toString('binary');

/**
 * Encode level or key part using base 128 encoding
 */
function encodePart(data: Buffer): string {
  // Start encoding
  const mask = (1 << 7) - 1;
  let out = '';
  let bits = 0; // Number of bits currently in the buffer
  let buffer = 0; // Bits waiting to be written out, MSB first
  for (let i = 0; i < data.length; ++i) {
    // Slurp data into the buffer:
    buffer = (buffer << 8) | data[i];
    bits += 8;
    // Write out as much as we can:
    while (bits > 7) {
      bits -= 7;
      out += alphabet[mask & (buffer >> bits)];
    }
  }
  // Partial character:
  if (bits) {
    out += alphabet[mask & (buffer << (7 - bits))];
  }
  return out;
}

/**
 * Decode level or key part from base 128 encoded string
 */
function decodePart(data: string): Buffer {
  const codes: Record<string, number> = {};
  for (let i = 0; i < alphabet.length; ++i) {
    codes[alphabet[i]] = i;
  }
  // Count the padding bytes:
  let end = data.length;
  while (data[end - 1] === '=') {
    --end;
  }
  // Allocate the output:
  const out = new Uint8Array((end * 7 / 8) | 0);
  // Parse the data:
  let bits = 0; // Number of bits currently in the buffer
  let buffer = 0; // Bits waiting to be written out, MSB first
  let written = 0; // Next byte to write
  for (let i = 0; i < end; ++i) {
    // Read one character from the string:
    const value = codes[data[i]];
    if (value === undefined) {
      throw new SyntaxError(`Non-Base128 character`);
    }
    // Append the bits to the buffer:
    buffer = (buffer << 7) | value;
    bits += 7;
    // Write out some bits if the buffer has a byte's worth:
    if (bits >= 8) {
      bits -= 8;
      out[written++] = 0xff & (buffer >> bits);
    }
  }
  return Buffer.from(out);
}
```
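As a quick illustration of the roundtrip (not part of the actual changes, the input bytes are arbitrary), note that the alphabet starts at `0x01`, so an encoded part can never contain the `0x00` separator:

```ts
// Illustrative roundtrip check
const original = Buffer.from([0x00, 0x01, 0xfe, 0xff]);
const encoded = encodePart(original);
// `encoded` contains characters 0x01-0x80 only, never the 0x00 separator
const decoded = decodePart(encoded);
console.log(decoded.equals(original)); // true
```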
They were essentially copied straight from https://github.com/multiformats/js-multiformats/blob/master/src/bases/base.js, but hard-coded to work with base 128 rather than keeping the flexibility present in the original code. I also removed padding from the encoding function since we have no need for it (see https://github.com/MatrixAI/js-db/issues/27#issuecomment-1140442525).
The utilities appear to be working (they can encode and decode, and the decoded output matches the original), but I still need to test for lexicographic order once encoded.
Great! As for testing lexicographic order, we currently have 4 tests in `DB.test.ts`:

- `lexicographic iteration order`
- `lexicographic buffer iteration order`
- `lexicographic integer iteration`
- `sublevel lexicographic iteration`
We should update these 2 tests with at least 100 random byte sequences:

- `lexicographic buffer iteration order`
- `sublevel lexicographic iteration`

Use `nodeCrypto.randomBytes` and give them a random number of bytes, like `nodeCrypto.randomBytes(Math.ceil(Math.random() * 100))`.

Generate a sequence of 100 of them, then stick them into the DB as keys. Then do an iteration, and you should expect the iteration order to be the same as if you had sorted the original input buffers (see the sketch below). The tests already do this, but they are limited in the samples that they are testing over.

As for the sublevel tests, the same just needs to be done, but this time with random sequences for each level and a random number of levels.
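A minimal sketch of what that updated test could look like, assuming the Jest setup in `DB.test.ts`, a `db` instance whose `put` accepts a key path array, and the iterator shape shown later in this thread (the test name and value payload are made up):

```ts
import * as nodeCrypto from 'crypto';

declare const db: any; // assumed DB instance from the test setup

test('lexicographic iteration order over random byte sequences', async () => {
  // 100 keys, each a random length between 1 and 100 bytes
  const keys = Array.from({ length: 100 }, () =>
    nodeCrypto.randomBytes(Math.ceil(Math.random() * 100)),
  );
  for (const key of keys) {
    await db.put([key], 'value');
  }
  // Collect keys back in iteration order
  const iterated: Array<Buffer> = [];
  for await (const [kP] of db.iterator({ values: false })) {
    iterated.push(kP[0] as Buffer);
  }
  // LevelDB iteration order should match Buffer.compare order of the inputs
  const expected = [...keys].sort(Buffer.compare);
  expect(iterated).toEqual(expected);
});
```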
Also, this is a more efficient way of generating the alphabet:

```ts
// Lexicographically ordered base 128 alphabet
const alphabet = Buffer.from(
  Array.from({ length: 128 }, (_, i) => {
    return i + 1;
  })
);
```
Note that I've left it as a buffer. You should change your `encodePart` and `decodePart` to return it as a `Buffer`. That way we don't need to go buffer to string and back again.
```ts
function encodePart(part: Buffer): Buffer;
function decodePart(data: Buffer): Buffer;
```
Avoid serialisation/deserialisation/encoding/decoding roundtrips where possible.
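As one possible direction (a sketch only, not the final implementation), `encodePart` could write into a pre-allocated `Buffer` and index into the `Buffer` alphabet directly, avoiding the intermediate string; `decodePart` would be adapted similarly:

```ts
// Sketch of a Buffer-returning encodePart, assuming the 128-symbol
// `alphabet` Buffer defined above
function encodePart(part: Buffer): Buffer {
  const mask = (1 << 7) - 1;
  // One output byte per 7 bits of input (final partial group included)
  const out = Buffer.alloc(Math.ceil((part.length * 8) / 7));
  let bits = 0; // Number of bits currently in the accumulator
  let buffer = 0; // Bits waiting to be written out, MSB first
  let written = 0; // Next output byte to write
  for (let i = 0; i < part.length; ++i) {
    buffer = (buffer << 8) | part[i];
    bits += 8;
    while (bits > 7) {
      bits -= 7;
      out[written++] = alphabet[mask & (buffer >> bits)];
    }
  }
  // Partial group, padded with zero bits on the right
  if (bits) {
    out[written++] = alphabet[mask & (buffer << (7 - bits))];
  }
  return out;
}
```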
Do we want `db.dump` to decode the keys? Considering it's mainly used for debugging, it might be helpful to be able to see interpretable data; however, it means that it's no longer dumping exactly what's in the DB, so that could add additional complexity/confusion as well. Maybe there should be an optional parameter for whether you want the dump to be encoded or decoded?
The `dump` should be decoding the keys. The usage of `raw` here indicates whether we return a `Buffer` for both keys and values.

The dumped return type is currently:

```ts
Promise<Array<[string | Buffer, any]>>
```

So if we really want to be precise, we would want to do something like:

```ts
Promise<Array<[KeyPath, any]>>
```

Where `raw` controls whether the `any` is a `Buffer` or some value decoded from JSON, and `KeyPath` can be `[string | Buffer]`. I suppose `raw` true would give you `Buffer`, while `raw` false would give you `string`.
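A rough sketch of how those return types could be expressed (the `dump` signature and names here are assumptions for illustration, not necessarily js-db's actual API):

```ts
// Key parts may be strings or Buffers depending on `raw`
type KeyPath = Array<string | Buffer>;

// Hypothetical overloads: `raw: true` keeps keys and values as Buffers,
// while `raw: false` (the default) gives string key parts and JSON-decoded values
declare function dump(raw: true): Promise<Array<[Array<Buffer>, Buffer]>>;
declare function dump(raw?: false): Promise<Array<[Array<string>, any]>>;
```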
We can release this as 4.0.2 once it is ready. External API hasn't changed, and it's a bug fix.
Setting the `keyAsBuffer` option to false for the iterator doesn't seem to make a difference to the expected output. So, for example, this throws a type error since the compiler doesn't know if `kP` is an array of strings or buffers:

```ts
const keysIterated: Array<string> = [];
for await (const [kP] of db.iterator({ keyAsBuffer: false, values: false })) {
  keysIterated.push(kP[0]); // error here
}
```
That's expected. Use an assertion `as string`.
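Applied to the snippet above, that would look something like:

```ts
keysIterated.push(kP[0] as string);
```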
For the `lexicographic buffer iteration order` test, randomising the byte length doesn't impact the lexicographic order. However, at the end of the test we convert all of the (lexicographically sorted) buffers into big-endian numbers and expect them to be in numeric order. This was fine for buffers with equal byte lengths, but it doesn't apply to random byte lengths. Should this part just be removed from the test?
Big integer was only for sanity checking? So it can be ignored now, as long as we know that leveldb iteration order matches `Buffer.compare` order.
Rather than modifying the `sublevel lexicographic iteration` test, I've added a new test called `sublevel lexicographic buffer iteration` (since that test is testing iteration over a sublevel rather than over all levels). For the new test, I'm generating 100 random arrays of buffers, with each array having a random length of between 0-10 levels, and each level having a random length of 0-10 bytes (so the total level paths are between 0-100 bytes); a sketch of this generation follows below.

In order to turn these paths into keys (so that they can be compared for lexicographic order) I'm not using the `keyPathToKey` utility, since that encodes the key/level parts. Instead, I'm reimplementing that logic in the test and just omitting the encoding.
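A rough sketch of that generation, assuming `nodeCrypto` is imported as in the earlier sketch (the variable name is made up):

```ts
// 100 random key paths: 0-10 levels, each level being 0-10 random bytes
const randomKeyPaths = Array.from({ length: 100 }, () =>
  Array.from({ length: Math.floor(Math.random() * 11) }, () =>
    nodeCrypto.randomBytes(Math.floor(Math.random() * 11)),
  ),
);
```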
With that, this PR just needs to be squashed and rebased and it should be ready to merge into staging.
In reimplementing the logic, do you mean you're just taking each buffer part of the keypath and concatenating them together into a buffer without the null byte separators and without the encoding? That's one way to do it.
Then when testing iteration order, you only need to iterate from the root data level, and see that the keys being iterated matches the expected order of all the input buffers.
I'm adding the separators but not the encoding. This all happens after the keys have gone into the DB and been iterated over: the key paths are put into the DB as key paths, and then the concatenation and separators are applied to both the original key paths and the iterated key paths before sorting the originals and comparing the order (see the sketch below). The separators are needed because they affect the lexicographic order, and the iterated keys are in that order.
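A minimal sketch of that comparison, reusing the `randomKeyPaths` name from the earlier sketch; the separator placement here (a single `0x00` between parts) is illustrative only and may not match exactly what `keyPathToKey` produces:

```ts
const sep = Buffer.from([0x00]);

// Concatenate key path parts with the separator but without base 128 encoding,
// so that whole key paths can be ordered with Buffer.compare
const flattenKeyPath = (keyPath: Array<Buffer>): Buffer =>
  Buffer.concat(keyPath.flatMap((part, i) => (i === 0 ? [part] : [sep, part])));

// Expected iteration order: the flattened original key paths, sorted
const expected = randomKeyPaths.map(flattenKeyPath).sort(Buffer.compare);
```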
## Description

We need to allow all possible characters to exist in our keys in order to allow random IDs to be used as keys. This has been causing issues with parsing our key paths, since the separator character (`0x00`) could appear in the middle of a key. The solution is to base encode our keys to guarantee that the separator will not be present in the encoded version that goes into the DB. For now, we will stick with base 128 encoding since it is more efficient than base 64 and since we can allow non-printable characters. We may eventually upgrade to using base 255 for additional efficiency in the future.

## Issues Fixed

## Tasks

- Update the `parseKey` and `keyPathToKey` utilities to do encoding/decoding of path parts

## Final checklist