MatrixAI / js-db

Key-Value DB for TypeScript and JavaScript Applications
https://polykey.com
Apache License 2.0

Change from JSON encoding to using CBOR/BSON encoding #58

Open CMCDragonkai opened 1 year ago

CMCDragonkai commented 1 year ago

Specification

Currently the DB uses JSON encoding by default for storing structured data.

This encoding is lossy. Not all JS types can be represented in JSON, and binary data in particular becomes quite fat once encoded.

Sometimes we want to store structured data that may include binary data and other useful things like Dates.

Remember that undefined gets turned into null inside arrays, which can be surprising.
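To make the lossiness concrete, here is a minimal sketch using only built-in `JSON` functions. The helper name `jsonRoundTrip` is hypothetical and is not part of js-db; it only mimics what a JSON-encoded store does to a value on the way in and out.

```typescript
// Hypothetical helper: round-trip a value through JSON text,
// which is roughly what a JSON-encoded store does on put and get.
function jsonRoundTrip(value: unknown): unknown {
  return JSON.parse(JSON.stringify(value));
}

// `undefined` inside an array comes back as `null`.
const arrayResult = jsonRoundTrip([1, undefined, 3]);

// `undefined` as an object property is dropped entirely.
const objectResult = jsonRoundTrip({ a: 1, b: undefined });

// A Date comes back as an ISO 8601 string, not as a Date.
const dateResult = jsonRoundTrip(new Date(0));

// A Uint8Array comes back as a plain object keyed by index.
const bytesResult = jsonRoundTrip(new Uint8Array([1, 2, 3]));

// Binary data also grows: every byte becomes a quoted key plus a decimal number.
const bytes = new Uint8Array(1024).fill(255);
const jsonLength = JSON.stringify(bytes).length;

console.log(arrayResult, objectResult, dateResult, bytesResult);
console.log(`raw bytes: ${bytes.byteLength}, JSON characters: ${jsonLength}`);
```

Node's `Buffer` fares slightly differently because it defines `toJSON`, but it still comes back as a plain object wrapping a number array rather than as a `Buffer`.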

Consider checking out CBOR (which seems to be an evolution of MessagePack and BSON).

Additional context

Tasks

  1. Compare the encoding of buffers, typed arrays, dates, and undefined
  2. Compare the performance with JSON encoding
  3. Ensure that we get round-trip isomorphism: what goes in is what comes out, for random JS objects
  4. Ensure that CBOR supports additional JS "data types", and ultimately produces an ArrayBuffer that is accepted by the NAPI into rocksdb.
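
The round-trip requirement in task 3 could be checked with a small harness like the sketch below. The `Codec` interface, `deepEqual`, and `roundTrips` are hypothetical names introduced for illustration, and the JSON codec is only a baseline to show the check failing; a CBOR codec would be plugged in the same way. For random objects, the fixed samples would be replaced by generated values.

```typescript
// Hypothetical codec shape: any encoder/decoder pair can be plugged in.
interface Codec<E> {
  encode(value: unknown): E;
  decode(encoded: E): unknown;
}

function isPlainObject(value: unknown): value is Record<string, unknown> {
  if (typeof value !== 'object' || value === null) return false;
  const proto = Object.getPrototypeOf(value);
  return proto === Object.prototype || proto === null;
}

// Structural equality that is strict about Date, Uint8Array and undefined.
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (a instanceof Date && b instanceof Date) return a.getTime() === b.getTime();
  if (a instanceof Uint8Array && b instanceof Uint8Array) {
    return a.length === b.length && a.every((byte, i) => byte === b[i]);
  }
  if (Array.isArray(a) && Array.isArray(b)) {
    return a.length === b.length && a.every((item, i) => deepEqual(item, b[i]));
  }
  if (isPlainObject(a) && isPlainObject(b)) {
    const keysA = Object.keys(a);
    const keysB = Object.keys(b);
    return (
      keysA.length === keysB.length &&
      keysA.every(
        (k) => Object.prototype.hasOwnProperty.call(b, k) && deepEqual(a[k], b[k]),
      )
    );
  }
  return false;
}

// True when decoding the encoded value gives back an equal value.
function roundTrips<E>(codec: Codec<E>, value: unknown): boolean {
  return deepEqual(codec.decode(codec.encode(value)), value);
}

// Baseline codec: the current JSON behaviour.
const jsonCodec: Codec<string> = {
  encode: (value) => JSON.stringify(value),
  decode: (encoded) => JSON.parse(encoded),
};

const samples: Array<[string, unknown]> = [
  ['plain JSON', { a: 1, b: ['x', null, true] }],
  ['undefined in array', [1, undefined]],
  ['date', { when: new Date(0) }],
  ['bytes', { id: new Uint8Array([1, 2, 3]) }],
];

for (const [label, value] of samples) {
  console.log(label, roundTrips(jsonCodec, value));
}
```

With the JSON baseline only the first sample survives the round trip; a replacement encoding would be expected to pass all four.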

CMCDragonkai commented 1 year ago

Currently we are using things like:

And more, just to represent the type that actually comes out of the DB after we submit JSON, because of how buffers are encoded. This adds quite a bit of unnecessary noise.

If the DB could support binary data, and support the binary types commonly used in JS such as Buffer and Uint8Array, it would be easier to avoid needing these types. It would also be possible to discard the raw option entirely, since data would be stored efficiently no matter what.

Non-native JS types like Buffer could be something that is explicitly supported by this DB, since it already uses Buffer a lot.

Other JS types that could be supported include things like Set and Map... but that's unnecessary atm.

CMCDragonkai commented 1 year ago

This type would be particularly important:

/**
 * Strict JSON values.
 * These are the only types that JSON can represent.
 * All input values are encoded into JSON.
 * Take note that `undefined` values are not allowed.
 * `JSON.stringify` automatically converts `undefined` to `null` in arrays.
 */
type JSONValue =
  { [key: string]: JSONValue } |
  Array<JSONValue> |
  string |
  number |
  boolean |
  null;

These types all need to be supported, and other kinds of values can be added to the list.

We could then create a DBValue type indicating all the types that are supported to be stored in the DB.
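
A sketch of what such a `DBValue` type might look like follows. The exact set of extra members (`undefined`, `Uint8Array`, `Date`) is an assumption drawn from the types discussed in this issue, not a decided design, and `isDBValue` is a hypothetical runtime guard added only for illustration.

```typescript
// Assumed shape: everything JSONValue allows, plus the extra types this issue
// mentions. Node's Buffer is a subclass of Uint8Array, so it is covered too.
type DBValue =
  | { [key: string]: DBValue }
  | Array<DBValue>
  | string
  | number
  | boolean
  | null
  | undefined
  | Uint8Array
  | Date;

// Hypothetical runtime guard mirroring the type above.
function isDBValue(value: unknown): value is DBValue {
  if (value === null || value === undefined) return true;
  const kind = typeof value;
  if (kind === 'string' || kind === 'number' || kind === 'boolean') return true;
  if (kind !== 'object') return false; // functions, symbols, bigints
  if (value instanceof Uint8Array || value instanceof Date) return true;
  if (Array.isArray(value)) return value.every((item) => isDBValue(item));
  // Only plain objects are accepted; Map, Set and class instances are rejected.
  const proto = Object.getPrototypeOf(value);
  if (proto !== Object.prototype && proto !== null) return false;
  const record = value as Record<string, unknown>;
  return Object.keys(record).every((key) => isDBValue(record[key]));
}

const record: DBValue = {
  id: new Uint8Array([1, 2, 3]),
  createdAt: new Date(0),
  tags: ['a', undefined, null],
};

console.log(isDBValue(record)); // accepted
console.log(isDBValue(new Map())); // rejected in this sketch
```

Set and Map are left out here only because the thread calls them unnecessary for now; they could be added to the union later.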

CMCDragonkai commented 1 year ago

This coincides with #3.

CMCDragonkai commented 1 year ago

Protobuf btw is not suitable for this. The encoding must be schemaless. Other choices include MessagePack too.

CMCDragonkai commented 1 year ago

BSON is old school and not suitable.

Protobuf requires a schema.

CBOR seems the best, but I think the libraries are sort of unmaintained.

This seems suitable: https://github.com/kriszyp/cbor-x

CMCDragonkai commented 1 year ago

This might be a breaking change in relation to #3. However, it will make js-db far more user friendly and reduce the number of encoding/decoding steps in Polykey, especially as we store a lot of binary data, like IDs, in js-db. All of those encoding/decoding procedures could then be entirely eliminated as CBOR takes over.

CMCDragonkai commented 12 months ago

The CBOR library could also be shared with PK when it needs binary streaming for mixed messages or chunked processing.