MatrixAI / js-db

Key-Value DB for TypeScript and JavaScript Applications
https://polykey.com
Apache License 2.0

WIP: Integrating Automatic Indexing into DB #2

Closed CMCDragonkai closed 1 year ago

CMCDragonkai commented 2 years ago

Description

Adding automatic indexing. Inspired by level-auto-index, but this has to be adapted to DB due to our encryption layer and level-specific needs.

Because indexing is so complicated, we are not going to integrate all forms of indexing. If we need those, we would be better off swapping to sqlite3, which is a much bigger change.

So for now the only type of indexing is:

The way this works is that it takes a property of the value, which has to be a POJO. This property must be "hashable", meaning it must be convertible to a string or primitive value; I imagine something that has the toString and valueOf methods. There's already a ToString interface that could be used. Alternatively one could instead ask for the ability to JSON-encode whatever the property is and hash that. Note that we would have to use a canonical JSON encoding to ensure the result is deterministic.

Once we have this stringified value, we can proceed to hash it. We can use a cryptographically strong hash; in fact the reason to use a cryptographically secure hash is simply to prevent preimage attacks. If we aren't interested in preventing preimage attacks, then hashing is just an unnecessary step and we could put the value in as a plaintext key. Therefore if we want to protect the value, it makes sense to hash it.

To prevent any chance of collision, we would have to ensure that the value we are hashing is smaller than the hash size; of course collision is very unlikely anyway. So we should just use SHA256 or SHA512 (note that SHA512 is faster than SHA256 on 64-bit platforms but uses double the space, so one could use SHA-512/256, but why bother for now).
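
For example, a minimal sketch of this derivation, assuming the `canonicalize` package as the canonical (RFC 8785) JSON encoder; the function name is illustrative:

```ts
import { createHash } from 'crypto';
// assumed canonical JSON encoder; any deterministic encoder would do
import canonicalize from 'canonicalize';

// Derives a fixed-size, preimage-resistant index key from a value's property
function indexKey(property: unknown): Buffer {
  const serialized = canonicalize(property);
  if (serialized == null) {
    throw new TypeError('property is not JSON-encodable');
  }
  return createHash('sha256').update(serialized).digest();
}
```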

Note that until Order Preserving Encryption becomes easier to use, such indexes lose their order. So they can only be used as a "hashtable" style index; there's no order to the index keys.

If users want order, they will need to bypass the hashing and manually create their own index by creating their own sublevels.

Having a unique value index (hashtable style) should be sufficient for a lot of use cases.

One problem is dealing with non-unique values, for example vault tags: each vault can have multiple tags, and a tag can point to multiple primary ids. In that case we should at the very least generalise our index values to contain multiple ids.
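
A minimal in-memory sketch of such a multi-value index (names are illustrative; in DB the map would live in its own sublevel):

```ts
import { createHash } from 'crypto';

// index: hashed tag -> primary ids of all vaults carrying that tag
const tagIndex = new Map<string, Array<string>>();

function indexTag(tag: string, vaultId: string): void {
  const key = createHash('sha256').update(tag).digest('base64');
  const ids = tagIndex.get(key) ?? [];
  ids.push(vaultId);
  tagIndex.set(key, ids);
}
```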

The level-auto-index also supports compound indexes by concatenation. We will avoid this for now. So this means what will be missing are:

If we find ourselves really needing any of the above, it is better to use sqlite3. And for encryption, it would be nice if we could put sqlite3 or leveldb on top of a block-mapping system, with encryption applied to the block mapping rather than to the sqlite3/leveldb keys & values. I think this should be possible, since others have implemented sqlite3 on top of IndexedDB. https://nolanlawson.com/2021/08/22/speeding-up-indexeddb-reads-and-writes/

So an ideal future situation might be: sqlite3/leveldb on top of an encrypted block DB (IndexedDB), so that encryption happens at a lower level. This would eliminate all the encryption problems and allow both keys and values to be encrypted. But this will be challenging for cross-platform scenarios. #5

Issues Fixed

Tasks

  1. ~[ ] - Integrate hooks/events into the get/put/batch~ - concluded that hooks and eventemitter are not sufficient
  2. ~[ ] - Prototype embedding indexing and encryption into the abstract-leveldown layers of the leveldb ecosystem~ - not doing this because it would be an inefficient stop-gap
  3. [ ] - Prototype indexing on top of DB as an alternative outside of the leveldb ecosystem
  4. [ ] - Create indexing API
    • ensure that multi-value indexes can be used, not all indexes are unique (try with subleveldown keys)
      • vault names are unique
      • node ids in claims are not; there are multiple claims with the same node id
  5. [ ] - ...

Final checklist

CMCDragonkai commented 2 years ago

One use case from sigchain is to find the latest claim for a given node id. So you want to do an O(1) lookup by the node id, which gets you a list of claim ids or claim values, and then you want to find the latest claim from that list.

If you use the autoindexing to automatically resolve the claim ids, this means you are fetching a lot of claim values. But if we only want the latest, this isn't efficient. The claim id itself is ordered using IdSortable, so it's possible to sort these ids.

Additionally, the index itself can preserve the order on write. That's probably better and will be easier.
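
A sketch of that read path, assuming the claim ids for a given node id live in their own sublevel (`claimIdsDb` is an assumed LevelUp instance):

```ts
import type { LevelUp } from 'levelup';

// IdSortable claim ids sort chronologically, so the latest claim id
// is one reverse iteration away; no claim values need to be scanned
async function latestClaimId(claimIdsDb: LevelUp): Promise<Buffer | undefined> {
  const stream = claimIdsDb.createKeyStream({ reverse: true, limit: 1 });
  for await (const key of stream as AsyncIterable<Buffer>) {
    return key;
  }
  return undefined;
}
```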

So I need to prototype these choices.

Additionally, for the vault use case, it's about knowing whether a vault name is unique or has been used. So it's a uniqueness constraint.

CMCDragonkai commented 2 years ago

Realised I forgot to bring in @types/level and @types/subleveldown to get us the types we are expecting from level. These were originally in js-polykey, but now belong here.

We also need an explicit @types/abstract-leveldown, since we are explicitly using the abstract-leveldown dependency.

     "devDependencies": {
  +    "@types/abstract-leveldown": "^5.0.2",
       "@types/jest": "^26.0.20",
  +    "@types/level": "^6.0.0",
       "@types/node": "^14.14.35",
       "@types/node-forge": "^0.10.4",
  +    "@types/subleveldown": "^4.1.1",

Those dependencies may not be needed in PK unless they are explicitly used outside of js-db @joshuakarp.

CMCDragonkai commented 2 years ago

Some things I recently realised:


LevelDB interface:

```ts
interface LevelDB<K = any, V = any> extends LevelUp<EncodingDown<K, V>> {
  errors: typeof errors;
}
```

LevelUp interface:

```ts
export interface LevelUp<
  DB = AbstractLevelDOWN,
  Iterator = AbstractIterator<any, any>,
> extends EventEmitter {
  open(): Promise<void>;
  open(callback?: ErrorCallback): void;
  close(): Promise<void>;
  close(callback?: ErrorCallback): void;
  put: InferDBPut<DB>;
  get: InferDBGet<DB>;
  del: InferDBDel<DB>;
  clear: InferDBClear<DB>;
  batch(array: AbstractBatch[], options?: any): Promise<void>;
  batch(array: AbstractBatch[], options: any, callback: (err?: any) => any): void;
  batch(array: AbstractBatch[], callback: (err?: any) => any): void;
  batch(): LevelUpChain;
  iterator(options?: AbstractIteratorOptions): Iterator;
  isOpen(): boolean;
  isClosed(): boolean;
  createReadStream(options?: AbstractIteratorOptions): NodeJS.ReadableStream;
  createKeyStream(options?: AbstractIteratorOptions): NodeJS.ReadableStream;
  createValueStream(options?: AbstractIteratorOptions): NodeJS.ReadableStream;
  /* emitted when a new value is 'put' */
  on(event: 'put', cb: (key: any, value: any) => void): this;
  /* emitted when a value is deleted */
  on(event: 'del', cb: (key: any) => void): this;
  /* emitted when a batch operation has executed */
  on(event: 'batch', cb: (ary: any[]) => void): this;
  /* emitted when clear is called */
  on(event: 'clear', cb: (opts: any) => void): this;
  /* emitted on given event */
  on(event: 'open' | 'ready' | 'closed' | 'opening' | 'closing', cb: () => void): this;
}
```

AbstractLevelDOWN interface:

```ts
export interface AbstractLevelDOWN<K = any, V = any> extends AbstractOptions {
  open(cb: ErrorCallback): void;
  open(options: AbstractOpenOptions, cb: ErrorCallback): void;
  close(cb: ErrorCallback): void;
  get(key: K, cb: ErrorValueCallback<V>): void;
  get(key: K, options: AbstractGetOptions, cb: ErrorValueCallback<V>): void;
  put(key: K, value: V, cb: ErrorCallback): void;
  put(key: K, value: V, options: AbstractOptions, cb: ErrorCallback): void;
  del(key: K, cb: ErrorCallback): void;
  del(key: K, options: AbstractOptions, cb: ErrorCallback): void;
  batch(): AbstractChainedBatch<K, V>;
  batch(array: ReadonlyArray<AbstractBatch<K, V>>, cb: ErrorCallback): AbstractChainedBatch<K, V>;
  batch(
    array: ReadonlyArray<AbstractBatch<K, V>>,
    options: AbstractOptions,
    cb: ErrorCallback,
  ): AbstractChainedBatch<K, V>;
  iterator(options?: AbstractIteratorOptions<K>): AbstractIterator<K, V>;
}
```

Seems like if we want to hook into the events, we would at the very least need the LevelUp interface, as that's what exposes the on handlers and extends the EventEmitter interface.

CMCDragonkai commented 2 years ago

The level-hookdown only supports put, del, and batch hooks, but not get hooks. From this alone, it wouldn't be sufficient to implement encryption/decryption using the hooks.

Need to test whether the hooks are allowed to mutate the operation records; if that's possible, maybe one could do encryption with a fork.

These are the types I worked out for level-hookdown, since it doesn't have a types package:

```ts
import type { AbstractOptions } from 'abstract-leveldown';
import type { LevelDB } from 'level';

type Callback<P extends Array<any> = [], R = any, E extends Error = Error> = {
  (e: E, ...params: Partial<P>): R;
  (e?: null | undefined, ...params: P): R;
};

type HookOp<K = any, V = any> =
  | {
      type: 'put';
      key: K;
      value: V;
      opts: AbstractOptions;
    }
  | {
      type: 'del';
      key: K;
      opts: AbstractOptions;
    }
  | {
      type: 'batch';
      array: Array<HookOp<K, V>>;
      opts: AbstractOptions;
    };

interface LevelDBHooked<K = any, V = any> extends LevelDB<K, V> {
  prehooks: Array<(op: HookOp<K, V>, cb: Callback) => void>;
  posthooks: Array<(op: HookOp<K, V>, cb: Callback) => void>;
}
```

CMCDragonkai commented 2 years ago

The hook system seems similar to providing a generic way of layering on leveldb. But the awesome list https://github.com/Level/awesome#layers indicates that the preferred way of extending leveldb is via:

> Modules that implement abstract-leveldown to wrap another abstract-leveldown. This is the preferred extension point.

If that's the case, then perhaps it's better to just go with an abstract-leveldown wrapper instead; then we don't have to re-invent the wheel around the eventemitter in order to hook in indexing then encryption on writes, and decryption on reads.


The hook system is incapable of mutating the operations. Therefore the conclusion is that the hookdown is not useful for doing any kind of encryption/decryption.

```ts
const hookdb = hookdown(db) as LevelDBHooked;

const prehook1 = (op: HookOp, cb: Callback) => {
  console.log('pre1', op);
  if (op.type === 'put') {
    // attempt to mutate the operation record in the pre-hook
    op.key = Buffer.from('changed');
    op.value = Buffer.from('changed');
  }
  cb();
};

hookdb.prehooks.push(prehook1);

await db.put(Buffer.from('beep'), Buffer.from('boop'));

console.log(await db.get('changed')); // NotFoundError: Key not found in database [changed]
```

But we already knew this because of the lack of a get hook. In fact such a hook system could be replaced by levelup's eventemitter if you didn't need any pre-hooks. Note however that event-based post-hoc index updates would not be concurrent-consistent, whereas hookdown's hooks are, due to the use of cb. Therefore there is no utility in re-implementing such a hook system with the event emitter interface; a more robust solution would be an abstract-leveldown layer wrapper.
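
For illustration, event-based index maintenance would look something like this (a sketch assuming JSON values with a `name` property to index; note the index write trails the data write):

```ts
import level from 'level';

const db = level('./db', { valueEncoding: 'json' });
const indexDb = level('./index-db');

// 'put' fires only after the write has completed, so a concurrent
// reader can observe the value before its index entry exists
db.on('put', (key: string, value: { name: string }) => {
  void indexDb.put(value.name, key);
});
```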

Therefore if we wanted to reuse level-auto-index or level-idx, we would have to move our encryption layer lower, embedding it within the leveldb layers below any usage of the index hooks. Moving this would be consistent with the overall theme of moving encryption/decryption lower in our DB stack as discussed in #5. However it would still be above leveldown/leveldb rather than between leveldb and the filesystem, so there's still a limitation on indexing flexibility.

CMCDragonkai commented 2 years ago

The level-hookdown doesn't use levelup's event system at all. It only expects an abstract-leveldown compliant store and then wraps the leveldown methods to enable hooks.

I believe the eventemitter of levelup only emits events post-action. See https://github.com/Level/levelup#events — the events only fire after the action has occurred. So the event emitter isn't sufficient for maintaining the ability to do any kind of encryption.

The hookdown doesn't follow the same structure as other leveldown layers: https://github.com/Level/awesome#layers.

See an example of this here: https://github.com/adorsys/encrypt-down/blob/master/src/index.js and https://github.com/Level/encoding-down/blob/master/index.js. They all make use of abstract-leveldown as the driving system for extending leveldown systems.

One thing to note is that we are passing the LevelDB instance in, but the return type of these wrappers may not actually be LevelDB. For example in subleveldown, the return type of the sub function is LevelUp, while many other layers return an extended AbstractLevelDOWN, as in https://github.com/DefinitelyTyped/DefinitelyTyped/blob/master/types/encoding-down/index.d.ts#L23 with:

```ts
interface EncodingDown<K = any, V = any> extends AbstractLevelDOWN<K, V> {
  // ...
}
```

Subleveldown can do this because LevelUp is already a leveldown instance too, so the types are all compatible. And subleveldown is a lot more complicated, since it actually does things like use https://github.com/vweevers/reachdown to acquire the lower-level db.

CMCDragonkai commented 2 years ago

Going down the rabbit hole of 1. Embedding into LevelDB Extension Points (hooks, events, abstract-leveldown wrappers)

For indexing, levelup's eventemitter interface is enough, just like using hookdown. However both are insufficient for encryption (because you need to do something pre-get for decryption, and pre-write for encryption); encryption would require an abstract-leveldown wrapper like encoding-down or encrypt-down.

If one were to use hookdown or the eventemitter for indexing, this would only work if encryption were moved to something like encrypt-down, because these sit on top of all the layers (although I'm not sure about this for hookdown). It seems layer ordering is important here, and one should really be using reachdown to do encryption/decryption.

To simplify things, suppose we ignore the eventemitter and hookdown systems...

It is also possible that indexing and encryption/decryption are both implemented as abstract-leveldown layers. This would have strict ordering requirements, with indexing wrapping an encrypted layer. If the encrypted layer isn't using reachdown, then it is essential that the db instance uses a binary encoding-down, which is what we are already doing.

If encryption/decryption were to use reachdown, then it may sit below encoding-down, which would mean an order like this:

indexing layer -> encoding layer -> encryption layer -> leveldown

The encryption layer would only work against strings or buffers, as this is the expectation of leveldown, and would rely on the encoding layer to encode anything else. Would indexing then work prior to encoding, so that it has access to the raw types, while also using encoding-down when storing the indexes?
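
A composition sketch of that ordering (encrypt-down usage as per the adorsys/encrypt-down README; the jwk is a placeholder and the indexing layer is hypothetical, since no off-the-shelf indexdown exists):

```ts
import levelup from 'levelup';
import leveldown from 'leveldown';
import encodingdown from 'encoding-down';
import encryptdown from '@adorsys/encrypt-down';

const jwk = { kty: 'oct', k: '...' }; // placeholder symmetric JSON Web Key

// encryption sits just above leveldown and sees only strings/buffers
const encrypted = encryptdown(leveldown('./db'), { jwk });
// encoding translates richer types down to what encryption expects
const encoded = encodingdown(encrypted, { valueEncoding: 'json' });
// const indexed = indexdown(encoded); // hypothetical indexing layer on top
const db = levelup(encoded);
```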

Anyway we need to do some more prototyping with the existing autoindex.

  1. More prototyping with indexing plus using subleveldown for multiple values
  2. Prototype with an abstract-leveldown layer

Reference these for prototyping:

CMCDragonkai commented 2 years ago

I'm leaning towards 2. Building on top of DB instead.

This is because with 1. Embedding into LevelDB Extension Points (hooks, events, abstract-leveldown wrappers), we would have to move the encryption layer, and the existing indexing libraries may not be sufficient for our use case, necessitating a derivative implementation of indexing anyway. This is 3 portions of work:

  1. a derivative of encrypt-down that has to work with webworkers (in the way we have implemented it here)
  2. a derivative of indexing and hooks to support the kind of indexing we want
  3. fixing up js-db while being cognizant of #3

This would be fine except that at the end of the day we still didn't solve #5.

The amount of work above would be shared with doing 2. Building on top of DB, in that:

  1. a derivative of indexing and hooks to support the kind of indexing we want, but on top of the existing DB
  2. fixing up js-db while being cognizant of #3

And it avoids the wasted work of "a derivative of encrypt-down that has to work with webworkers (in the way we have implemented it here)", which would be replaced by #5 anyway.

It may also be faster, because we understand our own code better than how the abstract-leveldown prototype chain works. And doubling down on leveldb peculiarities may not be worthwhile if we later want to use sqlite3 or rocksdb.

Going down the 2. route, there are additional reasons why a derivative indexing implementation (from level-idx, level-auto-index) is required:

  1. using the index requires properties that TS claims don't exist
  2. it does not support promises
  3. it does not work with valueEncoding: 'binary', because buffers cannot be interrogated for indexing
  4. it does not work with our existing encryption, as encrypted data cannot be interrogated for indexing (the original reason)
  5. it uses an older version of subleveldown, thus not even sharing the same implementation of subleveldown (at least for level-idx; level-auto-index doesn't use sub at all)
  6. it doesn't support non-unique indexes; it just expects that each index returns you the relevant value
CMCDragonkai commented 2 years ago

With respect to creating dynamic sublevels: the prefix names must not contain !, as this conflicts with its usage as the separator. This would result in an error like `RangeError: Prefix must sort after 34` from subleveldown. While the prefixes cannot contain !, the keys may contain !.

So far none of our static and dynamic sublevel creations use !, but for autoindexing, if the value we are indexing contains a ! and is used to create a sublevel, then this can cause a problem. If the values we are indexing directly become keys instead of sublevel prefixes, this is not a problem. But we want to use sublevel prefixes in order to gain an ordering of non-unique indexes.

Additionally, in order to represent an index that can point to multiple keys (a multi-value index), we can make use of sublevels again. Basically every time we index a value, that value doesn't directly point to the primary key, but instead to another sublevel of primary keys. The values of these entries can be arbitrary or contain metadata relating to each record. One can then use the order of the primary keys to get the "latest" value for a given index.
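
A sketch of this structure with subleveldown (names are illustrative; the indexed value must not contain !):

```ts
import levelup from 'levelup';
import leveldown from 'leveldown';
import sub from 'subleveldown';

const db = levelup(leveldown('./db'));
const indexDb = sub(db, 'index');

// each indexed value becomes a sublevel prefix whose keys are primary keys
async function indexAdd(value: string, primaryKey: string): Promise<void> {
  const entries = sub(indexDb, value);
  // the entry's value is arbitrary; it could hold per-record metadata
  await entries.put(primaryKey, 'null');
}
```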

level-idx uses an older version of subleveldown which didn't complain about this, so it was possible to create sublevel indexes like Ti!!tle with no error. But this might have produced silent problems, since that version was not preventing the usage of ! in prefix names.

I wonder if it's possible, given that the keys are all binary encoded, to use '\x00' as the separator; this should sort ahead of everything according to subleveldown. Our values would have to disallow the usage of '\x00', which would be a lot easier than disallowing !.

A foolproof way would be base-encoding the values being indexed, to ensure that ! won't appear in a prefix name (depending on the base encoding); neither base64 nor base58btc has ! in its alphabet. Base encoding would also eliminate any '\x00'. This would be done anyway when we intend to hash the values, but it should be explicitly marked as necessary for indexing even if you don't hash.

If base-encoding is needed regardless of hashing, then the chosen base algorithm should preserve lexicographic order; otherwise the ordering would be corrupted, and the whole reason not to hash was to preserve lexicographic order. Base64 is not lexicographic-preserving, which is why this library exists: https://github.com/deanlandolt/base64-lex. Not sure about base58btc, and maybe there's an encoding among the multibase codecs that does preserve lexicographic order? I believe encoding to hex would be lexicographic-preserving, because that's how UUIDs work too, and hex is just 0-9 A-F.
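
A quick demonstration of why base64 breaks ordering while hex preserves it:

```ts
// base64's alphabet is not in ASCII order, so encodings don't sort like bytes
Buffer.from([0xff]).toString('base64'); // '/w==' ('/' sorts before 'A')
Buffer.from([0x00]).toString('base64'); // 'AA=='
// '/w==' < 'AA==' as strings even though 0xff > 0x00 as bytes

// hex's alphabet (0-9, a-f) is in ASCII order, so byte order is preserved
Buffer.from([0xff]).toString('hex'); // 'ff'
Buffer.from([0x00]).toString('hex'); // '00', and '00' < 'ff'
```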

I've asked which ones are lexicographic order in https://github.com/multiformats/js-multiformats/issues/124

We can make a simple decision like this:

This led to an issue in js-id: https://github.com/MatrixAI/js-id/issues/7. Fixing the bugs in that issue resulted in experiments demonstrating that base58btc preserved lexicographic order. ~So it should be safe to use base58btc as the base encoding in case we are going to use it as a prefix name in subleveldown.~


Further testing indicates that base58btc does not preserve lexicographic order.

The list of encodings that do preserve order is:

base2 - no error
base8 - no error
base16 - no error
base16upper - no error
base32hex - no error
base32hexupper - no error
base32hexpad - no error
base32hexpadupper - no error

Furthermore, JS binary strings also preserved order.

We should update the js-id tests.

This means we will need to use base32hex when we are base-encoding for subleveldown prefixes and not intending to hash.

base32hex is also defined by spec (RFC 4648) to preserve sort order.

We won't need padding because we aren't concatenating encoded ids here, and there's structure around them.

Here is the empirical test demonstrating this:

```ts
import type { Codec } from 'multiformats/bases/base';
import crypto from 'crypto';
import { bases } from 'multiformats/basics';

function randomBytes(size: number): Uint8Array {
  return crypto.randomBytes(size);
}

type MultibaseFormats = keyof typeof bases;

const basesByPrefix: Record<string, Codec<string, string>> = {};
for (const k in bases) {
  const codec = bases[k];
  basesByPrefix[codec.prefix] = codec;
}

function toMultibase(id: Uint8Array, format: MultibaseFormats): string {
  const codec = bases[format];
  return codec.encode(id);
}

function fromMultibase(idString: string): Uint8Array | undefined {
  const prefix = idString[0];
  const codec = basesByPrefix[prefix];
  if (codec == null) {
    return;
  }
  const buffer = codec.decode(idString);
  return buffer;
}

const originalList: Array<Uint8Array> = [];
const total = 100000;
let count = total;
while (count) {
  originalList.push(randomBytes(36));
  count--;
}
originalList.sort(Buffer.compare);

const encodedList = originalList.map((bs) => toMultibase(bs, 'base58btc'));
const encodedList_ = encodedList.slice();
encodedList_.sort();

// encodedList is the same order as originalList;
// if base58btc preserves lexicographic-order
// then encodedList_ would be the same order
for (let i = 0; i < total; i++) {
  if (encodedList[i] !== encodedList_[i]) {
    console.log('Does not match on:', i);
    console.log('original order', encodedList[i]);
    console.log('encoded order', encodedList_[i]);
    break;
  }
}

const decodedList = encodedList.map(fromMultibase);
for (let i = 0; i < total; i++) {
  // @ts-ignore
  if (!originalList[i].equals(Buffer.from(decodedList[i]))) {
    console.log('bug in the code');
    break;
  }
}
```
CMCDragonkai commented 1 year ago

Closing this as won't fix, because our usage of PK has meant we actually index the DB in a number of ways not conducive to automatic indexing at this point in time. In fact we would need a structured schema first before any kind of automatic indexing can take place.