Integrate Automatic Indexing into DB

CMCDragonkai commented 2 years ago

Specification

Indexing is the act of creating a separate data structure that makes it faster to lookup data on a primary data structure.

PK has several domains that require secondary indexing:

acl - perm id, vault id, node id
gestalts - discovery, adjacency list indexing
nodes - possibly needed in the future for all sorts of optimal routing needs (kadmelia indexing?)
notifications - to be able to filter notifications
sigchain - to be able to filter sigchain
vaults - for vault tagging and vault names

Right now secondary indexing is implemented manually in ACL by persisting bidirectional maps in leveldb.

Indexing is tricky topic, and it's better to create a single indexing mechanism in DB so that all domains in PK can benefit from them.

Generic indexing can reply on LevelDB's eventemitter interface. The good thing is that other people have already created good libraries for this:

https://github.com/hypermodules/level-auto-index - is compatible with subleveldown, and uses "hooks" to abstract the leveldb eventemitter interface
https://github.com/hypermodules/level-idx - this uses level-auto-index

The second library automates some of level-auto-index. In fact it binds secondary index interfaces to the leveldb object.

For @matrixai/db it makes more sense to use level-auto-index so we can more tightly control how the indexes are stored. Most likely via sublevel interfaces.

Additional context

https://github.com/MatrixAI/js-polykey/issues/188 - further details on applying indexing to sigchain operations
http://www.tiernok.com/posts/adding-index-for-a-key-value-store/

Tasks

[x] - Investigate possible dependencies to automate secondary indexing and their relationship
[x] - Prototype the usage of level-auto-index
[ ] - Integrate the level-auto-index into this library's domain/level creation, each level including the root level should have the ability of attaching a secondary index (or multiple secondary indexes)

CMCDragonkai commented 2 years ago

Was thinking that this secondary index is by default unique.

However non-unique keys can be done by making use of the sequence range.

CMCDragonkai commented 2 years ago

The level-idx automates level-auto-index by expecting the usage of a single sublevel as the index database. That sublevel is then further split into sublevels for each index that is created.

So for a structure like this:

ID -> {
  name: string;
  title: string;
}

The ID is the primary index, thus we don't need to index that. The name and title would be 2 separate secondary indexes. These would each occupy a sublevel under an index sublevel.

If we want to apply this to each sublevel we already have with the this.level(domain: string, dbLevel: DBLevel): Promise<DBLevel>, we may have an index sublevel for each sublevel we create, or a global index sublevel for all possible indexes.

The name and title become names of the index sublevels.

If we imagine there are:

aLevel

ID -> {
  name: string;
  content: string;
}

bLevel

ID -> {
  name: string;
  signature: string;
}

You can see that the name index would be conflicting. If we wanted to index the name for aLevel and bLevel, we would need to separate them out.

We could do this by making use of our domain array.

CMCDragonkai commented 2 years ago

Every sublevel created from DB.level can have an associated indexes sublevel. It can be automatically named by taking the domain string and appending domain.index. The . separator has to be different from the prefix separator in order to not confuse it with another sublevel. Perhaps maybe domain-index to avoid any confusion since I often use . to represent separators too.

This indexes sublevel is a sublevel containing all the index sublevels for respective primary db.

Not every sublevel requires an index though, so this can lazily initialised.

On the otherhand, creating a sublevel doesn't actually do anything to the database unless the sublevel is actually used. It can simplify things right now if it is just constructed.

The DB currently does not maintain any of the sublevel references. These are just given to the the user to make use of. See that DB.level returns a DBLevel. It is the user and all the PK domains that currently maintain this. If we keep this same pattern, then the PK domains would have to keep around the indexes. It would the be only way for users to be able to query with respect to an index.

Having the associated indexes sublevel is one thing, it is another to keep around the index itself. The index sublevel is in fact an implementation detail. It would be ideal for the indexes sublevel to not need to be maintained by the user. The user wants to use the index sublevel instead.

So here's a possible API:

const mgrDomain: DBDomain = ['Manager'];
const dirsDomain: DBDomain = [mgrDomain[0], 'dirs'];

const mgrDb = await db.level(mgrDomain[0]);
const dirsDb = await db.level(dirsDomain[1], mgrDb);

// this is is still a Db, but now used as an index
const mgrDbIndex = await db.index(mgrDb, 'indexname', ['name, 'title']);

// remember we can use the db themselves, but we don't benefit from automatic serialisation/deserialisation and encrypt/decrypt

// so if we want to "get" but according to an index, we now have to refer to the index domain
const indexDomain = ['Manager', 'indexname'];

// note that when you are creating a compound index, you now have to use a separator to indicate this
await db.get(indexDomain, 'somename!sometitle');

Remember you don't actually need to create a sublevel to use get, put... etc. The domain parameter basically ends up creating sublevels on the fly.

This means when you first create an index, and then start get and put it should work fine as the docs indicate that it automatically tracks these operations. However what happens if you already have data in the db, and you create the index afterwards? We can enforce the idea that the index should be created ahead of time, but it would be ideal that AutoIndex has a way of reconstructing the index from scratch if there was already data in the primary sublevel.

Another issue is the fact that the DB is now always using buffer encoding. So that means the index dbs should also be always using buffer encoding for values. But keys themselves depend.

A further problem is whether the indexes are going through the encryption/decryption? If they are not using the DB.get/put mechanisms, then the index values are not encrypted.

CMCDragonkai commented 2 years ago

Ok so there's a problem. Because the values are encrypted before storing into the DB. The hooks which relies on the event triggers only have access to the encrypted ciphertext. Thus the usage of level-auto-index is not possible because our encryption/decryption sits on top of the leveldb. And when the autoindex is listening on events, they are getting encrypted values.

Ok so if we want to do this, we would need to synthesise and extract the implementation into our own DB here rather than just using the upstream libraries.

CMCDragonkai commented 2 years ago

One issue with encrypting indexes, is that the index sublevel key is based on the primary sublevel value.

So if the primary sublevel name value is "sam".

Then would this mean that the index sublevel key is "sam"? If so, this would then be leaking the value itself.

If we would to encrypt the index keys this ends up with some problems...

Encryption would have to be deterministic. You are given sam and you must always get back the same ciphertext.
Sort order of the index keys no longer make sense.

Here's an alternative. An order-preserving hash. If the index keys can go through an order-preserving hash, then it should still work nicely preserving order, and you can deterministically hash any value you want to look up and then find the "encrypted" ID and decrypt it. It may even be possible to just hash the ID instead of encrypting it. The values of the index sublevels are just keys of the primary sublevels, and keys of primary sublevels are not encrypted anyway.

It makes sense to use order preserving hash on index keys because:

You don't want to leak the plaintext value of the primary level
You want to preserve the order of the value, so that range keys could be used, and iterating over the index key should be the same as if iterating over the value
Can be applied to compound indexes, by concatenating 2 or more order preserving hashes

Does it make sense to encrypt/decrypt the index values? - No (keep them plaintext)

The index values are just the keys of the primary level, the keys of the primary level are not encrypted
Therefore.. you should not need to encrypt the index values, or hash the index values either

CMCDragonkai commented 2 years ago

What kind of indexes do we want to support?

Unique value indexes, I'm not sure how the index works when multiple entries have the same value, does it figure out that the same entry can point to multiple keys?
Range indexes - need an iterator that counts it
Compound indexes - requires separator...

This gets pretty complicated. The most likely usage right now is just unique value keys, and possibly when it is not unique. If we end up with too many complicated indexes, we might as well change to using Sqlite instead.

Going to prototype this with level-auto-index to see how it deals non-unique value indexes.

CMCDragonkai commented 2 years ago

Order preserving hashing is not useful because you can recover the original value. https://crypto.stackexchange.com/questions/8160/secure-order-preserving-hash-function

I'm not sure if this applies to perfect minimal hashes. But it seems there is some leak.

There are new algorithms for order preserving encryption but no libraries in JS and too costly to implement for this feature.

However if we were to not use indexing, we would probably have to create bidirectional maps and end up not encrypting the key anyway. Therefore having an unencrypted index isn't any worse.

The main thing then is that anything we index is going to be leaked and thus indexing should be limited to values we are willing to leak like ID values. Things that are not meant to be leaked should not be indexed then.

So for now we can go with unencrypted & unhashed index keys. But index values are going to be encrypted as normal. A big fat warning is required to ensure Devs know that any indexed values are going to be plaintext.

The best solution would where the leveldb block system itself had encryption. This would mean all keys and values are encrypted and encryption is happening at the lower level. However this is not feasible to change to atm.

CMCDragonkai commented 2 years ago

It appears there's a relationship between proper transaction support, block level encryption and secure indexing.

https://github.com/MatrixAI/js-encryptedfs/issues/49

In EFS we have an issue for integrating native snapshots into the transaction system. This requires going into the C++ binding of leveldb and making it work. Then our transactions can use leveldb snapshots natively. And in such a case our transaction could scale too.

Block level encryption means encryption at the blocks of the DB rather than at record level. RocksDB has this capability https://github.com/facebook/rocksdb/pull/2424, but would require changing to use rocksdb which is possible because it fits the levelup API, but any encryption mechanism would be done at the C++ level, and not involve webcrypto APIs or primitives.

Indexing as in this issue is currently not entirely secure or is very limited in order to not leak values. It would work better if block encyption was available. Then we would not need to work out complicated OPE.

All 3 problems can be resolved by using sqlite3 and the sqlcipher extension assuming that it does block level encryption. However cross platform requirements need analysis. This also impacts the common crypto mechanisms used by PK.

Whether it is leveldb, rocksdb or sqlite, any robust situation will require calling into C++ and manipulating the C++ or C code. Although I wonder if the sqlitejs ports can use wasm and whether sqlcipher can be put through wasm.

CMCDragonkai commented 2 years ago

For now we will stick with simple indexing, and the current transaction support is sufficient.

When the combination of 3 problems becomes worthwhile, then we should investigate solutions for this DB to have encryption to be at block level, and then having proper transactions and proper indexing support. The first path would likely to be investigating the usage of SQLCipher. It's not an extension, it's a fork. It would need to be usable on PK CLI and PK GUI, and for Android & iOS it looks like it would need to be compiled for those platforms. This would be large epic, as it would involve multiple subissues to resolve. The encryption work would also need to involve consideration of WASM https://github.com/MatrixAI/js-polykey/issues/155 and QUIC https://github.com/MatrixAI/js-polykey/issues/234.

Basically for our next phase of work, there's going to be alot of C/C++ involved.

@joshuakarp

CMCDragonkai commented 2 years ago

Since https://github.com/MatrixAI/js-id/issues/1 is done, this means we now have the ability to use IdDeterministic for doing any kind of hashing for hash based indexing.

CMCDragonkai commented 2 years ago

2 ways we can implement this.

1. Embedding into LevelDB Extension Points (hooks, events, abstract-leveldown wrappers): By trying to bring encryption into the leveldb via hooking into its event system, and thus getting a good idea of how we can apply indexing just before the encryption.

2. Building on top of DB here: By doing it above the existing encryption in our DB class.

The former requires diving into the leveldb internals, the latter might result in some code duplication and a recreation of an event system. The former doubles down on leveldb investment, the latter may be lighterweight and enable us to move to sqlite3 or other systems in the future.

Additionally, in order to represent an index that can point to multiple keys, a multi-value index that is, we can make use of sublevels again. Basically every time we index a value, that value doesn't directly point to the primary key, but instead to another sublevel of primary keys. The values of these can be arbitrary or contain metadata relating to each record. But one can now use the order of the primary key so you can get the "latest" value for a given index.

CMCDragonkai commented 2 years ago

Compound indexing is useful for our Discovery queue to represent priorities. https://github.com/MatrixAI/js-polykey/issues/329#issuecomment-1040944042

CMCDragonkai commented 2 years ago

2 ways we can implement this.

1. Embedding into LevelDB Extension Points (hooks, events, abstract-leveldown wrappers): By trying to bring encryption into the leveldb via hooking into its event system, and thus getting a good idea of how we can apply indexing just before the encryption.

2. Building on top of DB here: By doing it above the existing encryption in our DB class.

The former requires diving into the leveldb internals, the latter might result in some code duplication and a recreation of an event system. The former doubles down on leveldb investment, the latter may be lighterweight and enable us to move to sqlite3 or other systems in the future.

Additionally, in order to represent an index that can point to multiple keys, a multi-value index that is, we can make use of sublevels again. Basically every time we index a value, that value doesn't directly point to the primary key, but instead to another sublevel of primary keys. The values of these can be arbitrary or contain metadata relating to each record. But one can now use the order of the primary key so you can get the "latest" value for a given index.

Way 1. cannot be done unless #4 and #5 are solved first. As these are long and complicated problems, we'll park these until later, it will involve ultimately looking into the C++ codebase of leveldb and making changes there or at the level nodejs wrapper code.

Way 2. will be what we have to do to proceed for now. To be as foolproof as possible, we should wrap our sublevels as DB type as well.

CMCDragonkai commented 2 years ago

In https://github.com/MatrixAI/js-polykey/issues/244 we are storing each node entry in buckets. The buckets are ordered due to the lexi-int encoding of the bucket index, however the node ids themselves require to be indexed by the lastUpdated timestamp. This is to allow us to do fetches and streams in the order of last updated and to avoid having to sort the bucket every time we acquire it.

CMCDragonkai commented 2 years ago

With the merge of #10, this can now use a special root level called index to store all relevant index level data to avoid conflicting with normal data usage.

CMCDragonkai commented 2 years ago

Automatic indexing is made easier with snapshot isolation, because we can expect consistent reads, meaning while traversing an index, we can look up what the index points to easily.

CMCDragonkai commented 2 years ago

Since we're on rocksdb now, we can make use of some prior art for secondary indexing:

CMCDragonkai commented 1 year ago

There's too many ways we use indexing for it to be automated unless we first have a structured schema. So won't need this now since we have already done all the manual indexing work.

MatrixAI / js-db