Closed CMCDragonkai closed 1 year ago
I was thinking that this secondary index is unique by default. However, non-unique keys can be supported by making use of the sequence range.
The `level-auto-index` library automates `level-idx` by expecting the usage of a single sublevel as the index database. That sublevel is then further split into sublevels for each index that is created.
So for a structure like this:
```
ID -> {
  name: string;
  title: string;
}
```
The ID
is the primary index, thus we don't need to index that. The name
and title
would be 2 separate secondary indexes. These would each occupy a sublevel under an index sublevel.
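As a rough in-memory sketch of that layout (plain `Map`s standing in for the primary sublevel and the two index sublevels; these names are illustrative, not the actual `DB` API):

```typescript
// In-memory sketch: Maps stand in for the primary sublevel and the two
// index sublevels that would live under a shared indexes sublevel.
type Entry = { name: string; title: string };

const primary = new Map<string, Entry>();     // ID -> record
const nameIndex = new Map<string, string>();  // name -> ID
const titleIndex = new Map<string, string>(); // title -> ID

function put(id: string, entry: Entry): void {
  primary.set(id, entry);
  // Keep the secondary indexes in sync with the primary sublevel
  nameIndex.set(entry.name, id);
  titleIndex.set(entry.title, id);
}

function getByName(name: string): Entry | undefined {
  const id = nameIndex.get(name);
  return id === undefined ? undefined : primary.get(id);
}

put('id1', { name: 'sam', title: 'engineer' });
```

A secondary lookup is then a two-step read: index sublevel first, primary sublevel second.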
If we want to apply this to each sublevel we already have with `this.level(domain: string, dbLevel: DBLevel): Promise<DBLevel>`, we may have an index sublevel for each sublevel we create, or a global index sublevel for all possible indexes.
The `name` and `title` become the names of the index sublevels.
If we imagine there are:

```
aLevel
  ID -> {
    name: string;
    content: string;
  }

bLevel
  ID -> {
    name: string;
    signature: string;
  }
```
You can see that the `name` index would be conflicting. If we wanted to index the `name` for `aLevel` and `bLevel`, we would need to separate them out.
We could do this by making use of our `domain` array.
Every sublevel created from `DB.level` can have an associated indexes sublevel. It can be automatically named by taking the domain string and appending it to form `domain.index`. The `.` separator has to be different from the prefix separator so that it is not confused with another sublevel. Perhaps `domain-index` would avoid confusion, since I often use `.` to represent separators too.
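A minimal sketch of this naming convention (the `-index` suffix and the helpers here are hypothetical, not a finalised API):

```typescript
// Hypothetical helper: derive the indexes sublevel name from a domain
// string using `-index`, since `.` could be confused with other separators.
function indexDomain(domain: string): string {
  return `${domain}-index`;
}

// Each named index then becomes a sublevel under the indexes sublevel
function indexLevel(domain: string, indexName: string): [string, string] {
  return [indexDomain(domain), indexName];
}
```

So the `name` index for the `Manager` domain would live under `Manager-index/name`, keeping it separate from ordinary sublevels of `Manager`.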
This indexes sublevel is a sublevel containing all the index sublevels for the respective primary db. Not every sublevel requires an index though, so this can be lazily initialised. On the other hand, creating a sublevel doesn't actually do anything to the database unless the sublevel is actually used, so it can simplify things right now if it is just constructed eagerly.
The `DB` currently does not maintain any of the sublevel references. These are just given to the user to make use of. See that `DB.level` returns a `DBLevel`. It is the user and all the PK domains that currently maintain this. If we keep this same pattern, then the PK domains would have to keep around the indexes. It would then be the only way for users to be able to query with respect to an index.
Having the associated indexes sublevel is one thing; it is another to keep around the index itself. The indexes sublevel is in fact an implementation detail. It would be ideal for the indexes sublevel to not need to be maintained by the user; the user wants to use the index sublevel instead.
So here's a possible API:

```ts
const mgrDomain: DBDomain = ['Manager'];
const dirsDomain: DBDomain = [mgrDomain[0], 'dirs'];
const mgrDb = await db.level(mgrDomain[0]);
const dirsDb = await db.level(dirsDomain[1], mgrDb);
// this is still a DBLevel, but now used as an index
const mgrDbIndex = await db.index(mgrDb, 'indexname', ['name', 'title']);
// remember we can use the levels themselves, but then we don't benefit from
// automatic serialisation/deserialisation and encryption/decryption
// so if we want to "get" according to an index, we now have to refer to the
// index domain
const indexDomain = ['Manager', 'indexname'];
// note that when you are creating a compound index, you now have to use a
// separator to indicate this
await db.get(indexDomain, 'somename!sometitle');
```
Remember you don't actually need to create a sublevel to use `get`, `put`, etc. The `domain` parameter basically ends up creating sublevels on the fly.
This means when you first create an index and then start to `get` and `put`, it should work fine, as the docs indicate that it automatically tracks these operations. However, what happens if you already have data in the db and you create the index afterwards? We can enforce the idea that the index should be created ahead of time, but it would be ideal if `AutoIndex` had a way of reconstructing the index from scratch when there is already data in the primary sublevel.
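Such a reconstruction is conceptually just a full scan of the primary sublevel. A sketch (the `rebuildIndex` helper is hypothetical, not an existing `AutoIndex` API; `Map`s stand in for sublevels):

```typescript
// Rebuild a secondary index from scratch by scanning the primary sublevel.
// This covers data that existed before the index was created, which
// event-based tracking alone cannot see.
function rebuildIndex<V>(
  primary: Map<string, V>,
  keyOf: (value: V) => string,
): Map<string, string> {
  const index = new Map<string, string>();
  for (const [id, value] of primary) {
    index.set(keyOf(value), id);
  }
  return index;
}

// Data that existed before the index was created
const primary = new Map<string, { name: string }>([
  ['id1', { name: 'alice' }],
  ['id2', { name: 'bob' }],
]);
const nameIndex = rebuildIndex(primary, (v) => v.name);
```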
Another issue is the fact that the DB is now always using buffer encoding. That means the index dbs should also always use buffer encoding for values, but the keys themselves depend on the indexed data. A further problem is whether the indexes go through encryption/decryption: if they are not using the `DB.get`/`DB.put` mechanisms, then the index values are not encrypted.
Ok so there's a problem. Because the values are encrypted before storing into the DB, the hooks which rely on the event triggers only have access to the encrypted ciphertext. Thus the usage of `level-auto-index` is not possible, because our encryption/decryption sits on top of the leveldb, and when the autoindex is listening on events, it is getting encrypted values.
Ok so if we want to do this, we would need to synthesise and extract the implementation into our own `DB` here, rather than just using the upstream libraries.
One issue with encrypting indexes is that the index sublevel key is based on the primary sublevel value. So if the primary sublevel's `name` value is "sam", would this mean that the index sublevel key is "sam"? If so, this would be leaking the value itself.
If we were to encrypt the index keys, this ends up with some problems: the encryption would have to be deterministic, so that encrypting `sam` always gets back the same ciphertext, and such ciphertexts would no longer preserve the ordering of the plaintext keys. Here's an alternative: an order-preserving hash. If the index keys can go through an order-preserving hash, then it should still work nicely, preserving order, and you can deterministically hash any value you want to look up and then find the "encrypted" ID and decrypt it. It may even be possible to just hash the ID instead of encrypting it. The values of the index sublevels are just keys of the primary sublevels, and keys of primary sublevels are not encrypted anyway.
It makes sense to use an order-preserving hash on the index keys. Does it make sense to encrypt/decrypt the index values? No: keep them plaintext, since they are just primary keys.
What kind of indexes do we want to support?
This gets pretty complicated. The most likely usage right now is just unique value keys, and possibly non-unique ones. If we end up with too many complicated indexes, we might as well change to using Sqlite instead.
Going to prototype this with `level-auto-index` to see how it deals with non-unique value indexes.
Order preserving hashing is not useful because you can recover the original value. https://crypto.stackexchange.com/questions/8160/secure-order-preserving-hash-function
I'm not sure if this applies to perfect minimal hashes. But it seems there is some leak.
There are new algorithms for order preserving encryption but no libraries in JS and too costly to implement for this feature.
However, if we were not to use indexing, we would probably have to create bidirectional maps and end up not encrypting the key anyway. Therefore having an unencrypted index isn't any worse.
The main thing then is that anything we index is going to be leaked and thus indexing should be limited to values we are willing to leak like ID values. Things that are not meant to be leaked should not be indexed then.
So for now we can go with unencrypted & unhashed index keys, but index values are going to be encrypted as normal. A big fat warning is required to ensure devs know that any indexed values are going to be plaintext.
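A toy illustration of this trade-off (a trivial XOR stands in for the DB's real encryption; this is not crypto, just a demonstration of which side stays plaintext):

```typescript
// Index keys stay PLAINTEXT; stored values still go through the DB's
// normal encryption path. XOR is a stand-in for real encryption.
const SECRET = 0x5a;
const enc = (s: string): number[] =>
  [...Buffer.from(s)].map((b) => b ^ SECRET);
const dec = (bytes: number[]): string =>
  Buffer.from(bytes.map((b) => b ^ SECRET)).toString();

const primary = new Map<string, number[]>(); // ID -> encrypted value
const nameIndex = new Map<string, string>(); // plaintext name -> ID (leaked!)

function put(id: string, name: string, value: string): void {
  primary.set(id, enc(value)); // value encrypted as normal
  nameIndex.set(name, id);     // WARNING: the indexed name is stored plaintext
}

function getByName(name: string): string | undefined {
  const id = nameIndex.get(name);
  return id === undefined ? undefined : dec(primary.get(id)!);
}

put('id1', 'sam', 'secret-content');
```

Anything that appears as an index key here is readable by anyone with access to the db files, which is exactly why only leak-tolerant values like IDs should be indexed.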
The best solution would be where the leveldb block system itself had encryption. This would mean all keys and values are encrypted, with encryption happening at a lower level. However this is not feasible to change to atm.
It appears there's a relationship between proper transaction support, block level encryption and secure indexing.
In EFS we have an issue for integrating native snapshots into the transaction system. This requires going into the C++ binding of leveldb and making it work. Then our transactions can use leveldb snapshots natively. And in such a case our transaction could scale too.
Block level encryption means encryption at the blocks of the DB rather than at record level. RocksDB has this capability https://github.com/facebook/rocksdb/pull/2424, but would require changing to use rocksdb which is possible because it fits the levelup API, but any encryption mechanism would be done at the C++ level, and not involve webcrypto APIs or primitives.
Indexing as described in this issue is currently not entirely secure, or is very limited in order to not leak values. It would work better if block encryption were available. Then we would not need to work out complicated OPE.
All 3 problems can be resolved by using sqlite3 and the sqlcipher extension assuming that it does block level encryption. However cross platform requirements need analysis. This also impacts the common crypto mechanisms used by PK.
Whether it is leveldb, rocksdb or sqlite, any robust situation will require calling into C++ and manipulating the C++ or C code. Although I wonder if the sqlitejs ports can use wasm and whether sqlcipher can be put through wasm.
For now we will stick with simple indexing, and the current transaction support is sufficient.
When the combination of these 3 problems becomes worthwhile, then we should investigate solutions for this DB to have encryption at block level, along with proper transactions and proper indexing support. The first path would likely be investigating the usage of SQLCipher. It's not an extension, it's a fork. It would need to be usable on PK CLI and PK GUI, and for Android & iOS it looks like it would need to be compiled for those platforms. This would be a large epic, as it would involve multiple subissues to resolve. The encryption work would also need to involve consideration of WASM https://github.com/MatrixAI/js-polykey/issues/155 and QUIC https://github.com/MatrixAI/js-polykey/issues/234.
Basically for our next phase of work, there's going to be a lot of C/C++ involved.
@joshuakarp
Since https://github.com/MatrixAI/js-id/issues/1 is done, this means we now have the ability to use `IdDeterministic` for doing any kind of hashing for hash-based indexing.
There are 2 ways we can implement this:

1. Embedding into LevelDB extension points (hooks, events, abstract-leveldown wrappers): by trying to bring encryption into the leveldb via hooking into its event system, and thus getting a good idea of how we can apply indexing just before the encryption.
2. Building on top of `DB` here: by doing it above the existing encryption in our `DB` class.

The former requires diving into the leveldb internals; the latter might result in some code duplication and a recreation of an event system. The former doubles down on the leveldb investment; the latter may be more lightweight and enable us to move to sqlite3 or other systems in the future.
Additionally, in order to represent an index that can point to multiple keys, a multi-value index that is, we can make use of sublevels again. Basically every time we index a value, that value doesn't directly point to the primary key, but instead to another sublevel of primary keys. The values of these can be arbitrary or contain metadata relating to each record. But one can now use the order of the primary key so you can get the "latest" value for a given index.
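A sketch of that multi-value scheme (`Map`s standing in for sublevels; the helper names are illustrative):

```typescript
// Multi-value index: each indexed value points to a sublevel of primary
// keys rather than a single key. The inner value slot could instead hold
// per-record metadata.
const index = new Map<string, Map<string, null>>();

function indexPut(value: string, primaryKey: string): void {
  let bucket = index.get(value);
  if (bucket === undefined) {
    bucket = new Map();
    index.set(value, bucket);
  }
  bucket.set(primaryKey, null);
}

function latestFor(value: string): string | undefined {
  const bucket = index.get(value);
  if (bucket === undefined) return undefined;
  // With sortable primary keys, the "latest" record for this indexed
  // value is simply the greatest key in its sublevel
  return [...bucket.keys()].sort().pop();
}

indexPut('alice', 'id1');
indexPut('alice', 'id3');
indexPut('alice', 'id2');
```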
Compound indexing is useful for our Discovery queue to represent priorities. https://github.com/MatrixAI/js-polykey/issues/329#issuecomment-1040944042
Way 1. cannot be done unless #4 and #5 are solved first. As these are long and complicated problems, we'll park them until later; it will ultimately involve looking into the C++ codebase of leveldb and making changes there or in the nodejs wrapper code.
Way 2. is what we have to do to proceed for now. To be as foolproof as possible, we should wrap our sublevels as the `DB` type as well.
In https://github.com/MatrixAI/js-polykey/issues/244 we are storing each node entry in buckets. The buckets are ordered due to the lexi-int encoding of the bucket index; however, the node ids themselves need to be indexed by the `lastUpdated` timestamp. This allows us to do fetches and streams in the order of last updated, and avoids having to sort the bucket every time we acquire it.
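One way to get this ordering for free is a compound index key of fixed-width timestamp plus node id, so lexicographic key order matches update order. A sketch (padding width and `!` separator are illustrative choices, not the actual encoding):

```typescript
// Compound index key: zero-padded lastUpdated timestamp + nodeId, so
// lexicographic ordering of keys equals numeric ordering of timestamps.
function lastUpdatedKey(lastUpdated: number, nodeId: string): string {
  return `${lastUpdated.toString().padStart(15, '0')}!${nodeId}`;
}

const bucketIndex = new Map<string, string>(); // compound key -> nodeId

bucketIndex.set(lastUpdatedKey(200, 'nodeB'), 'nodeB');
bucketIndex.set(lastUpdatedKey(100, 'nodeA'), 'nodeA');
bucketIndex.set(lastUpdatedKey(1500, 'nodeC'), 'nodeC');

// Iterating keys in sorted order yields nodes from least to most recently
// updated, with no per-read sorting of the bucket
const ordered = [...bucketIndex.keys()].sort().map((k) => bucketIndex.get(k));
```

Without the zero padding, `"1500"` would sort before `"200"` as a string, which is why a fixed-width (or lexi-int) encoding of the timestamp is required.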
With the merge of #10, this can now use a special root level called `index` to store all relevant index level data, to avoid conflicting with normal data usage.
Automatic indexing is made easier with snapshot isolation, because we can expect consistent reads, meaning while traversing an index, we can look up what the index points to easily.
Since we're on rocksdb now, we can make use of some prior art for secondary indexing:
There are too many ways we use indexing for it to be automated unless we first have a structured schema. So we won't need this now, since we have already done all the manual indexing work.
Specification
Indexing is the act of creating a separate data structure that makes it faster to lookup data on a primary data structure.
PK has several domains that require secondary indexing:
Right now secondary indexing is implemented manually in ACL by persisting bidirectional maps in leveldb.
Indexing is a tricky topic, and it's better to create a single indexing mechanism in DB so that all domains in PK can benefit from it.
Generic indexing can rely on LevelDB's EventEmitter interface. The good thing is that other people have already created good libraries for this:
The second library automates some of `level-auto-index`. In fact it binds secondary index interfaces to the leveldb object. For `@matrixai/db` it makes more sense to use `level-auto-index`, so we can more tightly control how the indexes are stored, most likely via sublevel interfaces.

Additional context
Tasks

- [ ] Integrate `level-auto-index` into this library's domain/level creation; each level, including the root level, should have the ability to attach a secondary index (or multiple secondary indexes)