MatrixAI / js-db

Key-Value DB for TypeScript and JavaScript Applications
https://polykey.com
Apache License 2.0

Transparent Encryption/Decryption Layer between LevelDB and Filesystem "Block Manager" #5

Open CMCDragonkai opened 2 years ago

CMCDragonkai commented 2 years ago

Specification

Our current encryption/decryption layer sits on top of LevelDB. This causes problems for indexing #1, because when you want to index something you need to expose keys, and keys currently have to be unencrypted.

It may also increase performance of the DB if encryption/decryption operated at the block level rather than at the individual key-value level. It's the equivalent of using full-disk encryption and running leveldb on top.

We can't rely on OS-provided full-disk encryption, so we would need something that sits in between the key-value DB (like leveldb) and the actual filesystem, executed in JS or C++.

There is level-js, which is an abstract-leveldown compliant store that can be wrapped in levelup. It provides a leveldb-style store in pure JS, backed by IndexedDB. Currently IndexedDB doesn't exist natively on Node.js, but there are some implementations of it. This seems to give an opportunity to add a transparent encryption/decryption layer in between leveldb and IndexedDB.

Additional context

Tasks

  1. [ ] - Investigate how level-js uses IndexedDB
  2. [ ] - Attempt to implement or find a persistent IndexedDB, perhaps one implemented on top of leveldb or sqlite. It seems any performant implementation would have to use C++ at some point. There are also a bunch of wrapper libraries, but it's not clear which ones actually perform real persistence
  3. [ ] - Integrate this into PK

CMCDragonkai commented 2 years ago

An alternative is to work on leveldb directly and modify the C++ so that encryption/decryption can be plugged in.

CMCDragonkai commented 2 years ago

Along with IndexedDB and RocksDB, another option is lmdb, based on the discussion here: https://github.com/MatrixAI/js-db/commit/6107d9c3ac55767113034bcedd19b379a5181a1d#commitcomment-73269633

The lmdb-js project already supports native encryption at the block level, thus ensuring that both keys and values are encrypted.

CMCDragonkai commented 2 years ago

Since we are working at the C++ layer, this should mean we can finally attempt block-level encryption. I wonder if we can just bind to Node's openssl, since it's already there: https://github.com/nodejs/node-gyp/blob/master/docs/Linking-to-OpenSSL.md

Node's crypto and webcrypto APIs are likely built on top of the statically linked openssl. If we do the same, we would maintain parity with the crypto implementation, and it avoids bringing our own crypto library. Finally, if we do have to bring our own, we can do so with a native library rather than needing it to be implemented in raw JS or WebAssembly.

CMCDragonkai commented 2 years ago

Relevant PRs:

CMCDragonkai commented 1 year ago

When doing this it's worth considering the ability to do incremental key rotation.

This means if the key gets changed instead of re-encrypting EVERYTHING straight away, we can encrypt new values with the new key.

However the old key would have to be kept around to decrypt old values and can only be discarded once all old values are gone and have been re-encrypted.

We can do this in one of 2 ways:

  1. Background incremental re-encryption
  2. Pull-driven incremental re-encryption - that is re-encryption only occurs for values that have been read or written to.

One could build 1 on top of 2: a background system can just read every single block, and each read triggers the re-encryption. In the case of 2 on its own, it just means a reference count has to be kept around for each key.

However js-db doesn't keep around the key on disk. It is expected that one key is provided to the DB in-memory. The persistence of old keys will need to be hooked into through a ref counted system.

How do we identify blocks that are encrypted with a particular key... we may hash the key, and keep the hash around as a "key identifier". Then each block would have a key identifier.

Blocks would need to be large enough to justify keeping these key identifiers around. I imagine we may have something like 16 bit hashes or 8 bit hashes.
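
To get a feel for the overhead, here is a minimal sketch of a block that carries a 16-bit key identifier in its first 2 bytes. The 4096-byte block size, the layout, and every name in it (`BLOCK_SIZE`, `block_write_key_id`, and so on) are assumptions made up for illustration; nothing here is defined by js-db, LevelDB or RocksDB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizes: not taken from js-db, LevelDB or RocksDB. */
#define BLOCK_SIZE 4096u
#define KEY_ID_SIZE 2u /* 16-bit key identifier */

/* Store the key identifier big-endian in the first 2 bytes of the block. */
static void block_write_key_id(uint8_t *block, uint16_t key_id) {
    assert(block != NULL);
    block[0] = (uint8_t)(key_id >> 8);
    block[1] = (uint8_t)(key_id & 0xFFu);
}

/* Read the key identifier back so the matching key can be selected. */
static uint16_t block_read_key_id(const uint8_t *block) {
    assert(block != NULL);
    return (uint16_t)(((uint16_t)block[0] << 8) | block[1]);
}

/* Space taken by the identifier as a fraction of the whole block. */
static double key_id_overhead(void) {
    return (double)KEY_ID_SIZE / (double)BLOCK_SIZE;
}
```

With these assumed numbers the identifier costs 2 bytes out of 4096, roughly 0.05% of each block. An 8-bit identifier would halve that, at the cost of only 256 distinct values.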

Perhaps a counter could also work, but one would need to again remember some aspect of the key that is being used.

Perhaps the db can remember that there are X keys still being used. Imagine:

Key1 - 13 blocks
Key2 - 20 blocks
... etc

Then the user must provide all of those keys again. If they don't, then the initial integrity/canary check will fail.
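
The per-key block counts above could be tracked with something like the following. This is only a sketch under assumptions: the table lives in memory here (where it would be persisted is left open), key identifiers are 16-bit, and all names (`key_table`, `key_table_rekey_block`, `MAX_KEYS`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_KEYS 8u /* hypothetical limit on keys alive at once */

/* One entry per key that still has blocks encrypted under it. */
typedef struct {
    uint16_t key_id; /* short identifier stored in each block */
    uint32_t blocks; /* reference count: blocks using this key */
} key_entry;

typedef struct {
    key_entry entries[MAX_KEYS];
    size_t count;
} key_table;

/* Find the entry for key_id, or NULL if no block uses that key. */
static key_entry *key_table_find(key_table *t, uint16_t key_id) {
    for (size_t i = 0; i < t->count; i++) {
        if (t->entries[i].key_id == key_id) return &t->entries[i];
    }
    return NULL;
}

/* Record one more block under key_id. Returns 0 on success, -1 if full. */
static int key_table_add_block(key_table *t, uint16_t key_id) {
    assert(t != NULL);
    key_entry *e = key_table_find(t, key_id);
    if (e == NULL) {
        if (t->count == MAX_KEYS) return -1;
        e = &t->entries[t->count++];
        e->key_id = key_id;
        e->blocks = 0;
    }
    e->blocks++;
    return 0;
}

/* Drop one block from key_id and forget the key once its count hits zero.
 * Returns 1 if the key can now be discarded, 0 if it is still needed,
 * and -1 if the key is unknown. */
static int key_table_remove_block(key_table *t, uint16_t key_id) {
    assert(t != NULL);
    key_entry *e = key_table_find(t, key_id);
    if (e == NULL) return -1;
    if (--e->blocks > 0) return 0;
    *e = t->entries[--t->count]; /* swap-remove the now-empty entry */
    return 1;
}

/* Pull-driven rotation: a block read under old_id is rewritten under new_id. */
static int key_table_rekey_block(key_table *t, uint16_t old_id, uint16_t new_id) {
    if (key_table_find(t, old_id) == NULL) return -1;
    if (key_table_add_block(t, new_id) != 0) return -1;
    return key_table_remove_block(t, old_id);
}
```

A return value of 1 from `key_table_rekey_block` marks the point where the old key no longer protects any block and can be dropped. A background re-encryption pass would simply call it for every remaining block.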

CMCDragonkai commented 1 year ago

When integrating our new symmetric crypto routines from sodium-native into js-db, we need to consider how 2 native shared objects (native plugins) loaded into the same NodeJS process can work together.

I asked ChatGPT about this https://chat.openai.com/share/d09826e1-ebb0-4584-9e89-d379ac7363b8.

This will also be relevant to https://github.com/MatrixAI/Polykey/issues/526.

The key point is to avoid code duplication. We don't want to use the OpenSSL library inside NodeJS, because that OpenSSL is not likely to exist on other platforms, so we must supply our own crypto library, which currently is the libsodium provided by the sodium-native package (which we most likely need to fork into PK).

CMCDragonkai commented 1 year ago

Also see the discussion in https://github.com/MatrixAI/Polykey/issues/526#issuecomment-1637069019 for further elaboration on interactions between different shared object native libraries in the same NodeJS process.

CMCDragonkai commented 1 year ago

It seems then, that the right thing to do is to require peer dependencies, rather than direct dependencies.

That is, the DB could declare sodium-native as a peer dependency. This sort of implies that sodium-native is the host, and @matrixai/db is the plugin.

This requires the downstream project to have sodium-native as a dependency as well. It's a bit inconvenient, but it would make encryption a hard requirement.

It's a bit strange. Alternatively, if @matrixai/db were to declare sodium-native as a direct dependency, then it can still work without problems as long as the downstream packages don't bring in a different, incompatible version of sodium-native.

Given that @matrixai/db is already a native package, it's not really a big deal to add a dependency on another native package.

On top of this, one could argue that it's an optional dependency, because the DB doesn't actually need to have crypto switched on. Right now it's a dependency injection. However we still need to work out how exactly one would dependency inject into the RocksDB environment...? Especially since we would want to avoid having C++ code call JS then call C++, instead C++ should just call C++ directly. So I imagine this would have to be just a runtime boolean switch to turn it on/off.

And thus it would be hardcoded to the sodium-native crypto facilities. No dependency injection possible here. I think though, there is this concept of calling a common interface/header, and being able to substitute for a different library as long as it exposed the same symbols. I see some native projects saying that you can swap out their SSL for different openssl variants. So this must be possible too. Therefore this would be a libsodium based interface.
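
One way to express such a common interface in C is a small table of function pointers that the DB calls through, with a null table acting as the runtime on/off switch. The sketch below is an assumption about how this could look, not how @matrixai/db or RocksDB does it; the names `crypto_ops`, `db_crypto`, `db_seal_block` and `db_open_block` are hypothetical, and the XOR routine is only a placeholder standing in for a real cipher such as one from libsodium.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interface: any crypto library exposing these two functions
 * could be plugged in. Both return 0 on success and transform buf in place. */
typedef struct {
    int (*encrypt_block)(uint8_t *buf, size_t len, const uint8_t *key, size_t key_len);
    int (*decrypt_block)(uint8_t *buf, size_t len, const uint8_t *key, size_t key_len);
} crypto_ops;

/* DB-side state: ops == NULL acts as the runtime "encryption off" switch. */
typedef struct {
    const crypto_ops *ops;
    const uint8_t *key;
    size_t key_len;
} db_crypto;

/* Write path: C calling C directly, with no round trip through JS. */
static int db_seal_block(const db_crypto *c, uint8_t *buf, size_t len) {
    assert(c != NULL && buf != NULL);
    if (c->ops == NULL) return 0; /* encryption disabled: pass through */
    return c->ops->encrypt_block(buf, len, c->key, c->key_len);
}

/* Read path: the inverse of db_seal_block. */
static int db_open_block(const db_crypto *c, uint8_t *buf, size_t len) {
    assert(c != NULL && buf != NULL);
    if (c->ops == NULL) return 0;
    return c->ops->decrypt_block(buf, len, c->key, c->key_len);
}

/* PLACEHOLDER ONLY: XOR is not encryption. It stands in for a real cipher
 * so that the wiring above can be exercised without a crypto library. */
static int toy_xor(uint8_t *buf, size_t len, const uint8_t *key, size_t key_len) {
    if (key == NULL || key_len == 0) return -1;
    for (size_t i = 0; i < len; i++) buf[i] ^= key[i % key_len];
    return 0;
}

static const crypto_ops toy_ops = { toy_xor, toy_xor };
```

Swapping the crypto implementation then means supplying a different `crypto_ops` table, while the DB code that calls `db_seal_block` and `db_open_block` stays unchanged.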

CMCDragonkai commented 1 year ago

So upon further research, I see that it's possible to "dynamically" inject the function pointer into the C/C++ code. This is a different technique from just using the same headers and then linking against the shared object with -l at build time, because it relies on the dlsym function to resolve a symbol name to a particular function at runtime.

So imagine that in the C++ code, we wanted to have functions passed in that we would call to do the crypto operation. These would be C function pointers. How would we "pass" these in from JS?

Well you could do something like this:

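#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    void* handle = dlopen("mylib.so", RTLD_LAZY);  // load the shared object
    if (handle == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }
    // dlsym returns void*, so cast to the expected function pointer type
    void (*function_in_library)(void) =
        (void (*)(void))dlsym(handle, "function_in_library");
    if (function_in_library != NULL) {
        function_in_library();  // Indirect call through a function pointer
    }
    dlclose(handle);
    return 0;
}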

Suppose this was called by NodeJS:

    void* handle = dlopen("mylib.so", RTLD_LAZY);

I'm not sure if it is possible to access the void* handle returned by dlopen just by doing require() in JS (or, in the case of ESM, import()), but suppose you got that somehow: you would be able to move it around as an opaque reference pointer via the NAPI interface.

Then subsequently pass that into the C++ side of @matrixai/db.

Then the @matrixai/db could do:

    void (*function_in_library)() = dlsym(handle, "function_in_library");
    function_in_library();  // Indirect call through a function pointer

The handle is still managed by the caller though, in case it wants to call dlclose.

I'm not sure if this is a better method. This sort of allows @matrixai/db to be agnostic to the crypto implementation, and just require someone to pass in the right C function that supports a particular simple interface.

CMCDragonkai commented 1 year ago

There is a slight performance penalty when using the dlsym method. But it's actually quite flexible, because we defer the linking decision. Direct calls require you to use the -l option to link against the shared object at build time, so it has to be available at the point at which we are compiling @matrixai/db.

CMCDragonkai commented 1 year ago

Note that process.dlopen (https://nodejs.org/api/process.html#processdlopenmodule-filename-flags) is primarily about loading exported NAPI/Node-API functions. But if the shared object just has exposed symbols in general... those should be available to other shared objects, right? This requires some experimentation and comparison with ESM async imports/static imports.

CMCDragonkai commented 1 year ago

Oh actually it turns out once you switch to ESM, you cannot use require. But you can use process.dlopen.

CMCDragonkai commented 1 year ago

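Test with the different dlopen constants (flags such as RTLD_GLOBAL, which affect whether a library's symbols are visible to other shared objects): https://nodejs.org/api/os.html#dlopen-constants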