Adding encryption capabilities to incremental-indexeddb-adapter.js

acdcjunior commented 3 years ago

Hey, there! I've started using WatermelonDB (in the browser) a couple of months ago and this project I'm in requires the data to be encrypted before storing in IndexedDB.

After a quick research, I found out LokiJS had an encryption-capable adapter. The joy of finding it was short-lived as I also discovered the adapter watermelon actually uses is the incremental-indexeddb-adapter.js@Nozbe/LokiJS which includes some quite important tweaks.

So, cutting to the chase, I have implemented the encryption option in the incremental-indexeddb-adapter.js.

I'm opening this issue here (Nozbe/LokiJS has its issues disabled and this repo seemed like a better fit than the original LokiJS) to actually ask if you were interested in a PR containing the changes I made (which were meant to be backward compatible). If so, perhaps you'd want to quickly discuss some key points about the code. For instance: what are the target browser versions? The use of var made me believe the code was meant for old browsers, but then I found some includes() which has a poor old-browser support. Also If more refactorings or cleanups could be done or else.

Or we can also just leave this "new" adapter alone and keep it as a community alternative (if someone else eventually has the same requirements and wish to try it out)! All good either way!

radex commented 3 years ago

Hey! That sounds very interesting! Indeed, there is a Nozbe/LokiJS repo - if you look closely, you'll notice that actually all of my improvements have been upstreamed to main LokiJS repo, but we still refer to Nozbe/LokiJS, because of some >>>magic<<< in the build system to reduce the size of the dependency.

Anyway, I don't think you maintaining a separate adapter makes sense - the original LokiJS IndexedDB adapter is very much inferior to the improved one, and tweaks keep happening ~twice a year.

I'm not sure I'm a fan of adding encryption capabilities directly into IncrementalIndexedDBAdapter - because of references to fs, crypto, etc.

I think the best way of doing this is to add a hook to IncrementalIndexedDBAdapter that user can configure to add encryption/decryption step. And it would be nice if there was an example encryption/decryption implementation upstreamed in WatermelonDB, so it's easy for others to adopt that. In fact, maybe these: https://github.com/Nozbe/LokiJS/blob/5f3312e424e01d23e8e1af1be318f2c610df5a89/src/incremental-indexeddb-adapter.js#L43-L48 hooks are already good enough. Please let me know!

acdcjunior commented 3 years ago

I do know what you mean about deps and very much agree with it! As a matter of fact, the approach I implemented actually adds two new options functions, much like de/serializeChunk, but that take and return strings instead (also this made testing way easier, because we can take the encryption algorithm "variable" out of the way). So, by all means, the adapter has no concrete dependency to any specific crypto implementation.

I did try to use the de/serializeChunk before attempting any changes to the code at all, that would make things much easier, but unfortunately I couldn't manage to make it work using only them. IIRC, the main reasons were:

encryption should cypher everything, not just the data, which includes from IDB objectStore name to collections and keys (or even the chunk or metadata terms)
when using serializeChunk, JSON.stringify() is still called in the returned value (which already is a string) and though this may be a micro-optimization, considering encryption itself adds a lot of overhead, avoiding these extra JSON steps becomes important

Another aspect is I tried my best to make the code have the smallest possible overhead (e.g. just some ifs here and there) when encryption is disabled, in comparison with the previous version.

Anyway, let me know what's on your mind now. I still have some suggestions about tests and the build step, but this depends on if you actually are interested in that or on how you want to make it happen 👍

radex commented 3 years ago

encryption should cypher everything, not just the data, which includes from IDB objectStore name to collections and keys (or even the chunk or metadata terms)

I'm not sure if this is workable. Well, maybe is it. I guess you'll just have to encrypt key names when doing IDB.put(), and decrypt them during getAll. But I agree this suggests the need to add a different processing step unrelated to serializeChunk.

JSON.stringify() is still called in the returned value (which already is a string)

I don't get that part… you need to stringify structures before passing them off to crypto anyway, no?

acdcjunior commented 3 years ago

I'm not sure if this is workable. Well, maybe is it.

It is! I got it working 99% 😅 Not 100% because I'm still working out an issue that happens during the concurrent writers test. Other than that, it passes all the tests. I also added some of my own using karma/jasmine.

I guess you'll just have to encrypt key names when doing IDB.put(), and decrypt them during getAll.

Yes, something along those lines. But there's more to it: you can't encrypt the .<chunkId> parts of the chunk keys, and you have to make sure you encrypt the string <collectionName>.chunk to always the same value (which is not how typical algos work, since they usually "salt" the encryption process, meaning encryption of a given string will return a different ciphertext every time -- all of which are decrypted to the same original plaintext). These two restrictions happen because the "mega chunks" part use IDB's key ranges (and maybe for other reasons I can't remember atm).

I don't get that part… you need to stringify structures before passing them off to crypto anyway, no?

The thing is that JSON.stringify() is called on the value returned by the encodeChunk. So if your encodeChunk function returns itself a string, this stringification (and the subsequent JSON.parse()) is not really necessary, would you agree?

radex commented 3 years ago

These two restrictions happen because the "mega chunks" part use IDB's key ranges (and maybe for other reasons I can't remember atm).

makes sense. You could always disable the mega chunks optimization, but it is a nice bump in performance, and encryption will surely be costly. You'll have to measure it...

The thing is that JSON.stringify() is called on the value returned by the encodeChunk. So if your encodeChunk function returns itself a string, this stringification (and the subsequent JSON.parse()) is not really necessary, would you agree?

sorry, I'm still confused… it seems natural to me that encryption would happen after stringification, but that it would require a string form already. can you point to exact LOC and explain what the issue is?

acdcjunior commented 3 years ago

encryption will surely be costly. You'll have to measure it...

EDIT: Yeah. From my preliminary measurements, using crypto-js's Rabbit algorithm, IDB storage consumption increased by ~650% [9mb to 68mb after that test file that does fuzz tests and whatnot] and load/save times went up by around ~800%. So, err, definitely not negligible. Naturally, these % depend a lot on the algorithm used co encrypt.

it seems natural to me that encryption would happen after stringification, but that it would require a string form already. can you point to exact LOC and explain what the issue is?

Yes, certainly. In this scenario of using serializeChunk to encrypt, my function would first have to stringify the args (because they are passed as objects) but the issue I mean is because the returned value from my serializeChunk function will be a ciphertext string, the stringification of the serialized chunk (and later parsing) is not necessary and has some (although minor) cost. Makes sense?

_{EDIT[2]: I just realize I have been calling the existing option functions encode/decodeChunk instead of de/serializeChunk! Just fixed it! apologies!}

acdcjunior commented 3 years ago

Ah, another thing, I don't want to create a subthread here, but it seems the serializeChunk/deserializeChunk functions are actually broken. When I use either:

{
    serializeChunk(x) { return x },
    deserializeChunk(x) { return x },
}

or

{
    serializeChunk(x) { return JSON.stringify(x) },
    deserializeChunk(x) { return JSON.parse(x) },
}

as options arg to the adapter, the tests at incrementalidb.html don't pass anymore. Shouldn't they?

radex commented 3 years ago

Yeah. From my preliminary measurements, using crypto-js's Rabbit algorithm, IDB storage consumption increased by ~650% [9mb to 68mb after that test file that does fuzz tests and whatnot] and load/save times went up by around ~800%. So, err, definitely not negligible. Naturally, these % depend a lot on the algorithm used co encrypt.

Oof, that's pretty bad. 8x load/save times may be acceptable for small-ish database, but it's not great.

the 6.5x increase in storage size surprises me. I'm not an expert on encryption, but I believe you should be able to find ciphers that spit out about the same number of bytes as what they ingest, no?

Perhaps this particular algorithm has a large one-time overhead (in both storage and processing time)? If so, you could measure what would happen if you tuned chunk size to something bigger.

Yes, certainly. In this scenario of using serializeChunk to encrypt, my function would first have to stringify the args (because they are passed as objects) but the issue I mean is because the returned value from my serializeChunk function will be a ciphertext string, the stringification of the serialized chunk (and later parsing) is not necessary and has some (although minor) cost. Makes sense?

Ah, of course, we're on the same page. You used a different name for the serialize/desierialize chunk, and that confused me. I think we're in agreement that encrypt/decrypt hook should be separate from serialize/desrialize chunk - happening AFTER stringification.

as options arg to the adapter, the tests at incrementalidb.html don't pass anymore. Shouldn't they?

Ugh. Well, as you can tell, the incrementalidb.html test environment is pretty lame. I'm quite sure serializeChunk works correctly (it's used in production and doesn't explode), but it's possible that it breaks some assumption in the test environment. Worth fixing the test, and explicitly adding a test with serializeChunk used.

acdcjunior commented 3 years ago

the 6.5x increase in storage size surprises me. I'm not an expert on encryption, but I believe you should be able to find ciphers that spit out about the same number of bytes as what they ingest, no?

In these tests I did, I just tried crypto-js's AES and Rabbit out-of-the-box implementations (Rabbit loaded/saved 20% faster). While these don't seem to have many tuning options, I have seen others (e.g. window.crypto) that are much more lower-level and most likely allow you to tweak for optimal processing time or storage consumption.

Anyhow, I think we may postpone benchmarking more algos right now. Keeping the encrypt/decrypt functions as "injected dependencies" keeps this issue [of what cipher to pick] in userland (as it should).

but it's possible that it breaks some assumption in the test environment. Worth fixing the test, and explicitly adding a test with serializeChunk used.

Yeah, I didn't even open the console when the tests failed. I'll have a quick look later on today.

I think we're in agreement that encrypt/decrypt hook should be separate from serialize/desrialize chunk - happening AFTER stringification.

Good! What are the next steps, you'd say?

Another important point: Yesterday I wanted to try window.crypto's algo and realized their encrypt/decrypt functions actually return promises. So while I got the current encrypt/decrypt hooks 100% working (and should ship them in my app the next days), I think an official public version should support promises. The thing here is, although the changes to make this happen are not that complicated, they will require an almost complete refactoring of the whole adapter. In other words, I have managed so far to keep the changes very small and concentrated (which helps reviewing a lot), but this promise refactoring would pretty much affect the whole structure of the file. The good news is the tests should remain fully usable without any changes. This could also be a good opportunity to improve other aspects, such as the build (we could convert the file to TS and use const/let/async/await/arrow functions while keeping the output compatible with older browsers via the transpilation step) or the control flow (change the internal functions that use callbacks to promises as well) and more.

radex commented 3 years ago

Good! What are the next steps, you'd say?

Send a pull request to Nozbe/LokiJS with the required hooks and some tests (tests don't need to have real encryption used if it's inconvenient, some dumb obfuscation/deobfuscation is good enough for tests, just to demonstrate that you can add that step). I won't help you every step of the way, but if you show me real code, then I can give you feedback for what needs improvement.

I'll upstream those changes to official LokiJS later, once tested in WatermelonDB

stale[bot] commented 2 years ago

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

acdcjunior commented 2 years ago

A little after my last comment here, I actually completed the changes (added encryption) and have been using it in production since. Unfortunately, though, I lost timing to post a PR. I have an internal TODO to update our fork and when I get around to it I will get back to you with PR-worthy material. I will open a new issue, reopen this, or send a PR to the repo you mentioned, then.

Thanks for your support!

chmiiller commented 2 years ago

Hey Antonio @acdcjunior did you manage to merge your changes? I'm very interested in that :)

acdcjunior commented 2 years ago

@chmiiller I couldn't get around to it, unfortunately. It may need a cleanup to better match the current codebase's patterns (or less if radex approves all changes) or other changes to get up to speed with latest versions. I could push it as it is, if you think that helps. And we can take it from there, wdyt?

EDIT: FWIW I'm still very much using my modified version in production today. :grimacing:

radex commented 2 years ago

@acdcjunior even if it's not mergeable or reviewable, it might help others, so do share ;)

acdcjunior commented 2 years ago

Hey, there. I have pushed the encryption code here, all in one commit: https://github.com/acdcjunior/LokiJS/commit/84073ce6bff4cad2ac3c972dc03d56aae84d804d. I wrote specs, and have been using these changes in production since, but they are based on a year-old version of LokiJS, which is... not optimal.

As for updating it to match the current version... you can see, it changes quite a bit of the src/incremental-indexeddb-adapter.js. I noticed there have been recent changes (e.g. MegaChunking) that basically force me to rewrite all I did (too much conflicts ATM).

It's been almost one year since I did the changes so I don't remember much just by having a quick look. Unfortunately, it is uncertain if I'll have bandwidth to rewrite it any time soon, although I will have to do it sooner or later, but, anyway, there you have it.

EDIT: just one thing, though. API-wise I only added two props, options.encrypt and options.decrypt which, if provided, activate the encryption (it not, it all behaves the same). One important bit is that the way I implemented they don't accept promises as return values, but in a final version they probably should. Some important APIs (e.g. window.crypto) need that -- although the one I used doesn't.

chmiiller commented 2 years ago

thanks, @acdcjunior, this is looking very good!

primus11 commented 8 months ago

Worked further on this, PR being here: https://github.com/Nozbe/LokiJS/pull/5 I know this isn't perfect but might still come handy for someone.

I tried to keep minimal change but sadly couldn't once I tried adding async support (window.crypto). Was it worth it?

Benchmarks below are really large save but not with 10K but 100K (so, take these with grain of salt):

BEFORE CHANGE
saved size: 36869750
saveDatabase: 140.52392578125 ms
loadDatabase: 216.011962890625 ms

AFTER CHANGE
no encryption
saved size: 36869748
saveDatabase: 223.02783203125 ms (+58,72%)
loadDatabase: 257.624755859375 ms (+19,26%)

crypto-js/aes
saved size: 49187440 (+33,41%)
saveDatabase: 2503.418212890625 ms (+2317,85%)
loadDatabase: 2017.5859375 ms (+784,56%)

web-crypto/aes
saved size: 36885795 (+~0%)
saveDatabase: 301.177001953125 ms (+114,33%)
loadDatabase: 389.605224609375 ms (+80,37%)

At one point I also added support for directly writing ArrayObject. As a future note: before this change writes were 4 times slower and reads 10 times (web-crypto/aes) - I am pretty sure that there was spacing for improvement here but don't believe as much as was gained at the end.

Do note that no-encryption is slower at the end so it might be more appropriate to actually separate adapter. I believe that for my use case encryption overhead will be tolerable but might not be for everyone. Also note that this isn't yet battle tested.

Nozbe / WatermelonDB

Adding encryption capabilities to incremental-indexeddb-adapter.js #1096