MatrixAI / js-encryptedfs

Encrypted Filesystem for TypeScript/JavaScript Applications
https://polykey.com
Apache License 2.0

Identify and Eliminate Unscalable Operations with Lazy GC #53

Open CMCDragonkai opened 2 years ago

CMCDragonkai commented 2 years ago

Specification

Some operations are "unscalable". For example, to delete a file, one has to iterate over all of the file's blocks, batch up all the delete operations, and then execute them. This is due to how we store each file block as a separate key-value entry in a sublevel.

If the file is large, this can take a long time. Furthermore, deleting a file should be much faster than this: real filesystems do not need to read an entire file just to delete it.
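For illustration, a rough sketch of the shape of the problem, assuming a hypothetical key-value iterator API (not the actual EFS internals): deleting a file means walking every block key under the file's data sublevel and buffering a delete operation for each one before committing.

```ts
// Hypothetical key-value store interface, for illustration only.
interface BlockStore {
  // Iterate all keys under a prefix (e.g. the file's data sublevel).
  keys(prefix: string): AsyncIterable<string>;
  // Execute a batch of delete operations atomically.
  batch(ops: Array<{ type: 'del'; key: string }>): Promise<void>;
}

// Unscalable delete: O(number of blocks) key reads just to remove a file,
// and the entire batch is held in memory before it executes.
async function deleteFileBlocks(store: BlockStore, fileIno: number): Promise<void> {
  const ops: Array<{ type: 'del'; key: string }> = [];
  for await (const key of store.keys(`data/${fileIno}/`)) {
    ops.push({ type: 'del', key });
  }
  await store.batch(ops);
}
```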

One way to solve this problem is to use lazy garbage collection. When deleting a file, it is merely marked as no longer accessible. This is already possible, since file accessibility is determined by hardlinks, which are persisted in the directory inodes.

However, what is going to actually delete the key-value entries? Real filesystems can simply mark file space as reusable. Unfortunately this is not possible with a key-value structure, and emulating it would get quite complex.

Therefore, we may instead use a garbage collection process that runs as a stream/loop and quietly deletes unused entries in the background continuously as the EFS is being used. Right now we have something similar to this in order to remove zombie inodes in our gc sublevel.
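A minimal sketch of the "mark" half of this scheme, assuming a hypothetical inode store with a garbage-marking sublevel (names and APIs here are illustrative, not the current EFS implementation): unlinking only touches metadata, and the actual block entries are left for the collector.

```ts
// Hypothetical shapes for illustration only.
interface INode {
  ino: number;
  nlink: number; // hardlink count from directory entries
}

interface INodeStore {
  get(ino: number): Promise<INode>;
  put(inode: INode): Promise<void>;
  // Record an inode as garbage to be swept later (e.g. in a `gc` sublevel).
  markGarbage(ino: number): Promise<void>;
}

// Lazy delete: unlink only updates metadata; block entries stay on disk
// and are reclaimed later by the background collector.
async function unlinkLazily(store: INodeStore, ino: number): Promise<void> {
  const inode = await store.get(ino);
  inode.nlink -= 1;
  await store.put(inode);
  if (inode.nlink === 0) {
    await store.markGarbage(ino); // O(1) regardless of file size
  }
}
```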

Recommend researching garbage collection techniques like those in https://github.com/kenfox/gc-viz and seeing how we can apply them.

Deleting a file isn't the only case where this can occur. There may well be other operations that are unscalable. Identify all of them in this issue.

Additional context

Tasks

  1. [ ] - Identify all unscalable operations and whether they apply to this situation
  2. [ ] - Research the garbage collection techniques above
  3. [ ] - Integrate the technique above in a lazy way so that it neither hogs the CPU nor blocks other operations. Because JS async contexts cannot be interrupted, they must yield; the implementation must be as asynchronous as possible and coroutine/cooperative-multitasking friendly, perhaps by the use of setTimeout yields (implemented as sleep promises), see the sketch after this list.
  4. [ ] - Consider whether it would be possible to multi-thread the leveldb usage in worker threads and end up with parallel lazy GC. Note that leveldb currently is not "multi-process" safe, but there is a multilevel adapter for multiple processes. Experiment with worker threads, as they use threads rather than processes.
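As a sketch of task 3, the cooperative yielding could look something like the following, where the sweep step is a hypothetical function that deletes at most a bounded number of garbage entries per call (signature assumed for illustration):

```ts
// Sleep promise used as a cooperative yield point.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical sweep step: deletes at most `limit` garbage entries and
// reports how many it actually deleted.
type SweepStep = (limit: number) => Promise<number>;

// Run the collector as a long-lived loop that yields between small
// batches so other async operations are never starved.
async function runCollector(step: SweepStep, batchSize = 256): Promise<void> {
  for (;;) {
    const deleted = await step(batchSize);
    // Yield to the event loop; back off when there is nothing to collect.
    await sleep(deleted > 0 ? 0 : 1000);
  }
}
```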
CMCDragonkai commented 2 years ago

https://github.com/MatrixAI/js-encryptedfs/pull/63#issuecomment-1102145235

Currently ftruncate is unscalable because it reads in all the file blocks before deleting.

With our transactional iteration, we should be able to identify the block index we need to truncate to, then do a more precise deletion of the blocks above it, and overwrite only the specific block that needs it. Will address this after all bugs are solved.
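Roughly, assuming a fixed block size and a hypothetical range-delete over block keys (not the actual EFS API), the idea would look something like this:

```ts
// Hypothetical store operations for illustration only.
interface BlockStore {
  // Delete every block of the file with index >= fromIdx.
  deleteBlocksFrom(ino: number, fromIdx: number): Promise<void>;
  get(ino: number, idx: number): Promise<Buffer>;
  put(ino: number, idx: number, data: Buffer): Promise<void>;
}

// Precise truncate: compute the boundary block from the new length,
// drop everything above it, and rewrite only the boundary block.
async function truncatePrecisely(
  store: BlockStore,
  ino: number,
  length: number,
  blockSize: number,
): Promise<void> {
  const boundaryIdx = Math.floor(length / blockSize);
  const keepBytes = length % blockSize;
  // Remove whole blocks past the boundary without reading them.
  await store.deleteBlocksFrom(ino, keepBytes > 0 ? boundaryIdx + 1 : boundaryIdx);
  if (keepBytes > 0) {
    // Only the boundary block needs a read-modify-write.
    const block = await store.get(ino, boundaryIdx);
    await store.put(ino, boundaryIdx, block.subarray(0, keepBytes));
  }
}
```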

CMCDragonkai commented 2 years ago

I want to note that our write buffer is placed in the DB (on disk) itself and is therefore not bounded by memory. However, our write operations are buffered in memory and not in the DB. You might wonder why we don't push all write operations into the DB transaction data path as well. The reason is that at the end you still end up reading them all back into memory to perform the write batch, so there's no advantage to putting the operations onto disk. Therefore, no matter what, our transaction sizes are still bounded by memory.

Basically remember that our DB will always have transactions bounded by memory. So if there's an atomic operation that is larger than main memory, it will cause a problem.

Any solution has to avoid buffering too many operations in memory, which is why I'm mentioning GC in this issue. If you get GC involved, you can "stream" mutate the DB, and thus side effects occur lazily (or in batches).
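In other words, any large side effect has to be chunked so that only a bounded number of operations is ever held in memory at once. A rough sketch, assuming a hypothetical async key iterator (each batch commits independently, so the deletion is lazy rather than atomic, which is why reachability must already be handled by marking):

```ts
// Hypothetical store interface for illustration.
interface Store {
  keys(prefix: string): AsyncIterable<string>;
  batch(ops: Array<{ type: 'del'; key: string }>): Promise<void>;
}

// Stream-mutate: never hold more than `batchSize` operations in memory.
async function deleteInBatches(
  store: Store,
  prefix: string,
  batchSize = 256,
): Promise<void> {
  let ops: Array<{ type: 'del'; key: string }> = [];
  for await (const key of store.keys(prefix)) {
    ops.push({ type: 'del', key });
    if (ops.length >= batchSize) {
      await store.batch(ops);
      ops = [];
    }
  }
  if (ops.length > 0) {
    await store.batch(ops);
  }
}
```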

The best solution is that the underlying DB eventually has a disk-based write batch. But even in rocksdb, write batches are limited by main memory: https://github.com/facebook/rocksdb/issues/5938.

So yea, eventually the design has to be capable of doing things lazily or in small batches while preserving consistency.