fedimint / fedimint

Federated E-Cash Mint
https://fedimint.org/
MIT License

Partition Database so Modules can re-use key prefixes #1265

Closed m1sterc001guy closed 1 year ago

m1sterc001guy commented 1 year ago

We don't currently have a great way of iterating through all of the keys in the database (i.e. `SELECT *` in SQL land). This is because different parts of the code have reserved key prefixes, so to iterate through all the keys we need to know about all of the DbKeyPrefix enums in different parts of the code. This is what dbdump does currently.

I want to iterate through all the keys so that I can implement db migrations by copying all the key/value pairs to a new db (we could also support an in-place upgrade, but making a new db is safer).

One way to do this would be to refactor all of the prefixes into a giant enum and simply iterate through that. This is not great for modularization, though, since it requires all modules to reserve key prefix space within the same database.

A better solution is to use RocksDB's column families. We can partition the database per module (each module can be a different column family). Key prefixes can then overlap between modules and we will still be able to read the key/value pairs properly from the database.

For the client, if it's using SQLite instead of RocksDB, we can use different tables instead of column families.
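For illustration, a minimal sketch of the RocksDB side using the `rocksdb` crate (the column family names here are made up):

```rust
use rocksdb::{Options, DB};

fn open_partitioned(path: &str) -> Result<DB, rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.create_missing_column_families(true);
    // One column family per module; key prefixes can overlap across
    // families without colliding.
    DB::open_cf(&opts, path, ["mint", "wallet", "ln"])
}

fn demo(db: &DB) -> Result<(), rocksdb::Error> {
    let mint = db.cf_handle("mint").expect("cf exists");
    let wallet = db.cf_handle("wallet").expect("cf exists");
    // Both modules use key prefix 0x01 without clobbering each other.
    db.put_cf(mint, [0x01u8, 0xAA], b"mint value")?;
    db.put_cf(wallet, [0x01u8, 0xAA], b"wallet value")?;
    Ok(())
}
```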

dpc commented 1 year ago

I was thinking that we should put module data behind a pre-prefix.

So we would have a global u8 enum for everything, where one thing would be:

```rust
#[repr(u8)]
enum Prefix {
    // ...
    Module,
    // ...
}
```

and then the database would somehow prefix all operations from a module with `Prefix::Module` (1 byte) and the `ModuleKey` (soon: `ModuleInstanceId`) (2 bytes), on both reads and writes.

This would isolate the modules from each other and put all the module data behind a common prefix.

We might create a wrapper struct ModuleDatabaseHandle and inject that instead of Database or DatabaseTransaction to modules. Internally it would just add these 3 bytes everywhere, so the module can't touch anything outside of its range.
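A rough sketch of what such a handle could look like (the `RawDatabase` trait here is a hypothetical stand-in for the real `Database`/`DatabaseTransaction` interface, and the concrete prefix value is made up):

```rust
type ModuleInstanceId = u16;

/// Hypothetical stand-in for the raw key/value interface of `Database`.
trait RawDatabase {
    fn raw_get(&self, key: &[u8]) -> Option<Vec<u8>>;
    fn raw_put(&mut self, key: Vec<u8>, value: Vec<u8>);
}

/// Assumed discriminant of `Prefix::Module` in the global enum.
const MODULE_PREFIX: u8 = 0x01;

/// Restricts a module to its own key range by prefixing every access
/// with `Prefix::Module` (1 byte) + `ModuleInstanceId` (2 bytes).
struct ModuleDatabaseHandle<D: RawDatabase> {
    db: D,
    instance_id: ModuleInstanceId,
}

impl<D: RawDatabase> ModuleDatabaseHandle<D> {
    fn prefix_key(&self, key: &[u8]) -> Vec<u8> {
        let mut full = Vec::with_capacity(3 + key.len());
        full.push(MODULE_PREFIX);
        full.extend_from_slice(&self.instance_id.to_be_bytes());
        full.extend_from_slice(key);
        full
    }

    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.db.raw_get(&self.prefix_key(key))
    }

    fn put(&mut self, key: &[u8], value: Vec<u8>) {
        self.db.raw_put(self.prefix_key(key), value);
    }
}
```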

m1sterc001guy commented 1 year ago

If we create a pre-prefix, are we concerned about performance at all? This essentially lumps all of the keys together in the same LSM tree, just sorted by the module key. Increases in db size from one module still affect other modules, since they share the same in-memory and on-disk structures.

With RocksDB column families, each column family shares the write-ahead log (WAL), but not the memtable or table files. So writing to or dropping an entire column family is fast, as opposed to needing to scan through the LSM tree.

Our database sizes right now are probably not large enough for this to be a problem, but something to consider.

m1sterc001guy commented 1 year ago

I suppose the downside of using column families is that in order to scan all of the keys in the database, we now need to iterate over all of the modules, then iterate over all of the prefixes.

dpc commented 1 year ago

> If we create a pre-prefix, are we concerned about performance at all? This essentially lumps all of the keys together in the same LSM tree,

My understanding is that this won't have any effect. The database just maintains a sorted list of all key-value pairs. It doesn't matter if the list is `a, b, c, d, e, f, g` or `a, b, mc, md, me, mf, g`: they sort the same and will be split into segments all the same. The only difference is that the keys will be slightly longer.

m1sterc001guy commented 1 year ago

> My understanding is that this won't have any effect. The database just maintains a sorted list of all key-value pairs. It doesn't matter if the list is `a, b, c, d, e, f, g` or `a, b, mc, md, me, mf, g`: they sort the same and will be split into segments all the same. The only difference is that the keys will be slightly longer.

This is true, but I think it ignores the performance characteristics of the db. For example, RocksDB maintains an LSM tree to store these sorted lists: the hot key/value pairs are kept in memory and the cold key/value pairs are flushed to disk in the background. Storing all key/value pairs in the same LSM tree means they all share the same block cache, memtable, and SSTables. That means the I/O pattern of one module could change the performance characteristics of the whole fedimint server.

From what I've read here and here, with column families each family is logically and physically separated, so they don't impact each other at all. It's closer to having separate databases, but they still share the same WAL, which allows for atomic writes across families.

Regardless, I was thinking we could introduce a database-agnostic partition concept that could be specified when creating a transaction. Then each database implementation can handle the partitioning the best way they see fit. Perhaps something like this:

```rust
#[async_trait]
pub trait IDatabase: Debug + Send + Sync {
    async fn begin_transaction(
        &self,
        decoders: ModuleDecoderRegistry,
        partition: &str,
    ) -> DatabaseTransaction;
}
```

dpc commented 1 year ago

> This is true, but I think it ignores the performance characteristics of the db. For example, RocksDB maintains an LSM tree to store these sorted lists: the hot key/value pairs are kept in memory and the cold key/value pairs are flushed to disk in the background.

This is all true.

> Storing all key/value pairs in the same LSM tree means they all share the same block cache, memtable, and SSTables. That means the I/O pattern of one module could change the performance characteristics of the whole fedimint server.

But I don't think this checks out. I don't think there's any common-prefix partitioning or anything like that. That would be an odd design. It's simpler for the database to be really oblivious to the key content and just treat all keys for what they are: a sorted index. It only compares keys with each other to maintain ordering, so it is able to do binary-search lookups.

Key clustering is very common in real use cases. A ton of use cases would use the most natural and simple numerical ID as keys, or ULIDs, where the first N bytes are going to be the same for all the data.

If you use a u64 as an ID, starting from 0, then for a long, long time all your keys will start with 4 or more zero bytes.
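For example, assuming big-endian encoding (which sorted key-value stores need for numeric keys to order correctly):

```rust
fn main() {
    // Early IDs of a u64 counter share long runs of leading zero bytes.
    let key = 42u64.to_be_bytes();
    assert_eq!(key, [0, 0, 0, 0, 0, 0, 0, 42]); // 7 leading zero bytes
}
```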

AFAIK. My understanding is that an LSM database maintains a buffer in memory that it keeps sorted (the memtable), then writes the contents of the memtable out to the db as sorted files (SSTables), and rewrites/compacts them in the background. The fact that some keys will have lots of shared prefixes doesn't change anything: the relative sorted order is going to be roughly similar, and so are the write patterns and the relative order of elements.

dpc commented 1 year ago

> Regardless, I was thinking we could introduce a database-agnostic partition concept that could be specified when creating a transaction. Then each database implementation can handle the partitioning the best way they see fit.

That's what I want to avoid. Why introduce any new concepts and make db backends more complex, when all we need (it seems) is a common prefix for modules?

dpc commented 1 year ago

> That would be an odd design.

Makes me think... partitioning the memtable into multiple separate ones might be useful to help with contention when handling requests in a multi-threaded fashion...

... but then doing it naively (e.g. by just the first byte of a key) would be very lame. It doesn't take much to re-adjust the ranges automatically to break the key clustering, by just keeping some counters here and there.

As I said, a lot of usage patterns will have key clustering, and I've spent only 15 minutes thinking about it, so I bet they are handling it OK. :D

m1sterc001guy commented 1 year ago

Hmm, perhaps we can talk about this on the call tomorrow. The point I was trying to make earlier didn't have anything to do with prefixes (I agree the database just treats them as sorted keys); it was more about the memtable and I/O patterns. The memtable has a fixed size and is flushed once it's full, so an extreme example would be a module that writes/reads a massive number of key/values each epoch. This could force more I/Os for the fedimint server, since keys would have been pushed out of the memtable and written to disk. Like you say, I think there is value in partitioning the memtable, which is one of the benefits of column families.

dpc commented 1 year ago

Oh. OK. I was only concerned with solving a functional issue. If we go with prefixes, we can put each module into a separate column (or whatever it is) later if it's really required.

But I can't even imagine this being a problem. A module that pumps that much data would probably overwhelm the consensus first, and then it would create such a volume of history that I really doubt we will ever need to worry about it.

I don't want to drag performance down for no good reason, but optimizing this right now doesn't seem like a priority. I like geeking out on perf optimizations, and I have no idea how much work it takes (maybe very little), but if you do want to pursue it, we should make it an entirely optional optimization (in the db backend implementation). We should definitely not make implementing any new backend harder to support it.

m1sterc001guy commented 1 year ago

OK, agreed with this. I can pursue the prefix implementation first, as adding an additional byte is not a big deal and I agree it addresses the issue laid out above. We can pursue column families as a performance optimization later (from the proof of concept I've done, I don't think it is hard to add).

m1sterc001guy commented 1 year ago

Another thought I had that would be an advantage of physically partitioning the database using column families: this mechanism could be used to make db migrations safer. The copy from the old column family of module X (version 1) to the new column family of module X (version 2) can be made atomic, since they share a WAL, but if anything goes wrong we don't actually corrupt the old module's data, and the user could potentially roll back by downgrading their fedimintd code.

As already discussed, this can be done in addition to the logical partitioning using a module prefix on each key.
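A sketch of what that atomic copy could look like with the `rocksdb` crate (the column family names and `migrate_value` are hypothetical):

```rust
use rocksdb::{IteratorMode, WriteBatch, DB};

/// Hypothetical per-entry migration from the v1 to the v2 value format.
fn migrate_value(old: &[u8]) -> Vec<u8> {
    old.to_vec()
}

fn migrate_module(db: &DB) -> Result<(), Box<dyn std::error::Error>> {
    let old_cf = db.cf_handle("mint_v1").expect("cf exists");
    let new_cf = db.cf_handle("mint_v2").expect("cf exists");

    // Both column families share the WAL, so this batch applies atomically.
    // The v1 data stays untouched until the migration is known-good, which
    // leaves a rollback path via downgrade.
    let mut batch = WriteBatch::default();
    for entry in db.iterator_cf(old_cf, IteratorMode::Start) {
        let (key, value) = entry?;
        batch.put_cf(new_cf, key, migrate_value(&value));
    }
    db.write(batch)?;
    Ok(())
}
```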

m1sterc001guy commented 1 year ago

We currently have database transactions isolated, but not Databases: https://github.com/fedimint/fedimint/pull/1486/files

m1sterc001guy commented 1 year ago

Databases can now be isolated with `new_isolated`: https://github.com/fedimint/fedimint/pull/1486