fjall-rs / lsm-tree

K.I.S.S. LSM-tree implementation in safe Rust
https://fjall-rs.github.io/
Apache License 2.0

feat: IO trait (to permit plugging in cloud blob storage) #42

Open rbtcollins opened 1 month ago

rbtcollins commented 1 month ago

Is your feature request related to a problem? Please describe.

I find that many small services end up with most of their cost going to a running PostgreSQL server that they barely use. Direct blob storage starts to look very attractive, and while multiple-instance services couldn't use fjall, slapping a gRPC front-end onto a single service that provides the data model would work very well, I think. But only if the IO layer works on blob storage rather than requiring local disk.

Describe the solution you'd like

An IO trait compatible with e.g. the Azure/AWS/Google Rust SDKs for blob storage. It needn't be async, since an internal channel can be used to bridge to async, and I understand lsm-tree not to be async internally.
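To make the channel idea concrete, here is a rough sketch using only the standard library. Everything in it is an assumption for illustration: `BlobIo`, `ChannelBlobIo` and `Request` are hypothetical names, not part of lsm-tree or any cloud SDK, and a plain worker thread with a `HashMap` stands in for the async runtime and the real blob store.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{self, Sender};
use std::thread;

/// Hypothetical synchronous IO trait the tree could depend on.
pub trait BlobIo {
    fn put(&self, key: &str, data: Vec<u8>);
    fn get(&self, key: &str) -> Option<Vec<u8>>;
}

/// Messages sent from the synchronous side to the worker.
enum Request {
    Put { key: String, data: Vec<u8>, done: Sender<()> },
    Get { key: String, reply: Sender<Option<Vec<u8>>> },
}

/// Sync facade; the worker behind the channel stands in for an async runtime
/// driving a cloud SDK (assumption: here it is just a thread and a HashMap).
pub struct ChannelBlobIo {
    requests: Sender<Request>,
}

impl ChannelBlobIo {
    pub fn spawn() -> Self {
        let (tx, rx) = mpsc::channel::<Request>();
        thread::spawn(move || {
            let mut blobs: HashMap<String, Vec<u8>> = HashMap::new();
            // The loop ends once every sender has been dropped.
            for request in rx {
                match request {
                    Request::Put { key, data, done } => {
                        blobs.insert(key, data);
                        let _ = done.send(());
                    }
                    Request::Get { key, reply } => {
                        let _ = reply.send(blobs.get(&key).cloned());
                    }
                }
            }
        });
        Self { requests: tx }
    }
}

impl BlobIo for ChannelBlobIo {
    fn put(&self, key: &str, data: Vec<u8>) {
        let (done, wait) = mpsc::channel();
        let request = Request::Put { key: key.to_owned(), data, done };
        self.requests.send(request).expect("worker is alive");
        // The caller blocks here until the worker acknowledges the write.
        wait.recv().expect("worker replied");
    }

    fn get(&self, key: &str) -> Option<Vec<u8>> {
        let (reply, wait) = mpsc::channel();
        let request = Request::Get { key: key.to_owned(), reply };
        self.requests.send(request).expect("worker is alive");
        wait.recv().expect("worker replied")
    }
}
```

Every call still blocks the calling thread until the worker answers, so this hides the async side behind a sync trait without removing the waiting.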

Describe alternatives you've considered

Writing a new similar project natively targeting blob stores

marvin-j97 commented 1 month ago

Would you mind sharing what you would expect this to look like? I'm thinking of Tree being generic over a trait F that requires Seek + Read + WhateverElseIsNeeded, similar to the way LevelDB defines its files (https://github.com/google/leveldb/blob/068d5ee1a3ac40dabd00d211d5013af44be55bea/helpers/memenv/memenv.cc#L200, https://github.com/google/leveldb/blob/068d5ee1a3ac40dabd00d211d5013af44be55bea/helpers/memenv/memenv.cc#L185)?
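As a starting point for discussion, a minimal sketch of that shape might look like the following. The names `TreeFile`, `append`, `read_at` and `in_memory_tree` are hypothetical, the `Tree` here is a toy stand-in rather than the crate's actual type, and the trait bounds (`Read + Write + Seek`) are only a guess at what "WhateverElseIsNeeded" would include.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom, Write};

/// Hypothetical trait: everything the tree would need from a "file".
pub trait TreeFile: Read + Write + Seek {}

/// Anything that already reads, writes and seeks qualifies (e.g. std::fs::File).
impl<T: Read + Write + Seek> TreeFile for T {}

/// Toy stand-in for a tree that is generic over its file type.
pub struct Tree<F: TreeFile> {
    file: F,
}

impl<F: TreeFile> Tree<F> {
    pub fn new(file: F) -> Self {
        Self { file }
    }

    /// Appends a length-prefixed record and returns the offset it starts at.
    pub fn append(&mut self, value: &[u8]) -> io::Result<u64> {
        let offset = self.file.seek(SeekFrom::End(0))?;
        self.file.write_all(&(value.len() as u32).to_le_bytes())?;
        self.file.write_all(value)?;
        Ok(offset)
    }

    /// Reads back the record that starts at `offset`.
    pub fn read_at(&mut self, offset: u64) -> io::Result<Vec<u8>> {
        self.file.seek(SeekFrom::Start(offset))?;
        let mut len = [0u8; 4];
        self.file.read_exact(&mut len)?;
        let mut buf = vec![0u8; u32::from_le_bytes(len) as usize];
        self.file.read_exact(&mut buf)?;
        Ok(buf)
    }
}

/// In-memory backend, comparable in spirit to LevelDB's memenv helper.
pub fn in_memory_tree() -> Tree<Cursor<Vec<u8>>> {
    Tree::new(Cursor::new(Vec::new()))
}
```

With this, `Tree<std::fs::File>` and `Tree<Cursor<Vec<u8>>>` share the same code path; a blob-store backend would have to emulate `Seek` and `Write` itself, which may or may not map well onto object storage.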

The crate is definitely very hard-coded to use std::fs right now, so it would be quite a huge refactor to get rid of it all. Plus with V2 there's another crate that would need the same treatment. Contributions are greatly appreciated.

rbtcollins commented 2 weeks ago

Been thinking a bit on this.

Code golf bits:

https://matklad.github.io/2021/09/04/fast-rust-builds.html suggests keeping the generic code at a crate's interface very thin.

So this suggests:

- A generic struct (or structs) that expresses the pluggable nature with traits, as you describe.
- An inner struct that is not generic but holds a dyn impl of the generic type.
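Roughly what I have in mind, as a sketch only: `Storage`, `TreeInner` and `MemoryStorage` are made-up names, and the "tree" is reduced to a key-value passthrough so the layering is visible.

```rust
use std::collections::BTreeMap;
use std::marker::PhantomData;

/// Hypothetical storage trait; the methods are illustrative only.
pub trait Storage {
    fn put(&mut self, key: &str, value: Vec<u8>);
    fn get(&self, key: &str) -> Option<Vec<u8>>;
}

/// Non-generic core: compiled once, no matter how many backends exist.
struct TreeInner {
    storage: Box<dyn Storage>,
    writes: u64,
}

impl TreeInner {
    fn insert(&mut self, key: &str, value: &[u8]) {
        self.storage.put(key, value.to_vec());
        self.writes += 1;
    }

    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.storage.get(key)
    }
}

/// Thin generic shell: only the constructor and trivial forwarders are
/// monomorphized per backend type.
pub struct Tree<S: Storage + 'static> {
    inner: TreeInner,
    _backend: PhantomData<S>,
}

impl<S: Storage + 'static> Tree<S> {
    pub fn new(storage: S) -> Self {
        Self {
            inner: TreeInner { storage: Box::new(storage), writes: 0 },
            _backend: PhantomData,
        }
    }

    pub fn insert(&mut self, key: &str, value: &[u8]) {
        self.inner.insert(key, value)
    }

    pub fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.inner.get(key)
    }

    pub fn write_count(&self) -> u64 {
        self.inner.writes
    }
}

/// Example backend for local testing.
#[derive(Default)]
pub struct MemoryStorage(BTreeMap<String, Vec<u8>>);

impl Storage for MemoryStorage {
    fn put(&mut self, key: &str, value: Vec<u8>) {
        self.0.insert(key.to_owned(), value);
    }

    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.0.get(key).cloned()
    }
}
```

The trade-off is a virtual call per storage operation in exchange for compiling the core once, which is probably negligible next to the IO itself.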

However on reflection the bigger problem is going to be function colouring: if the traits are synchronous, then the impl for cloud blob storage is going to be holding an async runtime of some sort, and then blocking on calls into it everywhere (even if masked via a channel as I described). This is not ideal :/.

If the core itself was actually async, with the existing sync interface as a thin shim over the top, that would work pretty nicely I think. There are some good reasons to want the core to be async btw - for prior art I'll point you at FoundationDB, which builds the entire system around async kernel IO, and has some very nice testing and performance benefits as a result. On Linux, io_uring, and on Windows, IO Completion Ports, offer non-blocking IO even for local disk - and with SSDs offering deeper and deeper IO queues, this unlocks a reduction in thread count and context switching even in very high IO situations. All of which seems applicable to fjall in its embedded use case, rather than being a special case for the cloud blob storage scenario I've described :)
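A very rough sketch of the "async core, sync shim" layering, using only std. `AsyncTree` and the `Tree` wrapper are hypothetical, the core just wraps a `BTreeMap`, and `block_on` is a minimal single-future executor (assumption: a real crate would more likely lean on an existing runtime).

```rust
use std::collections::BTreeMap;
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Wakes the blocked thread when the future can make progress.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Minimal executor: drives one future to completion on the current thread.
fn block_on<F: Future>(future: F) -> F::Output {
    let mut future = pin!(future);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match future.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            Poll::Pending => thread::park(),
        }
    }
}

/// Hypothetical async core; real IO would be awaited inside these methods.
pub struct AsyncTree {
    data: BTreeMap<Vec<u8>, Vec<u8>>,
}

impl AsyncTree {
    pub async fn insert(&mut self, key: &[u8], value: &[u8]) {
        self.data.insert(key.to_vec(), value.to_vec());
    }

    pub async fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.data.get(key).cloned()
    }
}

/// Thin synchronous shim preserving a blocking API for existing users.
pub struct Tree {
    core: AsyncTree,
}

impl Tree {
    pub fn new() -> Self {
        Self { core: AsyncTree { data: BTreeMap::new() } }
    }

    pub fn insert(&mut self, key: &[u8], value: &[u8]) {
        block_on(self.core.insert(key, value))
    }

    pub fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        block_on(self.core.get(key))
    }
}
```

Async callers would use `AsyncTree` directly on their own runtime, while embedded users keep the blocking `Tree` and only pay for the shim.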