deanlandolt / bytespace

Efficient keypath subspaces with bytewise tuples
MIT License

how does this differ from level-sublevel/bytewise? #2

Open dominictarr opened 9 years ago

dominictarr commented 9 years ago

level-sublevel has a supported bytewise encoding, am curious what this does differently, and how compatible it is?

deanlandolt commented 9 years ago

It should be API-compatible -- it implements the sublevel method and pre/post hooks w/ the same API, and supports the prefix arg in batches. Honestly, it's not all that different from level-sublevel@6, but makes it easier for us to embed nested spaces from the client side of a multilevel connection, letting us build little sublevel-like proxy subspaces over a larger dumb keyspace (so we don't have to redeploy the underlying db every time we want to tweak one of the subspaces).

It also opens the door to some encoding-specific optimizations. For example, we don't bytewise-encode the whole keyspace, just the namespace component -- the key component can be encoded however the user likes, or might be an arbitrary buffer. We only have to encode the namespaces once, on create, and never have to bother decoding them -- we just slice them off and decode the remaining key based on the encoding specified by the user. This might also be how sublevel@6 does it -- it's been a while since I've looked closely at any of this -- if so, that's awesome, and makes it more likely that we could converge with sublevel at some point.
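A rough sketch of the "encode the namespace once, then just slice it off" idea described above. It is not bytespace's actual code: `createSpace`, the `utf8` codec object, and the literal prefix bytes are all hypothetical names and values made up for illustration, and a plain Buffer stands in for a bytewise-encoded namespace.

```javascript
// Hypothetical helper (not the bytespace API): the namespace prefix is
// encoded once up front, prepended on write, and sliced off on read.
function createSpace(prefix, keyCodec) {
  // `prefix` is an already-encoded namespace Buffer; it is never decoded.
  return {
    encodeKey(key) {
      // Only the key component goes through the user's chosen encoding.
      return Buffer.concat([prefix, keyCodec.encode(key)]);
    },
    decodeKey(raw) {
      // Drop the namespace by length, then decode what remains.
      return keyCodec.decode(raw.subarray(prefix.length));
    }
  };
}

// A stand-in key encoding; any encode/decode pair would do here.
const utf8 = {
  encode: (key) => Buffer.from(String(key), 'utf8'),
  decode: (buf) => buf.toString('utf8')
};

// Made-up prefix bytes standing in for an encoded namespace.
const users = createSpace(Buffer.from([0x01, 0x75, 0x00]), utf8);
const raw = users.encodeKey('alice');

console.log(raw.toString('hex'));  // prefix bytes followed by the utf8 key
console.log(users.decodeKey(raw)); // the original key, namespace removed
```

Slicing by length only works because every key read back through a space is known to start with that space's prefix, which is what a range-limited read over the namespace would guarantee.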

There are some other desirable properties we were going for -- for example, we wanted the ability to isolate sublevels -- without an explicit reference to a db you can't write into a space -- even a subspace. This isn't actually the case right now, but should be easy enough to achieve w/ a little more work. There are a bunch of little encoding-specific optimizations I want to tinker with for minimizing namespace keys down to a few bytes while optionally allowing order to be preserved (allowing for a partial order of sublevel namespaces to be defined). NYI, but easy enough to do. Shrinking the size of keys would be a big win -- w/ snappy it shouldn't matter much in the db itself, but on the wire (e.g. over multilevel) long namespace keys are pretty wasteful. This might also be something we can push down into a distinct namespace encoding in sublevel.

I'd love to move away from yet another lib to maintain and just use sublevel -- and I suspect all the above could be made to work in sublevel@6 as is. But its underlying nut/shell architecture, while really clever, makes it a little tough for me to grok. The approach in bytespace is a little different, and IMHO simpler -- it's just a straight up wrapping of the requisite levelup methods. Ultimately I'm hoping to define an even simpler API than leveldown: one write method, an atomic batch write (something like chained batch, as it's the most primitive, but could be something like a writable stream/iterable API that normalizes chained batch w/ array-based batch), one read method (akin to createReadStream, but with sane ltgt options, coherent keys/values projection options, and optional seeking, which is supported natively by some dbs, including leveldb iterators), along with async open and close methods. This would be a lot cleaner to wrap, and every leveldown (and hence, levelup) API could be modeled on top.
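To make the proposed shape a bit more concrete, here is a minimal in-memory sketch with one atomic write, one ranged read, and async open/close. Everything in it is an assumption for illustration: the `MemoryStore` class, the `{ type, key, value }` op shape, and the `gt`/`gte`/`lt`/`lte`/`reverse` option names are not taken from any existing leveldown API, and seeking is left out.

```javascript
// Hypothetical "levelbottom"-style store: open, close, one write, one read.
class MemoryStore {
  constructor() {
    this.entries = new Map(); // hex(key) -> { key, value }
    this.isOpen = false;
  }

  async open() { this.isOpen = true; }
  async close() { this.isOpen = false; }

  // Atomic batch: validate every op first, then apply them all.
  async write(ops) {
    if (!this.isOpen) throw new Error('store is closed');
    for (const op of ops) {
      if (op.type !== 'put' && op.type !== 'del') {
        throw new Error('unknown op type: ' + op.type);
      }
    }
    for (const op of ops) {
      const id = op.key.toString('hex');
      if (op.type === 'put') this.entries.set(id, { key: op.key, value: op.value });
      else this.entries.delete(id);
    }
  }

  // Single ranged read with ltgt-style bounds over bytewise-sorted keys.
  async *read({ gt, gte, lt, lte, reverse = false } = {}) {
    if (!this.isOpen) throw new Error('store is closed');
    const sorted = [...this.entries.values()]
      .sort((a, b) => Buffer.compare(a.key, b.key));
    if (reverse) sorted.reverse();
    for (const entry of sorted) {
      if (gt && Buffer.compare(entry.key, gt) <= 0) continue;
      if (gte && Buffer.compare(entry.key, gte) < 0) continue;
      if (lt && Buffer.compare(entry.key, lt) >= 0) continue;
      if (lte && Buffer.compare(entry.key, lte) > 0) continue;
      yield entry;
    }
  }
}

async function demo() {
  const store = new MemoryStore();
  await store.open();
  await store.write([
    { type: 'put', key: Buffer.from('b'), value: Buffer.from('2') },
    { type: 'put', key: Buffer.from('a'), value: Buffer.from('1') },
    { type: 'put', key: Buffer.from('c'), value: Buffer.from('3') }
  ]);
  const keys = [];
  const range = { gte: Buffer.from('a'), lt: Buffer.from('c') };
  for await (const entry of store.read(range)) keys.push(entry.key.toString());
  await store.close();
  return keys;
}

demo().then((keys) => console.log(keys));
```

With a surface this small, get/put/del and the various stream flavors could presumably each be expressed as a thin wrapper over `write` and `read`, which is the property the paragraph above is after.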

There are a few other things I can't quite remember, but that's the overall rationale. IMHO, with the kind of lower-level leveldown API (leveldowner? levelbottom?) I described above, we could layer something like bytespace on top really nicely, but with pluggable ns-prefix encoding (essentially a stripped down sublevel). It would be completely keyEncoding and valueEncoding-agnostic (like sublevel and bytespace), and ideally the ns-encoding is separate from the keyEncoding (so it never has to be decoded, like bytespace, and maybe sublevel@6?). Essentially it takes control of the keyspace and is responsible for managing how various subspaces can be created and referenced. It also dictates the kind of key encodings that can be used for keys within any of its namespaces, and the allowable encodings when creating any nested subspaces. Not much different than what we have today -- it'd just unify the API and extend the capabilities of the sublevel method a bit.

This layer would also abstract away the distinction between keys stored as buffers or strings -- that would be a property of the underlying store (which might support one or both, but would always prefer one over the other). The leveldown interface could be layered on top of this, and levelup on top of that. We might also want to encode something like keyEncoding into the namespace itself -- at least, the space of possible types. This would only make sense for something like bytewise-like keys, with a total order for different types defined. This isn't strictly necessary, but the keyEncoding stuff in levelup mixes several very separate concerns. The same could be said of valueEncoding -- and both should probably be reconsidered a bit...

For example, when you provide a valueEncoding: 'json' with a put you're saying "transform this structured value into some lexical string using some transform" (JSON.stringify). But does the underlying space support the lexical space of JSON strings? Or just some subset (e.g. just utf8 strings)? Or not at all, e.g. uint8[]? These uint8 "arrays" might be stored as Buffers, hex-encoded strings, base64-lex-encoded strings (base64 w/ an alphabet chosen so that it sorts lexicographically) -- what the underlying data store uses is immaterial to the API -- any of these encodings are isomorphic, and allow any more "refined" lexical space to be embedded.
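A small check of the ordering point, using only Node's built-in Buffer. The sample keys are arbitrary. It shows lowercase hex strings sorting the same way as the raw bytes, and the standard base64 alphabet failing to do so for one pair of bytes, which is why a sort-friendly alphabet would be needed.

```javascript
// Compare strings by code unit, the way a string-keyed store would sort them.
const compareStrings = (a, b) => (a < b ? -1 : a > b ? 1 : 0);

// Arbitrary sample keys, including one that is a prefix of another.
const keys = [
  Buffer.from([0x10]),
  Buffer.from([0x02, 0xff]),
  Buffer.from([0x02]),
  Buffer.from([0xab, 0x00])
];

// Sort as raw bytes, then render as hex for display.
const byBytes = [...keys].sort(Buffer.compare).map((k) => k.toString('hex'));

// Render as hex first, then sort as plain strings.
const byHex = keys.map((k) => k.toString('hex')).sort(compareStrings);

console.log(byBytes);
console.log(byHex);

// Standard base64 is not order-preserving: here byte 0xff encodes to a
// string that sorts before the encoding of byte 0x00.
const low = Buffer.from([0x00]).toString('base64');
const high = Buffer.from([0xff]).toString('base64');
console.log(low, high, compareStrings(high, low));
```

Hex keeps byte order because every byte maps to exactly two characters and the digits `0-9` sort before `a-f`; the cost is doubling the key size, which is where a lexicographically ordered base64 alphabet would come in.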

Defining these "refined" lexical spaces (e.g. the space of utf8-encoded JSON.stringify values, etc.) is surely a rabbit hole not worth exploring, but could be useful to consider with respect to SoC -- any given db has a lexical space it allows you to store keys in, and another for values (possibly the same, but not always). How keys should be encoded for storage is only relevant to an API like leveldown which allows these primitive "lexical" keys to be specified as buffers or strings, and perhaps also to a namespace layer like sublevel or bytespace (they may want to know if the backing store can store and sort structured arrays more efficiently, e.g. an IDB or couch backing store). But these details are never relevant to end users, or even to an API like levelup that might allow alternate encodings to be specified to transform values into some specific lexical space. ISTM levelup is just a special case of a more general namespacing API for dbs with an empty namespace -- assuming it's essentially free to run bytespace or sublevel without a namespace, the levelup API becomes moot -- it's completely subsumed by a more general namespace API.

The leveldown API goes most of the way in drawing out this separation, but allows the buffer/string distinction to leak all the way through to the end user, e.g. allowing them to supply Buffer keys to a database that only supports hex strings (e.g. one using level-2pc). Something like a lowest-level levelbottom API could help close this leak by allowing a more primitive set of transforms to be defined that are completely data-type agnostic.

Holy shit this got long -- apologies for all the words...

tl;dr: bytespace is intentionally api-compatible with sublevel, architecturally similar, but arguably simpler, making certain encoding experiments a little easier. Ideally, it lays the groundwork for a more foundational API than leveldown, one that would make libs like bytespace and sublevel trivial, and allow libs like level-live-stream and level-ttl to be independent of the namespace layer, and even independent of a level-hooks-like lib (which would also be trivial). If they need sublevels they could create them explicitly, w/o assuming the underlying db is "sublevel-enabled" (exposes a sublevel method). These nested namespaces could be completely private, unreachable by any other namespace/sublevel (except the root, which is responsible for ns prefixing, and thus has the capability to construct any namespace). That's the dream at least.

dominictarr commented 9 years ago

@deanlandolt the architecture you describe is pretty much what level-sublevel@6 is. nut.js is a subset of leveldown (apply = batch, and iterator). nut handles all the generic things: encoding, decoding, setting up namespaces. shell just wraps the nut in the levelup api + sublevel & pre & post.

just wrapping a levelup is how sublevel worked <=5, and is certainly the most obvious way to get started, but i came to roughly the same conclusions as you did and wrote 6 ;)

possibly the way you want to do key encodings could be done with lsl with a custom codec?