deanlandolt / bytewise

Binary serialization of arbitrarily complex structures that sort bytewise
113 stars 11 forks source link

Expand on use cases #19

Open emilbayes opened 9 years ago

emilbayes commented 9 years ago

Hi Dean,

I asked a question #leveldb the other day about partitioning my dataset in leveldb, to get "implicit indexing" (example will follow) and @ralphtheninja sent me over here. I had looked at bytewise and I'm very intrigued by all the use-cases but I fail to see how they can be carried out by bytewise. Maybe I lack formal CS education or just imagination :stuck_out_tongue:

Anyway, currently I have a key structure like {gender: 'M', name: 'Emil'} which is then encoded as M!Emil. This means I can efficiently retrieve all records that are male or female, or all male names starting with E, but not much more than that. I can effectively use range queries to do search suggestions as shown by level-trie since lexicographic sorting will have common prefixes in order.

However how would you do partitioning with bytewise? How about full text search? I tried to look into 'CouchDB-style "joins"', but the links seem to be dead :disappointed:

The original IRC correspondence on #leveldb:

14:06 <tixz> How will `keyEncoding: 'json'` behave when specifying ranges? I mean, V8 doesn't reorder keys so {foo: 1, bar: 2} !== {bar: 2, foo: 1} when using JSON.stringify, so my ranges would become unreliable, right?
14:07 <tixz> I can't really seem to find much info on keyEncoding, but I love the trick that a carefully constructed key can be used as an index :D
18:00 <ralphtheninja> tixz: you can use other encodings, e.g. bytewise https://github.com/deanlandolt/bytewise
18:11 <tixz> ralphtheninja: I looked at bytewise, but can I choose the order of keys? I have a key structure where I can exploit the keys for partitioning/indexing the dataset, but it requires the keys to be in a specific order
18:13 <tixz> ralphtheninja: All the uses cases sound wonderful (https://github.com/deanlandolt/bytewise#use-cases) but I guess I lack imagination :p
18:45 <ralphtheninja> tixz: post an issue on bytewise repo and ask dean :)
18:46 <ralphtheninja> be prepared for an extensive answer to your specific use case :)
19:07 <tixz> ralphtheninja: Haha, thanks will try :)
dominictarr commented 9 years ago

@emilbayes it doesn't do anything magic, it just chooses an order with which to encode the keys. I think it's aphabetical, and g is before n, so it would encode as {gender: 'M', name: 'Emil'}, which would be different to {name: 'Emil', gender: 'M'}. If you want to query on multiple keys you need multiple indexes.

However, it does have well defined orderings for all it's encodings, so it is is still very useful for easy to define ranges (better than !~ separators) the object as key encoding is there for completeness, but not really that useful, I have never used it, but had lots of success with combinations of strings, arrays, numbers as keys.

deanlandolt commented 9 years ago

What @dominictarr said :)

Though I actually killed the object encoding stuff w/ the 1.0.0 release since it really wasn't useful as is. I have some ideas for how to make it useful...

There are various ways to reason about what an object is, but ignoring all the "class-like" stuff, there are two "kinds" of objects that are especially common -- maps (associative arrays) and records (essentially just tuples w/ "named" string indexes rather than seemingly arbitrary integer indexes). There's an obviously "sane" way to encode either of these lexicographically, but without some additional metadata, it's really not possible to determine which kind of encoding -- which sort semantics -- you really wanted, so it's not really worth bothering with for now.

You can apply a transform to your data to encode whatever semantics you want with just raw arrays, strings, and numbers -- you just have to take care to reverse this transform when decoding. I'd like to make this process a little more formal, allowing you to efficiently encode various properties into the type tag, which would allow anyone to decode to your spec w/o the need for an out-of-band schema. I've worked through most of the details, but nothing useful to show yet.

As for partitioning, if you have a specific use case in mind I'd be happy to elaborate on some possible approaches. To your specific full-text search question, one nice property of tuples is as you encode terms to varying degrees of specificity (which is most of the work that an FTI pipeline is doing), you might write them into tuples from least specific to most specific -- creating a hierarchy of equivalence classes, allowing you to view terms (words, n-grams, whatever) at whatever level of specificity is appropriate to your search. This is one way to handle partitioning -- it's a little verbose, but a lot of the redundancy should be compressed away at rest by the snappy db compression, and typically data on the wire's typically going to get some compression as well.

emilbayes commented 8 years ago

Sorry for getting back so late.

What you say about the "duality" of Javascript objects is exactly my problem. I want "named" tuples, and tried to solve it here, but realise now I should have built it on bytewise. What are the solutions that you think you have worked out?

I hadn't considered just being verbose with the FTI stuff. Maybe I'm still trying to be too optimal, too early. With regards to partitioning, all that I was thinking of was the specific usecase I had at hand, gender being dichotomous and name being an arbitrary string. Thank you for the elaborate answers.