How to formulate prefix-based queries / LevelDB `lte`-boundaries

deanlandolt / bytewise

Binary serialization of arbitrarily complex structures that sort bytewise

113 stars 11 forks source link

How to formulate prefix-based queries / LevelDB `lte`-boundaries #17

Closed loveencounterflow closed 9 years ago

loveencounterflow commented 9 years ago

I'm using bytewise as an encoding in my levelup-powered DB design which is a bi-partite, keys-only DB that uses subject key/value pairs that are aligned with object key/value pairs. Currently, my primary keys look like [ 'so', [ sk, sv, ], [ ok, ov, ], idx, ] (where so means 'subject to object', k is for 'key', and v is for 'value'). My secondary (i.e. index) keys are reversals of the primary keys: [ 'os', [ ok, ov, ], [ sk, sv, ], idx, ].

The encoding that bytewise offers is a great fit for this since now i can use arbitrary data and still keep a meaningful ordering. However, I'm still grappling with a solution for my primary use case, which implies finding data for a given topic among the os keys. Before using bytewise, I used to formulate keys as strings; I could then open a LevelDB write stream with, say, { gte: 'os/foocount:0', lte: 'os/foocount:0$', } to retrieve all and only the keys with the prefix 'os/foocount:0' (the $ sign is actually a 0xff byte that I inserted into the buffer that represents the prefix).

The good thing about the string-based indexes is that they're conceptually rather simple and injecting a 0xff byte is sure to work correctly; the downside is that you continually have to formulate the keys, escape meta-characters, take care to apply a correct padding to your numerical values, etc etc.

So the question is: given a partial key to be used as a prefix like [ 'os', [ 'foocount', ] ], what is the correct way to get the gte and lte parts that I need for a LevelDB readstream? Currently I encode my array and remove all trailing zero bytes, which gives me the gte key, and then copy that buffer and set the last byte to 0xff, but that feels so wrong.

dominictarr commented 9 years ago

null and undefined are the bottom and top values, so if you want to get all keys that start with a subject you would use:

//I normally store min max values in constants like this
//so that uses are more readable.
var LO = null, HI = undefined
{
  gte: ['os', ['foocount', LO]],
  lte: ['os', ['foocount', HI]],
}

deanlandolt commented 9 years ago

You shouldn't have to hack the keys to remove trailing zero bytes, nor hack in the 0xff byte -- they're not incorrect but should feel icky. @dominctarr's answer works great, and should feel a lot less wrong.

The only place this starts to fall down is when you have to be sure you're capturing only strings in your query range, and nothing else. This might occur when a given keypath component could be a string or an array -- which doesn't look to be the case for the index structures you described, but I can see why this may be a concern. This also comes up for a use case I'm chasing -- strongly-typed keyspaces, where we can use binary compare for cheap runtime type guards.

For the former case, you can probably get away with using closed ranges with empty array components to capture all strings. For the latter case you'd probably want something a little more stronger, and actually the 0xff byte hack is the best way to get there for now. But this is something we should be doing for you, as I'd suggested here: https://github.com/deanlandolt/bytewise/issues/16#issuecomment-94357592. Hacking on the underlying buffers would largely defeat the kinds of guarantees you'd be after with these kinds of highly targeted queries.

But again, this is just to add a little color to @dominictarr's answer, which is the correct way to reference an entire subspace. I'm closing this for now, but please reopen if we've misunderstood in any way.

loveencounterflow commented 9 years ago

@dominictarr thx for the suggestion but I typically need partial string matching. for example, I have a collection of so-called strokeorders (i.e. writing sequence of elementary strokes) that look like this:

[ 'os', [ 'strokeorder', '<352513553254>',          ], [ 'glyph': '𤿯', ], 0, ]
[ 'os', [ 'strokeorder', '<3525141121>',            ], [ 'glyph': '𠴦', ], 0, ]
[ 'os', [ 'strokeorder', '<35251454>',              ], [ 'glyph': '𨒡', ], 0, ]
[ 'os', [ 'strokeorder', '<3525152>',               ], [ 'glyph': '邭', ], 0, ]
[ 'os', [ 'strokeorder', '<352515251115115113541>', ], [ 'glyph': '𪚫', ], 0, ]
[ 'os', [ 'strokeorder', '<35251525112511511>',     ], [ 'glyph': '𪚧', ], 0, ]
[ 'os', [ 'strokeorder', '<352515251214251214>',    ], [ 'glyph': '𧑴', ], 0, ]
[ 'os', [ 'strokeorder', '<3525153>',               ], [ 'glyph': '劬', ], 0, ]

In order to find all and only those characters whose strokeorder starts with <352514, I need a pair of keys similar to

[ 'os', [ 'strokeorder', '<352514↓', ], ]
[ 'os', [ 'strokeorder', '<352514↑', ], ]

where the arrows indicate lowest and highest sentinels. Whichever technique is used to obtain such prefixes, it should ideally seamlessly work with other values, at least with numbers.

deanlandolt commented 9 years ago

I agree that the technique should work seamlessly with other values, though I'm not sure how it would apply to numbers. Can you give an example? Maybe if you're modeling something like compound numbers, e.g. complex or hypercomplex values, or maybe matrices (which can model just about anything) -- these may want special sort semantics anyway.

IIUC this issue only comes up with sequence-style types -- @dominictarr's null/undefined suggestion works nicely for structured sequences like arrays but falls short for strings or buffers. For buffers you should be able to just use \x00 and \x255 (these bytes get escaped when nested in arrays, but the escapement routine maintains ordering). Strings are even harder thanks to the subsequent utf8 encoding that happens. As you pointed out with your example here, this is clearly a bug. I've been happily laboring under the impression \uffff was Good Enough until you came along to show how unicode screws me again!

Speaking of -- I'm almost tempted to just move to UTF-16BE for simplicity. I haven't found any scenarios where round-tripping to and from UTF-8 breaks sort order but it's been a long time since I researched this and I'm not convinced it's perfectly safe, and your examples in that other issue make me feel even more squeamish about it. Any space efficiencies in the database or on the wire are likely to be mitigated by compression, so UTF-8's complexity doesn't seem worth the cost (for bytewise's very narrow use case, at least). Thoughts? I know it this doesn't fix your issue, but it should make it less insane to hack on raw string-encoded buffers without being a unicode expert...right?

For an actual fix I guess we'd need a wrapper object that allows you to provide a string and get a next or previous value (it'd just do the buffer hacking for you and give you back a sigil object that bytewise.encodes to the right bytes. The thing that was holding me up a way to help ensure these sigil objects can't be used as keys, but that's really not necessary.

The design I had in mind was to add functions to the base types module like

sorts.string.next(string) => Buffer
sorts.string.previous(string) => Buffer

We could add them for every type -- dense scalar types will return a valid instance, e.g. sorts.boolean.next(false) === true and probably should be idempotent at the boundaries, e.g. sorts.boolean.next(true) === true, or some such. And of course sorts.boolean.previous would do the inverse.

Numbers and dates could be considered "dense" in this sense -- sorts.number.next could return the next number -- e.g. number + Number.EPSILON for sufficiently small numbers or date values (I know I've seen this algorithm around somewhere -- might even be in the IEEE spec). Though I've been batting around the idea of allowing a suffix byte to make room for additional metadata to be encoded (one us case is to allow arbitrary precision numbers which sort correctly with floats but can carry an arbitrary amount of additional data after the 8 float bytes -- if we ever do this, next and previous would have to change accordingly).

And of course, sorts.string.next/sorts.string.previous, which is the only one that's especially valuable for most use cases. This should just be a matter of appending bytes after the encoding phase, but it'd be helpful if you could confirm that, or sketch an alternate algorithm that'd work.

Does this API sound reasonable? If so, I hope to land something soonish. If you can give me some guidance on the string case I could land this today or tomorrow and cut a new release.

dominictarr commented 9 years ago

The simplest way might just be to represent your stroke order structure as an array instead of as a string. Another technique that might be interesting is to think carefully about how to represent the end of the string.

I wrote this module a while back: https://github.com/dominictarr/between You can use it to generate strings such that you can always generate another string that fits inbetween any two other strings. The trick is just never allowing a string to end with the lowest sentinel (decimal numbers between 0-1 can contain 0 (such as 0.01) but the only decimal that may end in a 0 is 0.) this means you can always construct a number that comes before another number by decrementing the last value, adding 0 and then any other value. And you can make a higher sorting number just by incrementing or appending.

loveencounterflow commented 9 years ago

There's a lot of issues here, so I'll go forward in a piecemeal fashion.

@deanlandolt "move to UTF-16BE for simplicity" —please don't. The world is using UTF-8 these days, and while UTF-16BE is somewhat simpler conceptually, it still not without its own problems. Most notably, you'd still have to deal with those dreaded surrogate pairs. Some people may think that Unicode code points above U+ffff are 'esoteric' and 'never really needed in real-world applications' (they've seriously been called 'astral code points'), but i for one will always copy-and-paste Chinese characters from across the board to new applications and will react to any garbled data with ditching said piece of software. I've discovered that more recent editions of JEdit (which i once ditched in favor of Komodo edit for precisely this reason) manage to deal with 'astral' code points, and so should any other reasonable code that deals with text in 2015.

Concerning the bytewise encoding, there is a nice UTF-8 codepage layout table on Wikipedia which marks those bytes that are never legal in an UTF-8-encoded text; those are 0xc0, 0xc1, 0xf5, 0xf6, 0xf7, 0xf8, 0xf9, 0xfa, 0xfb, 0xfc, 0xfd, 0xfe and 0xff, 13 altogether if I'm counting right. My hunch is that these bytes should be used for type markers, because when you're scanning the bytes of an UTF-8-encoded string, you must currently check for a 0x00, which, although a legal UTF-8 byte, cannot appear inside a string that is an element of an array—using illegal bytes as type markers would both terminate the string and start a new value, or end the array. IMHO this is superior to the current approach, which uses two slightly differing string encodings if I'm not mistaken (0x00 is preserved in a string as such, but inside an array, it's changed to 0x01 0x01, and, consequently, 0x01 is expressed as 0x01 0x02).

"I haven't found any scenarios where round-tripping to and from UTF-8 breaks sort order but it's been a long time since I researched this and I'm not convinced it's perfectly safe, and your examples in that other issue make me feel even more squeamish about it. Any space efficiencies in the database or on the wire are likely to be mitigated by compression, so UTF-8's complexity doesn't seem worth the cost"

—UTF-8 is rather space-efficient and always preserves code point ordering, so there's nothing to be afraid of here. UTF-8 is conceptually more complex than UTF-16, but then it's universally available and, importantly, the default encoding of nearly 100% of current software. You'll be world-compatible using UTF-8, and "do strange things" otherwise. Given they're unused bytes and illegal byte sequences in UTF-8, it's very possible that some smart guy can come up with an encoding that is more space-efficient. But you'll be the only person in the world that uses that encoding which is not so cool.

BTW UTF-8 is not that space-inefficient; that is more of a pet peeve of people who've been brought up on ASCII (1 byte == 1 character). I just copied the opening paragraph of the Chinese Wikipedia on UTF-8 and encoded it; turns out it has 984 characters of Chinese and interspersed English; in UTF-8, that text occupies a mere 2246 bytes, which means that the average character needs a mere ≈2.3 bytes. That's not bad at all!

loveencounterflow commented 9 years ago

Next, here's how I found at least a temporary solution.

First, I moved from nested to flat lists; nested lists looked good at first but it's not clear that the extra structure is going to serve any real purpose. Means the data I'm serializing now looks like

[ 'os', 'strokeorder', '<352513553254>', 'glyph': '𤿯', 0, ]
[ 'os', 'strokeorder', '<3525141121>',   'glyph': '𠴦', 0, ]
[ 'os', 'strokeorder', '<35251454>',     'glyph': '𨒡', 0, ]
[ 'so', 'glyph': '𤿯', 'strokeorder', '<352513553254>', 0, ]
[ 'so', 'glyph': '𠴦', 'strokeorder', '<3525141121>',   0, ]
[ 'so', 'glyph': '𨒡', 'strokeorder', '<35251454>',     0, ]

Much better. Symbolically, this is ['so',SK,SV,OK,OV,IDX] for the index entries (S = subject, O = object, K = key, V = value; the 'so' I call the 'phrasetype').

Next, let's restrict the discussion to flat lists for the entries as such, strings for the SK, OK keys, and strings and numbers for the SV, OV values.

We can then do away with numbers immediately because when you want to scan a numerical range, there are only a few cases: x === n (trivial), m <= x <= n (we know how to do that), and x >= n, x <= n, for which we don't need byte-level trickery, since there are (thankfully) –Infinity and +Infinity.

This leaves us with the simplified question: "how do we write lte, gte sentinel byte sequences to capture ranges of string values in flat lists?", and the answer is both straightforward and simple. All that you ever wanted to search for under the circumstances as given are the two related tasks:

(1) given a phrasetype PTq and a partial first key K0q, find me all entries e whose first key (OKe in the case of 'os', SKe in the case of 'so') starts with the characters as given. This is a meaningful question in so far as you may want to use structured keys, such as, say, name/first, name/middle, name/last, to obtain meaningful and convenient first-order aggregates of data.

(2) given a phrasetype PTq and a fully qualified first key K0q (again, OK or SK) and a (possibly empty) prefix VP0q, find me all entries whose PTe and K0e are equal to PTq, K0q, and whose V0e is a proper characterwise prefix (initial substring) of V0q.

Turns out both requirements can be satidsfied with the following simple procedure, here expressed in CoffeeScript:

@_query_from_partial_key = ( db, pkey ) ->
  for element in pkey
    throw new Error "illegal prefix-key: #{rpr pkey}" unless CND.isa_text element
  base                = @_encode db, pkey
  length              = base.length
  throw new Error "illegal prefix-key: #{rpr pkey}" unless base[ length - 1 ] is 0x00
  throw new Error "illegal prefix-key: #{rpr pkey}" unless base[ length - 2 ] is 0x00
  base[ length - 2 ]  = 0xff
  gte                 = base.slice 0, length - 2
  lte                 = base.slice 0, length - 1
  return { gte, lte, }

In essence, what the above does is checking that (some of) our restraints hold; in that case, the bytewise encoding of the partial key pkey will end in two bytes. We then obtain the lower key by making a slice that excludes the trailing two zero bytes, and the upper one by a slice that is one byte longer and has that byte set to 0xff (yes, you could leave in those zero bytes in the lower key).

This seemingly excludes a lot of conceivable use cases, but a closer look reveals:

numbers are better dealt with by encoding JS numbers anyway, as shown above;
dates are similar to numbers, so the same applies to them;
we may discard searching for 'objects' (hashes, (plain old) dictionaries (PODs)) right away as those have a key/value ordering issue (i.e. you generally need additional code to ensure that none of your {foo:1,bar:1} objects ended up as {bar:1,foo:1}, and if you do that, you might as well use lists right away);
apart from numbers and PODs, the remaining interesting values are largely restricted to null, 'false,undefined` and so on, which do not lend themselves for range queries.

And yes, it would be possible to encode the strokeorder examples using lists of digits, but, given my data and what processing routines I have developed over the time, that would be counter all experience and also rather inefficient. To me the phrase "if you want to search string prefixes, convert all strings to lists of characters" does not match with the phrase "bytewise is a good fit for my application" (which I believe it is).

deanlandolt commented 9 years ago

@loveencounterflow thanks for the detailed responses. I think I know enough to move forward with next and previous helpers for hacking encoded strings.

"move to UTF-16BE for simplicity"—please don't. The world is using UTF-8 these days, and while UTF-16BE is somewhat simpler conceptually, it still not without its own problems. Most notably, you'd still have to deal with those dreaded surrogate pairs.

FWIW I completely agree re: respecting code points outside the BMP (I didn't realize astral was a derogatory term :)). But ISTM the traditional arguments for preferring UTF8 don't really apply in the context of embedded encodings in something like bytewise. For instance, the world also uses IEEE 754 for encoding numbers, but we have no choice but to break from conventional wisdom here to preserve ordering, encoding negative numbers with the bits inverted. If the sort semantics of any legal javascript string are compromised by a UTF8 round-trip I'd say that's a bug.

The goal is to also preserve ordering semantics for all legal javascript strings as well, which is why I'm suspect of UTF8 (or UTF16 for that matter). IIRC what I'm thinking might actually be called UCS2 -- no notion of code points at all, just a straight embedding of javascript "characters" (in the imprecise String.fromCharCode sense) that can be faithfully revived into an identical string -- even if it's invalidinvalid from a unicode perspective. IIUC this is impossible with UTF8, but please let me know if this analysis is flawed.

Regardless, any change to string encoding would be breaking, so it'd be opt-in anyway.

deanlandolt commented 9 years ago

we may discard searching for 'objects' (hashes, (plain old) dictionaries (PODs)) right away as those have a key/value ordering issue

Just FYI, objects are treated as ordered maps -- enumeration order of keys is guaranteed by the language semantics. It's a PITA to change the order of keys (requires a shallow copy) so it's not such a great option for mutable hashes, but could be used sensibly with classes, which would allow you to shape the enumeration order for the base set of keys where ordering is relevant. But yeah -- you're probably better with a list :)

loveencounterflow commented 9 years ago

@deanlandolt "order of keys is guaranteed by the language semantics"—almost but not quite. For one thing it is nowhere (in the classical standard at least) said that ordering of keys should be preserved in any way; it is just a 'living standard' that browser makers turned out to make it so that ordering of insertion is preserved. By and large. Because when the Chrome team stepped in, they made a stupid micro-optimization that treats key differently when they look like spelling out JS numbers (see e.g. here), which means that ordering now does depend on the environment you're in, making it unusable or at least very hard to use right if you're writing a cross-platform library. In the case of bytewise I think this means that users are responsible for whatever ordering, and they should be warned that at least this one factor makes objects a bad choice for building e.g. LevelDB indexes with them.

loveencounterflow commented 9 years ago

@deanlandolt "what I'm thinking might actually be called UCS2 -- no notion of code points at all, just a straight embedding of javascript "characters" (in the imprecise String.fromCharCode sense) that can be faithfully revived into an identical string -- even if it's invalidinvalid from a unicode perspective. IIUC this is impossible with UTF8" —it is much more possible to store, sort, and retrieve arbitrary Unicode strings across all planes using UTF-8. You do have to be on the guard concerning the JS API in places for historical reasons. In short, prefer codePointAt and fromCodePoint over the older charCode API (it is very much possible to just copy-and-paste the code from the MDN site or another source to make it so that it just works transparently).

After years of working daily with Unicode in Python and JS I think I can clearly say:

(1) do not use UCS-2; it has been an outdated standard ever since Unicode v2 appeared 20 years ago; it can as such not represent code points outside the BMP;

(2) UTF-16LE is capable of representing all code points in either 2 or 4 bytes (BTW NodeJS treats 'ucs2' as a synonym for 'utf16le', which is not quite correct IMO); it is actually slightly less efficient than UTF-8 for some strings b/c UTF-8 encodes some characters using 1 or 3 bytes.

(3) do embrace the word "code point", it is a good one. When talking about strings on the level that we're on here we never care about "characters" or "glyphs", all that we're concerned with is the faithful and suitable representation of series of integer code points.

(4) again, UTF-8 preserves the lexicographical ordering of integer code points across all Unicode planes, so there's nothing to worry about here.

As for breaking changes to string encoding, that's what semantic versioning is for, i guess.

deanlandolt commented 9 years ago

Alright, I'm satisfied that UTF-8 should be fine. Thanks again for your patience w/ this unicode newb :)

I'll get those next/previous helpers in place and close this ticket.

Re: breaking changes, every breaking change to a serialization format like this creates a rift between versions -- IMHO you may as well cut a new library with a distinct name than bump your version with a breaking change. The latter could subtly screw users. So my strategy is to make breaking changes opt-in, and eventually create a new lib (e.g. bytewise2 or something like that) that inverts opt-in behavior to opt-out, and provides some conversion utilities.

deanlandolt commented 9 years ago

Still a little more work to do but last night I cut a bytewise v1.0.0 release and started pushing some prelim support for upper and lower bound generators to support various query use cases. Here are some tests that should give an idea of the API surface: https://github.com/deanlandolt/bytewise-core/blob/master/test/ranges.js

The API is pretty simple -- bw.bound.lower() and bw.bound.upper() are the exclusive-ended analogs of null and undefined (which always felt sloppy to me). You can get a lower and upper bound on each type, e.g. bw.sorts.string.bound.lower(), bw.sorts.date.bound.upper() (which fills the gap created by the lack of an Infinity date), etc. For complex types (buffer, string, array) you can also provide a prefix arg to the lower or upper bound functions, e.g. bw.sorts.string.lower('foo') will sort after the "foo" string but before 4-character strings starting with "foo", and bw.sorts.string.bound.upper('foo') sorts after any "foo"-prefixed string but before "fop".

The same idea will apply for buffers or arrays -- bw.sorts.array.bound.lower([ 'foo', 'bar' ]) will sort after [ 'foo', 'bar' ] and before [ 'foo', 'bar', ... ], while bw.sorts.array.bound.upper([ 'foo', 'bar' ]) will sort after [ 'foo', 'bar', ... ] but before [ 'foo', 'bas' ].

These boundary objects can be nested when creating encoding data:

{
  gt: bw.encode([ 'foo', [ 'bar', bw.sorts.string.upper('baz') ] ]) },
  lt: bw.encode([ 'foo', [ 'bar', bw.sorts.string.upper('baz') ] ]) }
}

The string prefixing stuff just appends 0xff as the upper bound byte, which should be totally safe. This was my original suspicion and after rereading @loveencounterflow's comments (particularly this one) I see this is actually what was being suggested (I think we were talking past each other a bit due to some sloppy word choices on my part).

When done outside the encoding routine 0xff is an ideal upper bound for unicode since no encoded utf8 byte could ever be 0xff. Exclusive lower bounds are a little trickier due to null bytes being valid utf8 and the fact that we're already putting them to use for array element separation and termination. There's no way to make an exclusive lower bound that's above 'foo' and below 'foo\x00', but that's probably alright -- and you can already use gt: 'foo' for this case anyway. It would be nice if we had some utilities for generating complete range specs (already in the works via bytewise-uri), but this is at a lower level and is intentionally query-agnostic, so it would be nice if we can say we're consistently generating exclusive bounds. Whether or not they're also valid to use inclusive bounds is type-specific -- though we should probably track that on the generated object so we could allow query tooling to warn or throw when they're given an invalid spec.

The implementation currently uses a 0x00 byte but I'm changing that now so that lower bounds with prefixes are just identity functions, with the documented caveat that they should be used with gt in range specs. I was hoping to find a serialization that was distinct from any legal encoded key but that's probably a lost cause.

I regret not using consistent escapement rules for strings and buffers outside of arrays -- it would have allowed for simpler code and cheap buffer concatenation when creating arrays. I also wish I'd shifted the escaped byte a lot further, leaving some room to encode extensions in those currently inaccessible extra bits we're leaving on the table. Ah well.

Anyway, that's the API and rationale. Feedback or patches very welcome -- I'm hoping the API is flexible enough to support all the various query use cases.

deanlandolt commented 9 years ago

Other than some docs I think this is just about done.