Open deanlandolt opened 9 years ago
String escapement now fixed on master.
I also went ahead and implemented the change to encode top level strings and buffers the same as nested ones in a branch: https://github.com/deanlandolt/bytewise-core/tree/escapement
This will need to go behind a flag to avoid a major version bump. My thinking is that anything breaking would go behind a flag, and when we cut releases we can cut two simultaneously -- a 1.0 with the new feature flags unset by default and a 2.0-rc with some or all of them set to true by default. This would make it a lot easier to transition between versions. Once we finish 2.0 and start working on a 3.0-rc we can do the same thing, cutting three releases with the appropriate feature flags. That way we don't have to maintain multiple dev lines -- just one master with all the latest and greatest fixes.
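To make the flag scheme concrete, here is a minimal sketch of how per-release defaults could be derived from a single table. It is written in Python for brevity and is not bytewise code; the flag name `escape_top_level` and both helper functions are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical flag table: flag name -> first major version where it defaults to on.
# Neither the name nor the version number comes from bytewise itself.
BREAKING_FLAGS: Dict[str, int] = {
    "escape_top_level": 2,
}


def default_flags(major: int) -> Dict[str, bool]:
    """Return the default flag values for a given major release line."""
    return {name: major >= since for name, since in BREAKING_FLAGS.items()}


def resolve_flags(major: int, overrides: Optional[Dict[str, bool]] = None) -> Dict[str, bool]:
    """Apply caller overrides on top of the defaults for a release line."""
    flags = default_flags(major)
    for name, value in (overrides or {}).items():
        if name not in flags:
            raise KeyError("unknown feature flag: " + name)
        flags[name] = value
    return flags
```

With a table like this, the 1.0 and 2.0-rc builds differ only in the major number passed to `default_flags`, and a 1.0 user can opt in early through `resolve_flags(1, {"escape_top_level": True})`.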
The escapement procedure for strings and buffers we're using today is pretty goofy, and it's particularly problematic that there are two distinct serializations depending on whether the value is top level or nested in an array.
@loveencounterflow proposed that we use invalid utf8 bytes for array separation and termination, but I can't see a clean way to make this work as the array separator char absolutely MUST sort above any valid element. This would be a pretty radical change (unless I'm misunderstanding)...
As a simpler alternative, I'd like to move to something like modified UTF8 for string encoding. It's not clear whether this completely preserves sorting behavior, but if it does it would be nice to use something so standard -- we could probably even drop in a native serializer module for added efficiency.
If modified UTF8 doesn't preserve sort semantics we could probably just stick with the existing encoding routine (though I noticed it also does high-byte escaping the same as buffer, which we can safely eliminate).
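The sort-order question can be checked directly. The sketch below assumes "modified UTF8" means the Java-style variant, where U+0000 is encoded as the two bytes `0xC0 0x80` and characters above U+FFFF are encoded as a surrogate pair of three bytes each. It is Python written for illustration, not a drop-in serializer, and the helper names are made up.

```python
def encode_modified_utf8(text: str) -> bytes:
    """Encode text as Java-style modified UTF-8 (assumed variant, see above)."""
    out = bytearray()
    # Modified UTF-8 works on UTF-16 code units, so go through UTF-16 first.
    utf16 = text.encode("utf-16-be", "surrogatepass")
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        if unit == 0:
            out += b"\xc0\x80"            # NUL gets an overlong two-byte form
        elif unit < 0x80:
            out.append(unit)              # plain ASCII
        elif unit < 0x800:
            out.append(0xC0 | (unit >> 6))
            out.append(0x80 | (unit & 0x3F))
        else:                             # includes each half of a surrogate pair
            out.append(0xE0 | (unit >> 12))
            out.append(0x80 | ((unit >> 6) & 0x3F))
            out.append(0x80 | (unit & 0x3F))
    return bytes(out)


def byte_order(strings):
    """Sort strings by the bytewise order of their modified UTF-8 encoding."""
    return sorted(strings, key=encode_modified_utf8)
```

Under this variant, sorting differs from code point order in two places. U+0000 becomes `C0 80`, so it sorts after every other ASCII character rather than before them. Characters above U+FFFF start with `0xED`, so they sort before U+E000 through U+FFFF, which matches UTF-16 code unit order rather than code point order. Apart from the NUL case, the byte order follows UTF-16 code unit order.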
For the buffer serialization we'll absolutely need to preserve low and high byte escaping in nested arrays, but we should really be doing this for top level encoded buffers as well for consistency. The escapement procedure we're using today leaves some bits on the table, so it would be nice to fix this up at the same time -- or to find a more standard approach that preserves lexicographical order of buffers.
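For reference, here is a minimal sketch of one low and high byte escaping scheme that preserves lexicographical order. The exact byte choices are an assumption for illustration and may not match what bytewise-core does: `0x00` and `0x01` are written as `0x01` followed by the byte plus one, `0xfe` and `0xff` are written as `0xfe` followed by the byte minus one, and `0x00` terminates the value. It is in Python for brevity.

```python
def escape(data: bytes) -> bytes:
    """Escape low and high bytes, then append a 0x00 terminator."""
    out = bytearray()
    for b in data:
        if b <= 0x01:
            out += bytes((0x01, b + 1))   # 0x00 -> 01 01, 0x01 -> 01 02
        elif b >= 0xFE:
            out += bytes((0xFE, b - 1))   # 0xfe -> fe fd, 0xff -> fe fe
        else:
            out.append(b)                 # everything else passes through
    out.append(0x00)                      # terminator sorts below any escaped byte
    return bytes(out)


def unescape(encoded: bytes) -> bytes:
    """Invert escape(); assumes well-formed input produced by escape()."""
    if not encoded or encoded[-1] != 0x00:
        raise ValueError("missing 0x00 terminator")
    body = encoded[:-1]
    out = bytearray()
    i = 0
    while i < len(body):
        b = body[i]
        if b == 0x01:
            out.append(body[i + 1] - 1)
            i += 2
        elif b == 0xFE:
            out.append(body[i + 1] + 1)
            i += 2
        else:
            out.append(b)
            i += 1
    return bytes(out)
```

Because the escaped body never contains `0x00` or `0xff`, the terminator is unambiguous inside an array, and a value that is a prefix of another still sorts first. Applying the same function to top level values would give a single serialization for both cases, at the cost of one terminator byte plus one extra byte per escaped byte.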
Two different encodings for strings and buffers was a foolish optimization -- especially since the primary use case for `bytewise` is structured (array-based) keys. At the very least we should move to consistently escaping top level strings and buffers. Anyone have any arguments against fixing this?
My thinking is, if we're going to go through the trouble of making such a change, we may as well get the escapement right while we're at it. But given the vast majority of bytewise users are already using nested strings and buffers, perhaps the safer course of action would be to start by just escaping top level strings and buffers consistently with the nested versions. We could offer a flag for back-compat, but I doubt there'll be much demand.
I'd also like to fix the utf8 escaping mistake (where we're escaping `0xff`, which can't exist in a utf8-encoded buffer). This should be completely safe as the only other byte that'd be touched by this process, `0xfe`, also can't exist -- so no existing string data should be affected.
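The claim that neither byte can occur is easy to confirm by brute force. The sketch below (Python, with a made-up helper name) encodes every Unicode scalar value and collects the byte values that show up; well-formed UTF-8 lead bytes go no higher than `0xF4` and continuation bytes no higher than `0xBF`.

```python
def utf8_byte_values() -> set:
    """Collect every byte value that appears in some well-formed UTF-8 encoding."""
    seen = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not valid scalar values in UTF-8
        seen.update(chr(cp).encode("utf-8"))
    return seen
```

This only covers well-formed UTF-8. A string encoder that emits something else (for example a CESU-8 or modified UTF-8 variant) would need its own check, though neither of those produces `0xfe` or `0xff` either.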