Open ColinH opened 5 years ago
Not sure if I understand the problem entirely, but the padding in general solves the following problem when building VelocyPack objects/arrays programmatically:
using namespace arangodb::velocypack;
Builder b;
b.openArray();
// until here we don't know how many members there will be in the array, and how big their total size will be
b.add(...);
b.add(...);
// only upon closing we know that there will be no further values, and how big the bytesize of the Array will be
b.close();
Upon opening the array, we always add the padding. And actual payload is added at the start byte + 9 bytes offset. If upon closing the bytesize of the array is less than 2^16, we have two options:
That latter reason is why there is optional padding allowed for array types.
And as it is always unclear at start (upon opening) how many array members and payload size will follow, the padding is always considered a valid option. We can of course force memmoving the data, so there will never be padding. I agree this would simplify decoding logic, however, wouldn't that penalize larger arrays performance-wise, when there is no choice whether there can be padding or not?
Just to clarify, I'm suggesting a third option for this case: Leave the array type as 0x05
, which indicates an 8-byte size, independent of whether the size would fit in 4, 2 or 1 byte.
I'm not saying I don't see the value of leaving 8 bytes for the size, and then not moving the data; I just don't see the advantage of using, say, type 0x03
, which indicates a 2-byte size, with 6 bytes of padding, over plain type 0x05
.
WDYT about making the padding more controllable in the Builder, e.g. via https://github.com/arangodb/velocypack/pull/39?
Let's use an example, an array consisting of the single number 1. We have seen that, assuming that the encoder doesn't know the (size of the) total size in advance and leaves 8 bytes for the size before encoding the array elements, there are three natural ways to encode this array without an index table:
1st: 02 03 31 // with memmove
2nd: 02 0a 00 00 00 00 00 00 00 31 // 1-byte size with padding, no memmove
3rd: 05 0a 00 00 00 00 00 00 00 31 // 8-byte size without padding, no memmove
To answer your question, it sounds like a nice idea to make the encoder more flexible! Perhaps with two configuration parameters:
My suggestion here is unchanged, rather than giving the choice between the 1st and 2nd possibilities I'd give it between the 1st and 3rd, get rid of the size padding for types 0x02 thru 0x04 entirely, and then simplify the decoder accordingly...
Of course either configuration parameter can be implemented independent of the other, and one might want to add a further parameter that says "until size x use memmove, above that use the type 0x05 (or the one that was pre-chosen by the other config parameter).
Does this make sense?
I see. I have pushed another commit (https://github.com/arangodb/velocypack/pull/39/commits/ce824eead5fc12ae47a3b37964cadb7832219084) that will dismiss Array types 0x02, 0x03 and 0x04 in favor of using type 0x05 when padding is explicitly requested. That should implement your suggestion, and I don't see any downsides of doing this.
However, for downwards-compatibility the Builder
will still be in its original mode, meaning it will use array types 0x02 thru 0x05 as it sees fit. Only by setting the paddingBehavior
to UsePadding
it will switch to the new behavior.
I am a bit against adding much more tuning knobs here, as we will then have to add different parameters for Arrays and non-indexed Arrays etc, which I would like to avoid.
I would like to get away with that single paddingBehavior
option if possible.
WDYT?
Sounds good, then the padding with 0x02/3/4 and zero-bytes can be deprecated and at some point removed.
However regarding backwards compatibility, the important part is that the padding with 0x02/3/4 and zero-bytes is still recognised and can still be decoded and used, not that it can still be generated?
PS: I didn't want to necessarily recommend adding all possible tuning knobs, I was just wondering out loud what else might make sense in the given context.
Yes, our implementation will recognize the 0x02, 0x03, 0x04 and 0x05 formats. I will add yet another validator test to ensure this is also test-covered there.
Regarding generating the "new" format: I am pretty sure other implementations will be able to handle 0x05 types with "small" contents, but I am not 100% sure as I haven't tested most of the other implementations so far. I would like to make sure that at least our Java & golang implementations of VelocyPack can handle the new behavior without problems, and then I am happy to turn the old behavior off by default in our implementation.
For arrays without index tables I can see no benefit in allowing the optional padding after the length. It adds complexity to both the encoder (if it wants to use this feature) and the decoder (which always needs to check for this case) without saving any space in the encoding.
In other words, I doubt that decoding a, say, 16bit integer and then checking for whether there is a zero-byte, and if so then checking for 6 zero-bytes, is noticeably slower than simply decoding a 64bit integer...
If using this padding with value types
0x02
,0x03
and0x04
effectively does the same as using0x05
in the first place, could0x05
not be made the only possibility for encoding a length in 8 bytes?(It's not like the case of arrays with index tables where the encoding of other parts of the array depends on the size of the encoding of the length of the array.)