Optional padding after the length of arrays without index tables?

ColinH commented 5 years ago

For arrays without index tables I can see no benefit in allowing the optional padding after the length. It adds complexity to both the encoder (if it wants to use this feature) and the decoder (which always needs to check for this case) without saving any space in the encoding.

In other words, I doubt that decoding a, say, 16bit integer and then checking for whether there is a zero-byte, and if so then checking for 6 zero-bytes, is noticeably slower than simply decoding a 64bit integer...

If using this padding with value types 0x02, 0x03 and 0x04 effectively does the same as using 0x05 in the first place, could 0x05 not be made the only possibility for encoding a length in 8 bytes?

(It's not like the case of arrays with index tables where the encoding of other parts of the array depends on the size of the encoding of the length of the array.)

jsteemann commented 5 years ago

Not sure if I understand the problem entirely, but the padding in general solves the following problem when building VelocyPack objects/arrays programmatically:

using namespace arangodb::velocypack;

Builder b;
b.openArray();
// until here we don't know how many members there will be in the array, and how big their total size will be

b.add(...);
b.add(...);

// only upon closing we know that there will be no further values, and how big the bytesize of the Array will be
b.close();

Upon opening the array, we always add the padding. And actual payload is added at the start byte + 9 bytes offset. If upon closing the bytesize of the array is less than 2^16, we have two options:

move down the data from start byte + 9 to start byte + 3 (or start byte + 5), we may get a more compact representation of the data. This makes sense for small arrays and if there are lots of them.
don't move down the data, but leave the padding in place. This makes sense to avoid the costs of memmove, which are proportional to the size of its input. For larger arrays, not running memmove just wastes a few bytes of storage, but the relative waste is very low. However, the cost savings of avoiding memmove will be significant.

That latter reason is why there is optional padding allowed for array types.

And as it is always unclear at start (upon opening) how many array members and payload size will follow, the padding is always considered a valid option. We can of course force memmoving the data, so there will never be padding. I agree this would simplify decoding logic, however, wouldn't that penalize larger arrays performance-wise, when there is no choice whether there can be padding or not?

ColinH commented 5 years ago

Just to clarify, I'm suggesting a third option for this case: Leave the array type as 0x05, which indicates an 8-byte size, independent of whether the size would fit in 4, 2 or 1 byte.

I'm not saying I don't see the value of leaving 8 bytes for the size, and then not moving the data; I just don't see the advantage of using, say, type 0x03, which indicates a 2-byte size, with 6 bytes of padding, over plain type 0x05.

jsteemann commented 5 years ago

WDYT about making the padding more controllable in the Builder, e.g. via https://github.com/arangodb/velocypack/pull/39?

ColinH commented 5 years ago

Let's use an example, an array consisting of the single number 1. We have seen that, assuming that the encoder doesn't know the (size of the) total size in advance and leaves 8 bytes for the size before encoding the array elements, there are three natural ways to encode this array without an index table:

1st: 02 03 31  // with memmove

2nd: 02 0a 00 00 00 00 00 00 00 31 // 1-byte size with padding, no memmove

3rd: 05 0a 00 00 00 00 00 00 00 31 // 8-byte size without padding, no memmove

To answer your question, it sounds like a nice idea to make the encoder more flexible! Perhaps with two configuration parameters:

The initial size to leave for the size, i.e. 1, 2, 4 or 8 corresponding to value types 0x02 thru 0x05.
For 2, 4 and 8 byte sizes: What to do when the final size is small enough to fit with a smaller value type.

My suggestion here is unchanged, rather than giving the choice between the 1st and 2nd possibilities I'd give it between the 1st and 3rd, get rid of the size padding for types 0x02 thru 0x04 entirely, and then simplify the decoder accordingly...

Of course either configuration parameter can be implemented independent of the other, and one might want to add a further parameter that says "until size x use memmove, above that use the type 0x05 (or the one that was pre-chosen by the other config parameter).

Does this make sense?

jsteemann commented 5 years ago

I see. I have pushed another commit (https://github.com/arangodb/velocypack/pull/39/commits/ce824eead5fc12ae47a3b37964cadb7832219084) that will dismiss Array types 0x02, 0x03 and 0x04 in favor of using type 0x05 when padding is explicitly requested. That should implement your suggestion, and I don't see any downsides of doing this.

However, for downwards-compatibility the Builder will still be in its original mode, meaning it will use array types 0x02 thru 0x05 as it sees fit. Only by setting the paddingBehavior to UsePadding it will switch to the new behavior. I am a bit against adding much more tuning knobs here, as we will then have to add different parameters for Arrays and non-indexed Arrays etc, which I would like to avoid. I would like to get away with that single paddingBehavior option if possible. WDYT?

ColinH commented 5 years ago

Sounds good, then the padding with 0x02/3/4 and zero-bytes can be deprecated and at some point removed.

However regarding backwards compatibility, the important part is that the padding with 0x02/3/4 and zero-bytes is still recognised and can still be decoded and used, not that it can still be generated?

PS: I didn't want to necessarily recommend adding all possible tuning knobs, I was just wondering out loud what else might make sense in the given context.

jsteemann commented 5 years ago

Yes, our implementation will recognize the 0x02, 0x03, 0x04 and 0x05 formats. I will add yet another validator test to ensure this is also test-covered there.

Regarding generating the "new" format: I am pretty sure other implementations will be able to handle 0x05 types with "small" contents, but I am not 100% sure as I haven't tested most of the other implementations so far. I would like to make sure that at least our Java & golang implementations of VelocyPack can handle the new behavior without problems, and then I am happy to turn the old behavior off by default in our implementation.

arangodb / velocypack

Optional padding after the length of arrays without index tables? #38