FasterXML / smile-format-specification

New home for Smile format (https://en.wikipedia.org/wiki/Smile_(data_interchange_format))
BSD 2-Clause "Simplified" License
92 stars 14 forks source link

Clarification of "Safe" binary encoding #19

Open Zankoku-Okuno opened 2 years ago

Zankoku-Okuno commented 2 years ago

I'm running into some issues determining a precise meaning of the specification for encoding BigInteger and BigDecimal.

The specification states:

"Safe" binary encoding simply uses 7 LSB: data is left aligned (i.e. any padding of the last byte is in its rightmost, least-significant, bits).

Which led me to believe that 7 LSB is a well-known encoding method that I simply hadn't yet heard of. Unfortunately, google searches for "7 LSB" encoding (and similar) only turned up

Digging deeper led to FasterXML/jackson-dataformats-binary#37, which has been open and afaict essentially untouched for five years now. At this point I have to assume that the implementation has won by default, and that's how we'll be proceeding with our own encoder unless integration testing says otherwise.

Anyway, I figured I'd show up to confirm y'all's notion of safe binary encoding (or 7 LSB, or 7/8 encoding) according to the behavior of the implementation:

Safe binary encoding is an 8-bit clean encoding of arbitrary byte arrays. The byte array is segmented into seven-bit groups from lowest-to-highest index and most-to-least significant bit within each byte. Each group is then padded with leading zero bits until the group is the size of a full octet. Note that this means that each full seven bit group is prepended with a single leading zero bit, but the last group (which may have fewer than seven bits) contains its data in the rightmost (least-significant) bits. (Note: the standard at time of writing specifies that a partial trailing group contains padding in its "rightmost, least-significant bits". I'm honestly not sure which is more "elegant" and don't really have an opinion other than "break the fewest things".)

As an example, the encoding of the array {0xDE, 0xAD, 0xBE, 0xEF} is performed as follows:

DE AD BE EF
11011110 10101101 10111110 11101111          // convert to binary
1101111 0101011 0110111 1101110 1111         // segment into 7-bit groups
01101111 00101011 00110111 01101110 1111     // prepend zero bit on each full group
01101111 00101011 00110111 01101110 00001111 // prepend zero padding on the trailing partial group
6F 2B 37 6E 0F                               // convert to octets
cowtowncoder commented 2 years ago

Yes, I think you got it right. "7 LSB" really was meant to mean "use the lowest 7 bits" (leave highest bit -- sign bit -- zero).

And I did indeed drop the ball on clarification. But in practical terms the implementation is indeed the definition: if and when padding is on MSB (and not like spec claims, LSB), that is the Right encoding. I will now go and double-check encoder just to make sure.

Zankoku-Okuno commented 2 years ago

Ok, that's great. We're also putting together a set of test vectors, and I accidentally wrote a right-padding encoder on my way to a left-padding one, so I should have all my bases covered in case testing reveals something strange.

After we've got our encoder successfully sending data downstream (to Elastic, which I expect uses the reference implementation), I can send a PR if you'd like.

cowtowncoder commented 2 years ago

@Zankoku-Okuno yes. You are right. I tried to improve description (wrt https://github.com/FasterXML/jackson-dataformats-binary/issues/37) , but would appreciate help in better wording.

Zankoku-Okuno commented 2 years ago

Ah, now that I've "graduated" from the array encoder to writing the actual number encoder, I notice also that the spec doesn't state whether the byte arrays that encode are big- or little-endian; looking at the Java documentation, I'm assuming big-endian (which is also consistent with the rest of the spec), but it would be good to state this explicitly.

Aha, and here's an interesting point that I had missed in the javadocs and had me going around in circles for a while:

Returns the scale of this BigDecimal. If zero or positive, the scale is the number of digits to the right of the decimal point. If negative, the unscaled value of the number is multiplied by ten to the power of the negation of the scale. For example, a scale of -3 means the unscaled value is multiplied by 1000.

If you remember anything else Java-specific that's made its way into the spec please leave a comment; I'll include it in my clarification PR.

cowtowncoder commented 2 years ago

I am not quite sure what you mean by big- vs little-endian wrt byte arrays: endianness is a property of multi-byte values isn't it? Endian-ness should be consistently applied and I think, big-endian (common with IETF network protocols maybe due to Sun legacy).

And yes, encoding of BigDecimal and BigInteger from JDK are the most notable Java-specific cases. I cannot think other dependencies off-hand, except maybe the limitation for length values to be 32-bit (due to Java array length being similarly limited).