Open Zankoku-Okuno opened 2 years ago
Yes, I think you got it right. "7 LSB" really was meant to mean "use the lowest 7 bits" (leave highest bit -- sign bit -- zero).
And I did indeed drop the ball on clarification. But in practical terms the implementation is indeed the definition: if and when padding is on MSB (and not like spec claims, LSB), that is the Right encoding. I will now go and double-check encoder just to make sure.
Ok, that's great. We're also putting together a set of test vectors, and I accidentally wrote a right-padding encoder on my way to a left-padding one, so I should have all my bases covered in case testing reveals something strange.
After we've got our encoder successfully sending data downstream (to Elastic, which I expect uses the reference implementation), I can send a PR if you'd like.
@Zankoku-Okuno yes. You are right. I tried to improve description (wrt https://github.com/FasterXML/jackson-dataformats-binary/issues/37) , but would appreciate help in better wording.
Ah, now that I've "graduated" from the array encoder to writing the actual number encoder, I notice also that the spec doesn't state whether the byte arrays that encode are big- or little-endian; looking at the Java documentation, I'm assuming big-endian (which is also consistent with the rest of the spec), but it would be good to state this explicitly.
Aha, and here's an interesting point that I had missed in the javadocs and had me going around in circles for a while:
Returns the scale of this BigDecimal. If zero or positive, the scale is the number of digits to the right of the decimal point. If negative, the unscaled value of the number is multiplied by ten to the power of the negation of the scale. For example, a scale of -3 means the unscaled value is multiplied by 1000.
If you remember anything else Java-specific that's made its way into the spec please leave a comment; I'll include it in my clarification PR.
I am not quite sure what you mean by big- vs little-endian wrt byte arrays: endianness is a property of multi-byte values isn't it? Endian-ness should be consistently applied and I think, big-endian (common with IETF network protocols maybe due to Sun legacy).
And yes, encoding of BigDecimal
and BigInteger
from JDK are the most notable Java-specific cases.
I cannot think other dependencies off-hand, except maybe the limitation for length values to be 32-bit (due to Java array length
being similarly limited).
I'm running into some issues determining a precise meaning of the specification for encoding BigInteger and BigDecimal.
The specification states:
Which led me to believe that 7 LSB is a well-known encoding method that I simply hadn't yet heard of. Unfortunately, google searches for
"7 LSB" encoding
(and similar) only turned upDigging deeper led to FasterXML/jackson-dataformats-binary#37, which has been open and afaict essentially untouched for five years now. At this point I have to assume that the implementation has won by default, and that's how we'll be proceeding with our own encoder unless integration testing says otherwise.
Anyway, I figured I'd show up to confirm y'all's notion of safe binary encoding (or 7 LSB, or 7/8 encoding) according to the behavior of the implementation:
Safe binary encoding is an 8-bit clean encoding of arbitrary byte arrays. The byte array is segmented into seven-bit groups from lowest-to-highest index and most-to-least significant bit within each byte. Each group is then padded with leading zero bits until the group is the size of a full octet. Note that this means that each full seven bit group is prepended with a single leading zero bit, but the last group (which may have fewer than seven bits) contains its data in the rightmost (least-significant) bits. (Note: the standard at time of writing specifies that a partial trailing group contains padding in its "rightmost, least-significant bits". I'm honestly not sure which is more "elegant" and don't really have an opinion other than "break the fewest things".)
As an example, the encoding of the array
{0xDE, 0xAD, 0xBE, 0xEF}
is performed as follows: