FasterXML / jackson-dataformats-binary

Uber-project for standard Jackson binary format backends: avro, cbor, ion, protobuf, smile
Apache License 2.0
316 stars 136 forks source link

SMILE format specification - bug in "safe binary" encoding #37

Open SheliakLyr opened 7 years ago

SheliakLyr commented 7 years ago

SMILE specification contains the following statements:

  • "Big" decimal/integer values use "safe" binary encoding
  • "Safe" binary encoding simply uses 7 LSB: data is left aligned (i.e. any padding of the last byte is in its rightmost, least-significant, bits).

Let's consider a simple array: Array(0x01). After encoding it to 7LSB it will use 2 bytes. According to the specification "any padding of the last byte is in its rightmost, least-significant, bits", so it should look like this:

_0000000 _1pppppp   (hex: 0x0040)
where:
_ - unused byte (0)
p - padding (0)

However, the jackson library behaves differently. The following code (Scala) encodes a BigInteger (1).

object SmileTestDataGenerator {
  def main(args: Array[String]) {
    val sf = new SmileFactory()
    val os = new ByteArrayOutputStream()
    val gen = sf.createGenerator(os)
    gen.writeNumber(BigInteger.valueOf(1))
    gen.close()
    println(DatatypeConverter.printHexBinary(os.toByteArray))
  }
}

Output:

3A290A0126810001

After removing header/type token/content length we are left with 0001! I've discovered that padding is located in the last byte but in its LEFT-MOST, most-significant bits. So an array(0x01) is encoded to: _0000000 pppppp_1

Am I right? Is the specification wrong or is it a bug in implementation?

cowtowncoder commented 7 years ago

@SheliakLyr First of all, thank you for reporting this problem. It sounds like a very important thing to resolve.

I will have to double-check this to make absolutely sure; both with respect to discrepancy between implementation and spec, and to verify whether there is difference between different parts of implementation (BigDecimal, BigInteger and byte[]).

My first thought is that the explanation in spec is ambiguous with respect to alignment across bytes (in which byte is padding) vs alignment within byte (which bits contain padding), and that implementation works as intended, but explanation is not good. If so, it should clearly state that within padded byte, most-significant bit is not used, but that padding is in Least-Significant BIts.

I hope this is the case (and as I said, I need to read it carefully). If there is real disagreement we have a more significant issue and if so must consider backwards compatibility.

cowtowncoder commented 2 years ago

@SheliakLyr your finding is correct and the specification incorrectly claims padding is in Least-Significant Bits, but the implementation pads MSB instead. I will need to update specification to correct this mistake.

cowtowncoder commented 2 years ago

Ok I updated:

https://github.com/FasterXML/smile-format-specification/blob/master/smile-specification.md

with what should be more correct description. I am not entirely happy with the description so any PRs there would be appreciated.

Still, the behavior of implementation, as documented in spec v1.0.5 is the correct one due to practical reasons (obviously having had incorrect specification for years is not ideal).