FasterXML / smile-format-specification

New home for Smile format (https://en.wikipedia.org/wiki/Smile_(data_interchange_format))

Understanding how Smile encodes 32-bit floats #7

Closed: jviotti closed this issue 3 years ago

jviotti commented 3 years ago

I have the following test JSON document which I'm encoding using pysmile.

{
  "tags": [],
  "tz": -25200,
  "days": [ 1, 1, 2, 1 ],
  "coord": [ -90.0715, 29.9510 ],
  "data": [
    { "name": "ox03", "staff": true },
    {
      "name": null,
      "staff": false,
      "extra": { "info": "" }
    },
    { "name": "ox03", "staff": true },
    {}
  ]
}

The encoded result is the following:

00000000: 3a29 0a03 fa83 6461 7461 f8fa 836e 616d  :)....data...nam
00000010: 6543 6f78 3033 8473 7461 6666 23fb fa84  eCox03.staff#...
00000020: 6578 7472 61fa 8369 6e66 6f20 fb83 6e61  extra..info ..na
00000030: 6d65 2184 7374 6166 6622 fbfa 836e 616d  me!.staff"...nam
00000040: 6543 6f78 3033 8473 7461 6666 23fb fafb  eCox03.staff#...
00000050: f981 747a 2406 139f 8364 6179 73f8 c2c2  ..tz$....days...
00000060: c4c2 f984 636f 6f72 64f8 281c 4950 157c  ....coord.(.IP.|
00000070: 2826 373e 0f04 f983 7461 6773 f8f9 fb    (&7>....tags...

From the payload above, the coord float array is encoded like this:

8463 6f6f 7264 f828 1c49 5015 7c28 2637 3e0f 04f9
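
To make the layout of that slice easier to see, here is a minimal sketch that splits it into the key name and the two 5-byte float payloads. The token values (0x80-0xBF for short ASCII keys, 0xF8/0xF9 for array start/end, 0x28 for a 32-bit float) are assumptions taken from a reading of the Smile spec, and `split_coord_field` is a hypothetical helper, not part of pysmile.

```python
# Assumed Smile token values (from a reading of the spec, not from pysmile).
START_ARRAY = 0xF8
END_ARRAY = 0xF9
FLOAT32 = 0x28

# The "coord" slice exactly as it appears in the hex dump above.
FIELD = bytes.fromhex("8463 6f6f 7264 f828 1c49 5015 7c28 2637 3e0f 04f9")


def split_coord_field(data: bytes):
    """Split a short-ASCII-key field holding a float32 array (hypothetical helper)."""
    key_token = data[0]
    if not 0x80 <= key_token <= 0xBF:
        raise ValueError("expected a short ASCII key token")
    key_len = (key_token & 0x3F) + 1  # 0x84 -> 5 bytes ("coord")
    pos = 1
    key = data[pos:pos + key_len].decode("ascii")
    pos += key_len

    if data[pos] != START_ARRAY:
        raise ValueError("expected start-of-array token")
    pos += 1

    payloads = []
    while data[pos] != END_ARRAY:
        if data[pos] != FLOAT32:
            raise ValueError("expected a 32-bit float token")
        # A 32-bit float occupies 5 bytes of 7 bits each after its type token.
        payloads.append(data[pos + 1:pos + 6])
        pos += 6
    return key, payloads


key, payloads = split_coord_field(FIELD)
print(key, [p.hex() for p in payloads])
```

This only handles the exact shape of this slice; it is not a general Smile parser.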

Based on the spec, I don't understand how 1c49 5015 7c and 2637 3e0f 04 represent -90.0715 and 29.9510, respectively. The spec says:

Floating point values (IEEE 32 and 64-bit) are encoded using fixed-length big-endian encoding (7 bits used to avoid use of reserved bytes like 0xFF): Data is "right-aligned", meaning padding is prepended to the first byte (and its MSB).

If the floats are encoded as big-endian, then the most significant bytes are 1c and 26, respectively, which means that the 32-bit floats should be encoded as:

Which doesn't add up. Am I missing something?

jviotti commented 3 years ago

OK, I ended up figuring it out by looking at the code:

-90.0715 = 1100 0010101 1010000 1001001 0011100 (0xc2b4249c)
         = First 7 bits = 0011100 = 0x1c = 00011100
         = Next 7 bits (>> 7) = 1001001 = 0x49 = 01001001
         = Next 7 bits (>> 7) = 1010000 = 0x50 = 01010000
         = Next 7 bits (>> 7) = 0010101 = 0x15 = 00010101
         = Next 7 bits (>> 7) = (111)1100 = 0x7c = 01111100

 29.9510 = 0100 0001111 0111110 0110111 0100110 (0x41ef9ba6)
         = First 7 bits = 0100110 = 0x26 = 00100110
         = Next 7 bits (>> 7) = 0110111 = 0x37 = 00110111
         = Next 7 bits (>> 7) = 0111110 = 0x3e = 00111110
         = Next 7 bits (>> 7) = 0001111 = 0x0f = 00001111
         = Next 7 bits (>> 7) =    0100 = 0x04 = 00000100
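
The hand-worked breakdown above can be checked with a short Python sketch. It reproduces the byte order seen in the pysmile output in this thread (lowest 7 bits first); it is not a statement about what byte order the spec mandates. The function names are hypothetical, and using a sign-extending shift to get the `(111)` bits in the last byte is an assumption about how the encoder arrives at 0x7c.

```python
import struct


def float32_to_7bit_groups(value: float) -> bytes:
    """Pack a float32 into five 7-bit groups, lowest group first (hypothetical helper)."""
    # Reinterpret the IEEE 754 single as a signed 32-bit int so that
    # the right shift below sign-extends, as in the breakdown above.
    (bits,) = struct.unpack(">i", struct.pack(">f", value))
    out = bytearray()
    for _ in range(5):
        out.append(bits & 0x7F)  # keep 7 bits, so the MSB of each byte is 0
        bits >>= 7
    return bytes(out)


def groups_to_float32(groups: bytes) -> float:
    """Inverse of float32_to_7bit_groups (hypothetical helper)."""
    bits = 0
    for i, b in enumerate(groups):
        bits |= (b & 0x7F) << (7 * i)
    bits &= 0xFFFFFFFF  # drop the sign-extension bits above bit 31
    return struct.unpack(">f", struct.pack(">I", bits))[0]


print(float32_to_7bit_groups(-90.0715).hex())  # 1c4950157c
print(float32_to_7bit_groups(29.9510).hex())   # 26373e0f04
print(groups_to_float32(bytes.fromhex("1c4950157c")))
```

Five groups of 7 bits carry 35 bits, so the last group holds only the top 4 bits of the float plus 3 padding bits; the decoder masks those away.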

cowtowncoder commented 3 years ago

@jviotti Apologies for the slow follow-up -- I did see the issue but hadn't had time to go back and read the spec and code myself.

Do you think this makes sense at this point, wrt code and wording of the spec? I think it is important that not only are things correct but also that explanation/description can be understood by developers.

jviotti commented 3 years ago

@cowtowncoder No worries, I totally understand!

> Do you think this makes sense at this point, wrt code and wording of the spec? I think it is important that not only are things correct but also that explanation/description can be understood by developers.

I don't think I would have been able to understand the encoding from the spec alone, without looking at the code. That being said, I think that extending the spec with an example would have been more than enough (maybe one based on one of the floats I hand-encoded above?)

cowtowncoder commented 3 years ago

An example would be a good idea -- do you think you could do a PR for inclusion in the relevant README.md?

jviotti commented 3 years ago

@cowtowncoder I just sent https://github.com/FasterXML/smile-format-specification/pull/8. Let me know what you think!

cowtowncoder commented 3 years ago

Looks good, merged. Thank you for adding the example!