JSON->CBOR using decimal fractions?

wojdyr commented 6 years ago

I have long arrays of short numbers in JSON. To give a small example:

[3.1, 15.2, -80.6, 40.7]

cbor.me converts it to 37 bytes (more than in the input!):

84                     # array(4)
   FB 4008CCCCCCCCCCCD # primitive(4614162998222441677)
   FB 402E666666666666 # primitive(4624746457346762342)
   FB C054266666666666 # primitive(13858744174572365414)
   FB 404459999999999A # primitive(4630924833085561242)

I think that with decimal fractions the result would be more concise. Is there any ready-to-use converter that automatically uses decimal fractions where it makes sense?
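
For reference, the 37-byte figure can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; `encode_float_array` is a hypothetical helper written for this illustration, not part of any CBOR library, and it only handles arrays of fewer than 24 items.

```python
import json
import struct

def encode_float_array(values):
    """Encode a short list (< 24 items) of floats as a CBOR array of binary64 values."""
    if len(values) >= 24:
        raise ValueError("sketch only handles arrays shorter than 24 items")
    out = bytes([0x80 | len(values)])  # major type 4 (array), length in the low 5 bits
    for v in values:
        # 0xFB announces a 64-bit IEEE 754 float, big-endian
        out += b"\xfb" + struct.pack(">d", v)
    return out

text = "[3.1, 15.2, -80.6, 40.7]"
encoded = encode_float_array(json.loads(text))
print(len(text), len(encoded))  # 24 37
print(encoded.hex().upper())
```

Each number costs 9 bytes (one initial byte plus eight bytes of binary64), which is why the CBOR output ends up larger than the 24-character JSON text.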

cabo commented 6 years ago

Interesting. First of all you would need a JSON parser that interprets the numbers as decimal, not as binary64 as is usual in the JavaScript (but not necessarily JSON) world. Second, you would need to have a CBOR implementation that preserves decimal numbers. I just noticed that cbor-ruby doesn't do that, but that would be easy to add. Where do you get these short decimal numbers from? (If they are really measurements, you could also convert them to, say 16-bit floating point.)
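
As an illustration of the first step, Python's standard `json` module can be told to parse numbers as decimals instead of binary64 via its `parse_float` hook. The sketch below is one possible approach, not what cbor-ruby or json2cbor.rb does; `to_mantissa_exponent` is a hypothetical helper that produces the (mantissa, exponent) pair a decimal-fraction encoder would need.

```python
import json
from decimal import Decimal

def to_mantissa_exponent(d):
    """Split a finite Decimal into an integer mantissa and a base-10 exponent."""
    sign, digits, exponent = d.as_tuple()
    mantissa = int("".join(map(str, digits)))
    return (-mantissa if sign else mantissa, exponent)

# parse_float is applied to every JSON number that has a fraction or exponent part
values = json.loads("[3.1, 15.2, -80.6, 40.7]", parse_float=Decimal)
pairs = [to_mantissa_exponent(v) for v in values]
print(pairs)  # [(31, -1), (152, -1), (-806, -1), (407, -1)]
```

Note that `parse_float` does not touch JSON integers, which still arrive as `int`, so a real converter would have to handle both cases.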

wojdyr commented 6 years ago

The real numbers are a bit bigger than what I wrote, typically 3-5 digits. They describe the 3D structure of macromolecules. The native format used for these files is not JSON, but it can be converted to JSON. The files look like this: https://files.rcsb.org/view/5GK0.cif (scroll to somewhere in the middle of the file)

Each column in the big table would be a separate array.

The Protein Data Bank has 100,000+ such files, with a total gzipped size of more than 30 GB. When these CIF files are converted to JSON (each column as an array), the uncompressed size is about the same, but writing the data column-wise makes it compress almost 2x better. I was wondering if converting it to CBOR would further decrease the size. Using json2cbor.rb I get half of the uncompressed JSON size, but after gzipping, the two sizes are similar.

Zegnat commented 4 years ago

Bumped into this ticket a little randomly while looking into CBOR for other reasons and thought it was interesting.

I am not sure whether a conversion from JSON is a good way to go, as all JSON parsers I know use floating-point numbers by default. But if you were to parse the linked CIF file directly, you would have several options.

Taking a random triplet 8.169, 57.419, 85.998 (atom 3211):

The shortest bytes-wise encoding would be to encode them as integers.

If you know that Cartn_x, Cartn_y, and Cartn_z (as the values are labelled in the CIF file) are always expressed with 3 decimals, the serialised data file could just encode 8169, and the consumer would have to apply the ⨉10⁻³ scaling.

Thus in CBOR diagnostic notation: [8169, 57419, 85998], 12 bytes: 83191FE919E04B1A00014FEE.

A longer option, but still shorter than full floats, is to use CBOR’s notation for decimal fractions. This is basically the same thing as writing all the digits as integers, but we embed the ⨉10⁻³ instruction within the CBOR.

In CBOR diagnostic notation this would be a three-element array in which every item is a two-element array [exponent, mantissa] wrapped in tag 4: [4([-3,8169]), 4([-3,57419]), 4([-3,85998])]. This comes out to 21 bytes: 83C48222191FE9C4822219E04BC482221A00014FEE.
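
Both encodings of the triplet can be checked with a small hand-rolled encoder. This is a sketch under stated assumptions: all function names are hypothetical and written only for this example, and the encoder covers only integers with magnitude below 2**32, short arrays, and tag 4.

```python
def encode_head(major, value):
    """Encode a CBOR initial byte plus argument for a non-negative value below 2**32."""
    if value < 24:
        return bytes([major << 5 | value])
    for extra, size in ((24, 1), (25, 2), (26, 4)):
        if value < 1 << (8 * size):
            return bytes([major << 5 | extra]) + value.to_bytes(size, "big")
    raise ValueError("sketch only handles arguments below 2**32")

def encode_int(n):
    # Major type 0 holds non-negative integers; major type 1 stores -1 - n.
    return encode_head(0, n) if n >= 0 else encode_head(1, -1 - n)

def encode_array(items):
    # Major type 4: item count followed by the already-encoded items.
    return encode_head(4, len(items)) + b"".join(items)

def encode_decimal_fraction(mantissa, exponent):
    # Tag 4 (major type 6) wraps the two-element array [exponent, mantissa].
    return encode_head(6, 4) + encode_array([encode_int(exponent), encode_int(mantissa)])

scaled = [8169, 57419, 85998]  # 8.169, 57.419, 85.998 multiplied by 1000
as_ints = encode_array([encode_int(n) for n in scaled])
as_fractions = encode_array([encode_decimal_fraction(n, -3) for n in scaled])

print(len(as_ints), as_ints.hex().upper())
# 12 83191FE919E04B1A00014FEE
print(len(as_fractions), as_fractions.hex().upper())
# 21 83C48222191FE9C4822219E04BC482221A00014FEE
```

The 9 extra bytes of the decimal-fraction form are exactly three copies of the `C4 82 22` prefix, one per value.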

I am not an expert on gzip, but the second option (decimal fractions) may also compress well, because it repeats the same three bytes before every value: C4 82 22 (translated: tag 4, two-element array, exponent -3).
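
One way to test that guess is to compress both encodings of the same column and compare sizes. The sketch below uses zlib (the DEFLATE implementation behind gzip) on synthetic, uniformly random values, which is an assumption made purely for illustration: real coordinate columns are far from uniform, so the numbers it prints say little about actual PDB data. The `head` helper is hypothetical, and only non-negative values are generated to keep the example short.

```python
import random
import struct
import zlib

def head(major, value):
    """CBOR initial byte plus argument, for non-negative values below 2**32."""
    if value < 24:
        return bytes([major << 5 | value])
    for extra, size in ((24, 1), (25, 2), (26, 4)):
        if value < 1 << (8 * size):
            return bytes([major << 5 | extra]) + value.to_bytes(size, "big")
    raise ValueError("sketch only handles arguments below 2**32")

random.seed(0)
# Synthetic column: 10,000 values with three decimals, stored as integer mantissas.
mantissas = [random.randrange(100000) for _ in range(10000)]

# Same column as binary64 floats (0xFB prefix) ...
as_doubles = head(4, len(mantissas)) + b"".join(
    b"\xfb" + struct.pack(">d", m / 1000) for m in mantissas)
# ... and as decimal fractions: tag 4, array(2), exponent -3, then the mantissa.
as_fractions = head(4, len(mantissas)) + b"".join(
    b"\xc4\x82\x22" + head(0, m) for m in mantissas)

for name, data in (("binary64", as_doubles), ("decimal fractions", as_fractions)):
    print(name, len(data), len(zlib.compress(data, 9)))
```

Running the same comparison on a column extracted from a real CIF file would give a more meaningful answer than the synthetic data used here.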

cbor / cbor.github.io

JSON->CBOR using decimal fractions? #33