cbor / cbor.github.io

cbor.io web site

JSON->CBOR using decimal fractions? #33

Open wojdyr opened 6 years ago

wojdyr commented 6 years ago

I have long arrays of short numbers in JSON. To give a small example:

[3.1, 15.2, -80.6, 40.7]

cbor.me converts it to 37 bytes (more than in the input!):

84                     # array(4)
   FB 4008CCCCCCCCCCCD # primitive(4614162998222441677)
   FB 402E666666666666 # primitive(4624746457346762342)
   FB C054266666666666 # primitive(13858744174572365414)
   FB 404459999999999A # primitive(4630924833085561242)

I think that with decimal fractions the result would be more concise. Is there any ready-to-use converter that automatically uses decimal fractions where it makes sense?
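For a sense of the potential saving, here is a minimal hand-rolled sketch in Python. It is not an existing converter: the helper names (`head`, `encode_int`, `encode_decimal`, `json_to_cbor`) are made up for illustration. It assumes a flat JSON array of numbers and uses the CBOR decimal fraction encoding (tag 4 around a two-element array `[exponent, mantissa]`). It relies on `json.loads(..., parse_float=Decimal)` so the digits are never routed through binary64.

```python
import json
from decimal import Decimal

def head(major: int, value: int) -> bytes:
    """Encode a CBOR initial byte plus its argument (shortest form)."""
    if value < 24:
        return bytes([(major << 5) | value])
    for info, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if value < (1 << (8 * size)):
            return bytes([(major << 5) | info]) + value.to_bytes(size, "big")
    raise ValueError("bignums are out of scope for this sketch")

def encode_int(n: int) -> bytes:
    # Major type 0 for n >= 0, major type 1 with argument -1 - n for n < 0.
    return head(0, n) if n >= 0 else head(1, -1 - n)

def encode_decimal(d: Decimal) -> bytes:
    sign, digits, exponent = d.as_tuple()
    mantissa = int("".join(map(str, digits)))
    if sign:
        mantissa = -mantissa
    # Tag 4 (major type 6) wraps the array [exponent, mantissa] (major type 4).
    return head(6, 4) + head(4, 2) + encode_int(exponent) + encode_int(mantissa)

def json_to_cbor(text: str) -> bytes:
    # parse_float=Decimal keeps "3.1" as the decimal 3.1, not a binary64 value.
    values = json.loads(text, parse_float=Decimal)
    out = head(4, len(values))
    for v in values:
        out += encode_decimal(v) if isinstance(v, Decimal) else encode_int(v)
    return out

encoded = json_to_cbor("[3.1, 15.2, -80.6, 40.7]")
print(len(encoded), encoded.hex())
```

For the four numbers above this comes to 23 bytes instead of 37: one byte for the array header, then 5 or 6 bytes per number.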

cabo commented 6 years ago

Interesting. First of all, you would need a JSON parser that interprets the numbers as decimal, not as binary64 as is usual in the JavaScript (but not necessarily JSON) world. Second, you would need a CBOR implementation that preserves decimal numbers. I just noticed that cbor-ruby doesn't do that, but that would be easy to add. Where do you get these short decimal numbers from? (If they are really measurements, you could also convert them to, say, 16-bit floating point.)
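To illustrate the 16-bit floating point option, here is a small Python sketch. It assumes the standard library's `struct` format `e` (IEEE 754 binary16) and writes the CBOR half-precision head byte `0xF9` by hand. The helper names `encode_half` and `decode_half` are made up for this example.

```python
import struct

def encode_half(x: float) -> bytes:
    # 0xF9 = major type 7, additional info 25: a big-endian binary16 follows.
    return b"\xf9" + struct.pack(">e", x)

def decode_half(data: bytes) -> float:
    if data[0] != 0xF9:
        raise ValueError("not a half-precision float")
    return struct.unpack(">e", data[1:3])[0]

values = [3.1, 15.2, -80.6, 40.7]
# 0x84 is the header for an array of four items.
encoded = b"\x84" + b"".join(encode_half(v) for v in values)
# Absolute error introduced by rounding each value to binary16.
errors = [abs(decode_half(encode_half(v)) - v) for v in values]
print(len(encoded), errors)
```

This takes 3 bytes per number (13 bytes for the array), but it is lossy: binary16 has an 11-bit significand, roughly three significant decimal digits, so values with more digits will not round-trip exactly.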

wojdyr commented 6 years ago

The real numbers are a bit bigger than what I wrote, typically 3-5 digits. They describe the 3D structure of macromolecules, and the native format used for these files is not JSON, but it can be converted to JSON. The files look like this: https://files.rcsb.org/view/5GK0.cif (scroll to somewhere in the middle of the file)

Each column in the big table would be a separate array.

The Protein Data Bank has 100,000+ such files; the total gzipped size is > 30GB. When these CIF files are converted to JSON (each column as an array), the uncompressed size is about the same, but writing it column-wise makes it compress almost 2x better. I was wondering if converting it to CBOR would further decrease the size. Using json2cbor.rb I get half of the uncompressed JSON size, but after gzipping, the two sizes are similar.

Zegnat commented 4 years ago

Bumped into this ticket a little randomly while looking into CBOR for other reasons and thought it was interesting.

I am not sure whether a conversion from JSON is a good way to go, as by default all JSON parsers I know will use floating-point numbers. But if you were to parse the linked CIF file directly, you have several options.

Taking a random triplet 8.169, 57.419, 85.998 (atom 3211):

I am not an expert on gzip, but the second decimal fractions option may compress favourably as well because it repeats the same three bytes often: C4 82 22 (translated: tag 4, array of two items, exponent -3).
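The shared prefix can be checked with a short Python sketch. It hand-encodes the triplet as decimal fractions with the exponent fixed at -3 for the whole column. The helper names (`cbor_int`, `fixed_exponent_fraction`) are hypothetical, and the fixed exponent is an assumption of this sketch, not necessarily the exact variant meant above.

```python
from decimal import Decimal

def cbor_int(n: int) -> bytes:
    """Encode an integer as CBOR major type 0 (n >= 0) or 1 (n < 0)."""
    major, arg = (0, n) if n >= 0 else (1, -1 - n)
    if arg < 24:
        return bytes([(major << 5) | arg])
    for info, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if arg < (1 << (8 * size)):
            return bytes([(major << 5) | info]) + arg.to_bytes(size, "big")
    raise ValueError("bignums are out of scope for this sketch")

def fixed_exponent_fraction(text: str, exponent: int = -3) -> bytes:
    # Shift the decimal point so the value becomes an integer mantissa.
    scaled = Decimal(text).scaleb(-exponent)
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{text} needs more than {-exponent} decimal places")
    # 0xC4 = tag 4, 0x82 = array of two items: [exponent, mantissa].
    return b"\xc4\x82" + cbor_int(exponent) + cbor_int(int(scaled))

triplet = ["8.169", "57.419", "85.998"]
encoded = [fixed_exponent_fraction(t) for t in triplet]
for text, item in zip(triplet, encoded):
    print(text, item.hex())
```

Each item starts with `c4 82 22`, followed by the mantissa as a plain CBOR integer, giving 6, 6 and 8 bytes for the three values, against 9 bytes each as binary64.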