This is a prototype suggestion to use the protobuf format for serializing this data. Serializing to protocol buffers gives significant compression: in a test on the Canadian allocations data, the protobuf encoding came out to 28.3% of the size of plain JSON. The drawback of protobufs is that they strip the field names (which is where most of the compression comes from), so the field definitions have to be stored separately in a .proto file.
I don't think we're limited by space right now, but it's worth considering and documenting as we design the API and explore our options.
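To see why stripping the field names buys most of the compression, note that a JSON array repeats every key in every record. A toy illustration (the record shape and values here are made up, not the real allocations data):

```python
import json

# Hypothetical records standing in for the allocations data;
# every record repeats the same three field names.
records = [{"region": "CA-ON", "year": 2021, "amount": 1.5}] * 1000
payload = json.dumps(records, separators=(",", ":"))

# Bytes spent on the quoted keys and their colons alone.
keys_only = len('"region":"year":"amount":') * 1000
print(len(payload), keys_only)  # key bytes are over half the payload
```

Protobuf replaces each key with a one-byte field tag, which is why the savings scale with how verbose the field names are.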
Encoding method                        Size (bytes)   Size relative to one-line JSON
JSON, no indentation, no newline       144550         100%
JSON, newline only                     150766         104.3%
JSON, newline + 1-space indentation    199154         138%
JSON, newline + 2-space indentation    247542         173%
protobuf                               40919          28.3%
protobuf, Base64-encoded               54560          37.7%
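The last row of the table follows directly from the Base64 encoding overhead: Base64 emits 4 output characters for every 3 input bytes (with padding), inflating any binary payload by a factor of about 4/3. A quick check, which assumes nothing about the payload contents, only its length:

```python
import base64

# A placeholder blob the same length as the measured protobuf output;
# only the length matters for the Base64 size.
blob = bytes(40919)
encoded = base64.b64encode(blob)
print(len(encoded))  # 54560, i.e. ~4/3 of 40919
```

That 4/3 factor is exactly the jump from 28.3% to 37.7% in the table.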
Pros and cons of using protobuf
Pros
- significantly smaller footprint
- one universal definition file describes the data; there is no ambiguity and no language-specific choices. (JSON + JSON Schema is also language-agnostic, but the schema files tend to be harder to read.)
- protos are forwards and backwards compatible, which makes it easy to evolve the API
- tools exist to generate typed models from the proto files for the most common languages
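The compatibility point can be made concrete with the wire format itself: field names never appear on the wire, only field numbers, and decoders skip numbers they don't recognize. A minimal hand-rolled sketch (varint fields only, not a real protobuf library):

```python
def encode_varint(n):
    # Protobuf varint: 7 bits per byte, high bit set on continuation bytes.
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def decode_varint(buf, i):
    shift = result = 0
    while True:
        b = buf[i]; i += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, i
        shift += 7

def encode_field(field_num, value):
    # Field key = (field number << 3) | wire type; wire type 0 = varint.
    return encode_varint(field_num << 3) + encode_varint(value)

# A "v2" message carries fields 1 and 2; a "v1" decoder only knows field 1.
msg = encode_field(1, 150) + encode_field(2, 7)

def decode_v1(buf):
    known = {}
    i = 0
    while i < len(buf):
        key, i = decode_varint(buf, i)
        field_num = key >> 3
        value, i = decode_varint(buf, i)  # all fields are varints here
        if field_num == 1:
            known[field_num] = value
        # unknown field numbers are simply skipped
    return known

print(decode_v1(msg))  # {1: 150}
```

The old decoder reads the new message without error, which is why adding fields to a .proto is a non-breaking change.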
Cons
- the data is no longer self-contained; we need the proto file to understand it
- the data is no longer easy to read
- model changes require updating the proto definitions, which makes changing the data slightly more inconvenient
- an additional library is needed to read them; for JS this means a 788KB unpacked library, on top of the generated JS code, which for this data takes approximately 35KB
The proto definition
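The actual definition file isn't reproduced here; a hypothetical sketch of its shape (the message and field names below are illustrative only, the real schema would mirror the JSON fields of the allocations data):

```proto
// Hypothetical sketch -- message and field names are placeholders.
syntax = "proto3";

message Allocation {
  string region = 1;
  uint32 year = 2;
  double amount = 3;
}

message AllocationList {
  repeated Allocation allocations = 1;
}
```

Typed models would then be generated with protoc, e.g. via its --python_out or --js_out options.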