CODARcode / MGARD

MGARD: MultiGrid Adaptive Reduction of Data
Apache License 2.0

Serialization Library Choice #157

Closed ben-e-whitney closed 2 years ago

ben-e-whitney commented 3 years ago

Background

For the past couple of weeks I have been working on the MGARD file format. My original plan was something like this:

  1. Write code to read and write the basic types used in the header (uint32_t, double, etc.). Figure out how to handle endianness and type representations that might differ from one computer to the next.
  2. Work out whether we need to change @JieyangChen7's header in any way to work with the code or to allow for future changes.
  3. Write code to read and write the header. This would, in effect, specify the header.
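To give a sense of what step 1 involves, here is a rough sketch of endianness-safe reading and writing of a single `uint32_t`. The helper names are invented for this illustration; this is not MGARD code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Append `value` to `buf` as little-endian bytes, regardless of the
// host machine's byte order.
void write_uint32_le(std::vector<unsigned char> &buf, std::uint32_t value) {
    for (std::size_t i = 0; i < 4; ++i) {
        buf.push_back(static_cast<unsigned char>((value >> (8 * i)) & 0xffu));
    }
}

// Reconstruct a little-endian `uint32_t` from four bytes, again
// independent of the host byte order.
std::uint32_t read_uint32_le(const unsigned char *bytes) {
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        value |= static_cast<std::uint32_t>(bytes[i]) << (8 * i);
    }
    return value;
}
```

Even this simplest case forces decisions (byte order, fixed-width types), and floating-point representations are considerably harder, which is part of the argument below for not handling serialization ourselves.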

After reading a bunch about how people handle serialization, I think we should not try to handle this ourselves. Brief argument:

  1. We can avoid the pitfalls we know about (endianness, etc.) by using an existing serialization library.
  2. The existence of so many libraries (partial list below), as well as the libraries' general sophistication, suggests that serialization is not such an easy task, so there are probably pitfalls we aren't even aware of (#155, for example) which we can also avoid by using an existing library.
  3. Some of these libraries use an approach which would be very difficult to implement ourselves but which would, I think, serve our needs well.

That approach works like this:

  1. Write a specification for each of the header blocks in the format used by a library.
  2. Use the library to generate reading and writing code for each of the blocks from the specification.
  3. Write some code to handle the header logic (what do we expect when this enum takes this value, etc.).

I think this will work well for us. We just have to pick which library to use.
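For illustration, a header block specified in Protocol Buffers' schema language might look roughly like the following. Every message, enum, and field name here is invented for the sketch; this is not a proposal for the actual MGARD header:

```proto
syntax = "proto3";

package mgard.pb;

// Hypothetical header block describing quantization parameters.
message Quantization {
  // How the quantization bin width is determined.
  enum Type {
    NONE = 0;
    LINEAR = 1;
  }
  Type type = 1;
  double bin_width = 2;
}
```

The library's compiler (`protoc`, in the Protocol Buffers case) would then generate the C++ reading and writing classes from files like this (step 2), leaving only the header logic of step 3 to write by hand.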

Libraries

Here are some options I've found. As a very rough measure of popularity, I've listed how many stars on GitHub each project has (if it has a repository there). Let me know what looks good to you. I vote for Protocol Buffers. I'll start learning it while I wait to see if anyone objects.

cereal

I played around with this library and liked it. However, we probably can't use it because of this warning in its documentation:

cereal was not designed to be a robust long term storage solution - it is your responsibility to ensure compatibility between saved and loaded cereal archives. It is recommended that you use the same version of cereal for both loading and saving data.

2,900 stars on GitHub.

Protocol Buffers

To use, write .proto files and compile them to C++ reading/writing classes. Other languages are supported, too. It appears we can read and write message-by-message, which is good. Lots of thought has been put into compatibility. Not intended for serialization of arbitrary classes; instead it defines the representable data structures (approximately PODs, plus enums), which is probably a good thing for a file format. 51,200 stars on GitHub.

Boost Serialization

Allows for value-by-value (de)serialization and versioning for classes. I imagine it's pretty good, since it's in Boost, but it's a big dependency to add. 83 stars on GitHub (probably means nothing, since Boost is split into a lot of small repositories).

s11n

Development seems to be on hold. Quite possibly the released versions would serve our needs, but I suppose we might as well use something that is actively maintained.

FlatBuffers

From the FlatBuffers documentation:

Protocol Buffers is indeed relatively similar to FlatBuffers, with the primary difference being that FlatBuffers does not need a parsing/unpacking step to a secondary representation before you can access data, often coupled with per-object memory allocation. The code is an order of magnitude bigger, too. Protocol Buffers has no optional text import/export.

Seems to emphasize memory efficiency (no separate parsing step). Possibly not a great fit for the incremental approach we might want, but I don't know. 16,900 stars on GitHub.

MessagePack

Seems slightly low-level compared to the other options: you have to manually pack and unpack structs. 5,800 stars on GitHub.

Apache Thrift

Uses the same schema-compilation approach: you write out a schema and compile it to get a library you can call. Seems geared toward web development. 8,700 stars on GitHub.

Apache Avro

Seems to be focused on embedding message formats in the data and, like Thrift, on web development (RPC). 2,000 stars on GitHub.

Cap'n Proto

Like FlatBuffers, the structures are read directly from memory (no separate parsing step). The author was also involved in writing Protocol Buffers. There is a comparison with Protobuf, Simple Binary Encoding, and FlatBuffers. 8,500 stars on GitHub.

ben-e-whitney commented 3 years ago

@gqian-coder, it won't let me assign you, for some reason.

JieyangChen7 commented 3 years ago

@ben-e-whitney Thanks for summarizing all these libraries. Although I'm not very familiar with using Protocol Buffers, I think it is a good choice for us since it is well maintained by Google and I know a lot of libraries are using it.