getty-zig / getty

A (de)serialization framework for Zig
https://getty.so
MIT License

Add support for standard library types #120

Open ibokuri opened 1 year ago

ibokuri commented 1 year ago

General

If you'd like to see a certain std type gain support in Getty, please leave a comment and I'll add it to the hit list.

Also, feel free to work on any of the types listed below. If you have any questions, you can ask them on our Discord or in this issue.

The Hit List

polykernel commented 1 year ago

How should a data structure with multiple possible serializations be serialized? The motivating example is PriorityQueue: one option is to take elements in the order returned by popping the queue, while another is to walk the queue with an iterator. Furthermore, for data structures such as EnumMultiset, there is no single obvious serialized form (e.g. a multiset can be serialized as a map or as a list).

ibokuri commented 1 year ago

I usually try to follow these rules:

So for PriorityQueue, I'd prefer iterating over it instead of popping values off to avoid modifying the queue.
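To make the trade-off concrete, here is a rough sketch of the two strategies. It is written in Python with `heapq` standing in for Zig's PriorityQueue, and the function names are made up for illustration; none of this is Getty's API.

```python
import heapq
import json

def serialize_by_iteration(heap):
    # Read-only: copies the elements in the heap's internal storage order.
    return json.dumps(list(heap))

def serialize_by_popping(heap):
    # Destructive: pops yield elements in priority order but empty the heap.
    out = []
    while heap:
        out.append(heapq.heappop(heap))
    return json.dumps(out)

queue = [5, 1, 4, 2, 3]
heapq.heapify(queue)

by_iteration = serialize_by_iteration(queue)
size_after_iteration = len(queue)  # the queue is untouched

by_popping = serialize_by_popping(queue)
size_after_popping = len(queue)  # the queue has been drained
```

Iterating leaves the queue intact but emits elements in storage order rather than priority order; popping gives priority order at the cost of consuming the queue (or of copying it first).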

As for EnumMultiset, serializing it as a sequence seems appropriate to me. I usually think of sets as sequences and the doc comment for EnumMultiset states that it's backed by a dense array.

ibokuri commented 1 year ago

@polykernel, after thinking for a bit, I feel like serializing EnumMultisets as maps (something like {"enum_foo": 1}, where 1 is the number of enum_foos in the set) makes more sense. Logically, serializing them as sequences seems nice but practically speaking that pretty much always just results in a ton of unnecessary tokens and parsing time for everybody.

Thoughts on representing EnumMultisets as maps instead?
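For comparison, a small sketch of the two encodings side by side. It uses Python's `collections.Counter` as a stand-in for EnumMultiset and JSON as the output format; the key names and counts are made-up example data, and this is not how Getty itself is implemented.

```python
import json
from collections import Counter

# Hypothetical multiset contents; Counter stands in for an EnumMultiset.
multiset = Counter({"enum_foo": 3, "enum_bar": 1})

# Sequence form: one token per element, with duplicates repeated.
as_seq = json.dumps(sorted(multiset.elements()))

# Map form: one entry per distinct key, with its count as the value.
as_map = json.dumps(dict(multiset), sort_keys=True)
```

The sequence grows with every additional copy of a key, whereas the map only grows when a new distinct key appears.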

polykernel commented 1 year ago

I think it is sensible to represent EnumMultisets as maps by default, given that multisets are usually represented as maps in practice, but it might be useful in some cases to serialize them as sequences. Perhaps there could be a block-specific attribute to control the serialized format, but I am not sure whether block-specific attributes are desirable or scalable.

ibokuri commented 1 year ago

Ahh okay, I haven't worked with multisets often so I wasn't aware that they're usually maps. I'll note that down in the original post.

polykernel commented 1 year ago

I think it is sensible to represent EnumMultisets as maps by default given multisets are usually represented as maps in practice

@ibokuri Sorry, I worded this terribly. By "represented as maps", I meant implemented as (or similarly to) maps, not represented as maps in serialized form. On second thought, I also overgeneralized: I know that in C++ (at least in libstdc++ and libc++) multiset is implemented like map except that the stored value is the same as the key, but I am definitely not qualified to say what the usual implementation strategy for multisets is in general.

After some more pondering, I came up with a list comparing the advantages and disadvantages of seq and map serialization. Please let me know if there are points I missed.

# Seq
+ Preserves semantics: a multiset is semantically a type of unordered collection
- Redundant processing: multiplicity is only implicit in the serialized form, so the
  receiving end has to re-count elements to recover it
- Succinctness: the size of the encoding is proportional to the number of values in the multiset

# Map
+ Succinctness: the size of the encoding is proportional to the number of unique values in the multiset
+ Readability: a key-value mapping is more readable than a sequence with unspecified ordering
- Breaks semantics: a multiset is not semantically equivalent to a map, but rather an unordered
  collection with additional information

Based on the comparison, it seems serializing to maps is the better option. Furthermore, it may be worthwhile to support deserializing from maps as well. I will take a shot at implementing this when I have some time.
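
As a rough illustration of what deserializing from each form involves, here is a sketch in Python with `collections.Counter` standing in for EnumMultiset and JSON as the input format. The function names and the validation rule are assumptions made for the example, not Getty's API.

```python
import json
from collections import Counter

def multiset_from_map(text):
    # Map form: counts are stored explicitly, so they can be used directly.
    counts = json.loads(text)
    for n in counts.values():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(n, bool) or not isinstance(n, int) or n < 0:
            raise ValueError("counts must be non-negative integers")
    return Counter(counts)

def multiset_from_seq(text):
    # Sequence form: multiplicities must be recovered by re-counting elements.
    return Counter(json.loads(text))

from_map = multiset_from_map('{"enum_foo": 2, "enum_bar": 1}')
from_seq = multiset_from_seq('["enum_foo", "enum_bar", "enum_foo"]')
same = from_map == from_seq
```

Both inputs produce the same multiset; the map form just skips the counting step, at the cost of having to validate the counts.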