Open ibokuri opened 1 year ago
How should a data structure with more than one plausible serialized form be serialized? The motivating example is PriorityQueue: one possible serialization takes elements in the order returned by popping the queue, while another iterates over the queue with an iterator. Furthermore, for data structures such as EnumMultiset, there is no single obvious serialized form (i.e., a multiset can be serialized as a map or as a list).
I usually try to follow these rules:
So for PriorityQueue, I'd prefer iterating over it instead of popping values off, to avoid modifying the queue.
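To illustrate the trade-off, here is a minimal sketch in Python rather than Zig, using the standard heapq module as a stand-in for PriorityQueue. The function names are made up for illustration and are not part of Getty or of Zig's std. Iterating leaves the queue intact but yields elements in internal storage order, while popping yields priority order but empties the queue.

```python
import heapq


def serialize_by_iterating(heap):
    """Non-destructive: emit elements in the heap's internal storage order."""
    return list(heap)


def serialize_by_popping(heap):
    """Destructive: emit elements in priority order, emptying the heap."""
    out = []
    while heap:
        out.append(heapq.heappop(heap))
    return out


queue = [5, 1, 4, 2, 3]
heapq.heapify(queue)

iterated = serialize_by_iterating(queue)
remaining_after_iterating = len(queue)  # the queue is untouched

popped = serialize_by_popping(queue)
remaining_after_popping = len(queue)  # the queue has been drained
```

A serializer that wants priority order without mutating the caller's queue would have to pop from a copy, which costs an extra allocation.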
As for EnumMultiset, serializing it as a sequence seems appropriate to me. I usually think of sets as sequences, and the doc comment for EnumMultiset states that it's backed by a dense array.
@polykernel, after thinking for a bit, I feel like serializing EnumMultisets as maps (something like {"enum_foo": 1}, where 1 is the number of enum_foos in the set) makes more sense. Logically, serializing them as sequences seems nice, but practically speaking that pretty much always results in a ton of unnecessary tokens and parsing time for everybody.
Thoughts on representing EnumMultisets as maps instead?
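For a concrete comparison of the two forms, here is a small Python sketch that uses collections.Counter as a stand-in for EnumMultiset and JSON as the output format. The element names and counts are made up, and this is not how Getty itself emits data.

```python
import json
from collections import Counter

# A multiset holding three enum_foo values and one enum_bar value.
multiset = Counter({"enum_foo": 3, "enum_bar": 1})

# Map form: one entry per distinct element, with its multiplicity as the value.
as_map = json.dumps(dict(multiset), sort_keys=True)

# Sequence form: every occurrence is written out individually.
as_seq = json.dumps(sorted(multiset.elements()))

print(as_map)  # {"enum_bar": 1, "enum_foo": 3}
print(as_seq)  # ["enum_bar", "enum_foo", "enum_foo", "enum_foo"]
```

The map form grows with the number of distinct elements, while the sequence form grows with the total number of elements, so the gap widens as multiplicities increase.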
I think it is sensible to represent EnumMultisets as maps by default given multisets are usually represented as maps in practice, but it might be useful in some cases to serialize them as sequences. Perhaps there could be a block-specific attribute to control the serialized format, but I am not sure whether block-specific attributes are desirable or scalable.
Ahh okay, I haven't worked with multisets often so I wasn't aware that they're usually maps. I'll note that down in the original post.
I think it is sensible to represent EnumMultisets as maps by default given multisets are usually represented as maps in practice
@ibokuri Sorry, I worded this terribly. By "represented as maps", I actually meant implemented as (or similarly to) maps, rather than represented as maps in serialized form. On second thought, I realized I overgeneralized the statement. I know that in C++ (at least in libstdc++ and libc++), multiset is implemented like map except that the value being stored is the same as the key, but I am definitely not qualified to assess what the usual implementation strategy for multisets is in general.
After some more pondering, I came up with a list comparing the advantages and disadvantages of seq and map serialization. Please let me know if there are points I missed.
# Seq
+ Preserves semantics: a multiset is semantically a type of unordered collection
- Redundant processing: multiplicity information is not stored explicitly in the serialized form,
which requires unnecessary processing by the receiving end to recover
- Succinctness: the size of the encoding is proportional to the number of values in the multiset
# Map
+ Succinctness: the size of the encoding is proportional to the number of unique values in the multiset
+ Readability: a key-value mapping is more readable than a sequence with unspecified ordering
- Breaks semantics: a multiset is not semantically equivalent to a map, but rather an unordered
collection with additional information
Based on the comparison, it seems serializing to maps is the better option. Furthermore, it may be worthwhile to support deserializing from maps as well. I will take a shot at implementing this when I have some time.
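As a rough sketch of what deserializing from each form involves, here is a Python illustration with collections.Counter standing in for EnumMultiset. The function names and the validation rule are assumptions made for this example, not Getty's API. From a map, the counts are read directly. From a sequence, the receiving end has to recount every element.

```python
import json
from collections import Counter


def multiset_from_map(text):
    """Rebuild a multiset from a JSON object of element -> count."""
    counts = json.loads(text)
    for key, count in counts.items():
        # Reject anything that is not a non-negative integer count.
        if isinstance(count, bool) or not isinstance(count, int) or count < 0:
            raise ValueError(f"invalid count for {key!r}: {count!r}")
    return Counter(counts)


def multiset_from_seq(text):
    """Rebuild a multiset from a JSON array by counting each occurrence."""
    return Counter(json.loads(text))


from_map = multiset_from_map('{"enum_bar": 1, "enum_foo": 3}')
from_seq = multiset_from_seq('["enum_bar", "enum_foo", "enum_foo", "enum_foo"]')
```

Both calls produce the same multiset. The map path does work proportional to the number of distinct elements, while the sequence path touches every occurrence.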
General
If you'd like to see a certain std type gain support in Getty, please leave a comment and I'll add it to the hit list. Also, feel free to work on any of the types listed below. If you have any questions, you can ask them on our Discord or in this issue.
The Hit List