apache / fury

A blazingly fast multi-language serialization framework powered by JIT and zero-copy.
https://fury.apache.org/
Apache License 2.0

[Go] Optimize collection serialization protocol by homogenization #923 #1937

Open chaokunyang opened 2 weeks ago

chaokunyang commented 2 weeks ago

Feature Request

Optimize the collection serialization protocol by homogenization. https://fury.apache.org/docs/specification/fury_xlang_serialization_spec/#list is the formalized spec:

All Map serializers must extend AbstractMapSerializer.

Format:

| length(unsigned varint) | key value chunk data | ... | key value chunk data |

map key-value chunk data

Map iteration is too expensive, so Fury won't compute the header up front as it does for lists, since that introduces considerable overhead. Users can use the MapFieldInfo annotation to provide the header in advance. Otherwise, Fury will use the first key-value pair to predict the header optimistically, and update the chunk header if the prediction fails at some pair.

Fury will serialize the map chunk by chunk, every chunk has 255 pairs at most.

|    1 byte      |     1 byte     | variable bytes  |
+----------------+----------------+-----------------+
| chunk size: N  |    KV header   |   N*2 objects   |

KV header:

If streaming write is enabled, Fury can't go back and update an already-written chunk size. In such cases, the map key-value data format will be:

|    1 byte      | variable bytes  |
+----------------+-----------------+
|    KV header   |   N*2 objects   |

The KV header will be the header marked by MapFieldInfo in Java. For languages such as golang, it can be computed in advance for non-interface types in most cases. The implementation can generate different deserialization code based on the read header, and look up the generated code from a linear map/list.
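One way to sketch that lookup in Go, assuming a flat array indexed by the 1-byte header value (the reader functions here are stand-ins for generated deserialization code; names are hypothetical, not Fury's API):

```go
package main

import "fmt"

// chunkReader stands in for a generated per-header deserializer. Real
// generated code would decode keys/values; here each function just reports
// which specialization it is.
type chunkReader func() string

// readers is a flat table indexed by the 1-byte KV header, so dispatch is a
// single bounds-free array load rather than an interface call or hash lookup.
var readers [256]chunkReader

func init() {
	for h := 0; h < 256; h++ {
		h := h // capture per-iteration value
		readers[h] = func() string { return fmt.Sprintf("reader for header 0x%02x", h) }
	}
}

func main() {
	header := byte(0x03) // e.g. a header read from the chunk
	fmt.Println(readers[header]())
}
```

Because the header is a single byte, the table stays small and the common case (one header per map) dispatches once per chunk rather than once per pair.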

Why serialize chunk by chunk?

Since Fury uses the first key-value pair to predict the header optimistically, it can't know in advance how many pairs share the same meta (key ref tracking, nullable keys and so on). If we didn't write chunk by chunk with a max chunk size, we would have to reserve at least X bytes as a placeholder to later write back the number of pairs that share the same meta, where X is the number of bytes in the varint encoding of the map size.

Most map sizes are smaller than 255, so if all pairs share the same meta there will be just one chunk. This is common in golang/rust, where objects are not references by default.

Also, if only one or two keys have different meta, we can put them into a separate chunk, so that most pairs can still share meta.
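That splitting rule can be sketched as grouping consecutive pairs with identical meta, starting a new chunk whenever the meta changes or 255 pairs are reached (the `pairMeta` fields below are hypothetical; the real header bits are defined by the spec):

```go
package main

import "fmt"

// pairMeta is the per-pair serialization meta that must be uniform within a
// chunk (hypothetical fields standing in for the real KV header bits).
type pairMeta struct {
	keyNull, valNull bool
}

type pair struct {
	meta pairMeta
	k, v int
}

// chunkBoundaries groups consecutive pairs that share the same meta, cutting
// a new chunk when the meta changes or the 255-pair cap is hit.
func chunkBoundaries(pairs []pair) [][]pair {
	var chunks [][]pair
	for i := 0; i < len(pairs); {
		j := i + 1
		for j < len(pairs) && pairs[j].meta == pairs[i].meta && j-i < 255 {
			j++
		}
		chunks = append(chunks, pairs[i:j])
		i = j
	}
	return chunks
}

func main() {
	pairs := []pair{
		{pairMeta{false, false}, 1, 10},
		{pairMeta{false, false}, 2, 20},
		{pairMeta{false, true}, 3, 0}, // the one null-value pair gets its own chunk
		{pairMeta{false, false}, 4, 40},
	}
	fmt.Println(len(chunkBoundaries(pairs))) // 3 chunks
}
```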

The implementation can compare the accumulated read count against the map size to decide whether to read more chunks.

Is your feature request related to a problem? Please describe

No response

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context
