If schema evolution mode is enabled globally when creating Fury and is enabled for the current type, type meta will be
written using one of the following modes. The mode is configured when creating Fury.
Normal mode (meta share not enabled):
If the type meta hasn't been written before, add the type def
to captured_type_defs: captured_type_defs[type def] = map size.
Get the index of the meta in captured_type_defs and write that index as | unsigned varint: index |.
After the serialization of the object graph is finished, Fury starts to write captured_type_defs:
First, set the meta start offset in the Fury header to the current write position.
Then write captured_type_defs one by one:
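# Write the number of non-stub type defs, then each non-stub type def.
buffer.write_var_uint32(len(writing_type_defs) - len(schema_consistent_type_def_stubs))
for type_meta in writing_type_defs:
    if not type_meta.is_stub():
        type_meta.write_type_def(buffer)
# Reset the pending list to the schema consistent stubs only.
writing_type_defs = copy(schema_consistent_type_def_stubs)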
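The normal-mode steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: type defs are modeled as opaque `bytes`, the unsigned varint is assumed to use 7 data bits per byte with a continuation bit, stub handling is left out, and the helper names `write_var_uint32`, `write_type_ref` and `write_captured_type_defs` are hypothetical rather than part of Fury's API.

```python
def write_var_uint32(buf: bytearray, value: int) -> None:
    """Append value as an unsigned varint (assumed: 7 bits per byte, MSB = continue)."""
    if not 0 <= value < 1 << 32:
        raise ValueError("value out of uint32 range")
    while value >= 0x80:
        buf.append((value & 0x7F) | 0x80)
        value >>= 7
    buf.append(value)


def write_type_ref(buf: bytearray, captured_type_defs: dict, type_def: bytes) -> int:
    """Capture type_def if it is new, then write its index as an unsigned varint."""
    # For a new type def the index is the map size before insertion.
    index = captured_type_defs.setdefault(type_def, len(captured_type_defs))
    write_var_uint32(buf, index)
    return index


def write_captured_type_defs(buf: bytearray, captured_type_defs: dict) -> None:
    """Write the number of captured type defs, then each def in capture order."""
    write_var_uint32(buf, len(captured_type_defs))
    for type_def in captured_type_defs:  # dicts preserve insertion order
        buf.extend(type_def)
```

Because the index equals the map size at capture time, a reader that loads the type defs in written order can resolve every index without any extra table.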
Meta share mode: the writing steps are the same as in normal mode, but captured_type_defs is shared across
multiple serializations of different objects. For example, suppose we have a batch to serialize:
captured_type_defs = {}
stream = ...
# add `Type1` to `captured_type_defs` and write `Type1`
fury.serialize(stream, [Type1()])
# add `Type2` to `captured_type_defs` and write `Type2`, `Type1` is written before.
fury.serialize(stream, [Type1(), Type2()])
# `Type1` and `Type2` are written before, no need to write meta.
fury.serialize(stream, [Type1(), Type2()])
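To make the bookkeeping in this example concrete, here is a minimal sketch of a shared context. It assumes type defs can be identified by name; `MetaShareContext` and `serialize_batch` are hypothetical names used for illustration only, not Fury's API.

```python
class MetaShareContext:
    """Holds captured_type_defs across several serialization calls."""

    def __init__(self):
        self.captured_type_defs = {}

    def serialize_batch(self, type_names):
        """Return (indices, newly_written) for one serialization call."""
        indices = []
        newly_written = []
        for name in type_names:
            if name not in self.captured_type_defs:
                # New type: its index is the current map size.
                self.captured_type_defs[name] = len(self.captured_type_defs)
                newly_written.append(name)
            indices.append(self.captured_type_defs[name])
        return indices, newly_written


ctx = MetaShareContext()
first = ctx.serialize_batch(["Type1"])            # Type1 meta is written
second = ctx.serialize_batch(["Type1", "Type2"])  # only Type2 meta is written
third = ctx.serialize_batch(["Type1", "Type2"])   # no meta is written
```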
Streaming mode (streaming mode doesn't support meta share):
If type meta hasn't been written before, the data will be written as:
| unsigned varint: 0b11111111 | type def |
If type meta has been written before, the data will be written as:
| unsigned varint: written index << 1 |
written index is the id in captured_type_defs.
With this mode, meta start offset can be omitted.
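A rough sketch of the streaming-mode write path follows. It assumes type defs are opaque `bytes` written inline and that the unsigned varint uses 7 data bits per byte with a continuation bit; `encode_var_uint` and `write_streaming_type_meta` are hypothetical helper names.

```python
NEW_TYPE_DEF_MARKER = 0b11111111  # marker value for "type def follows inline"


def encode_var_uint(value: int) -> bytes:
    """Encode value as an unsigned varint (assumed: 7 bits per byte, MSB = continue)."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def write_streaming_type_meta(buf: bytearray, captured_type_defs: dict, type_def: bytes) -> None:
    """Write the marker plus type def on first sight, otherwise index << 1."""
    index = captured_type_defs.get(type_def)
    if index is None:
        captured_type_defs[type_def] = len(captured_type_defs)
        buf.extend(encode_var_uint(NEW_TYPE_DEF_MARKER))
        buf.extend(type_def)
    else:
        buf.extend(encode_var_uint(index << 1))
```

Since `index << 1` is always even and the marker is odd, a reader can tell the two cases apart from the decoded varint alone.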
Normal mode and meta share mode forbid streaming writes, because the writer needs to look back and update the meta
start offset after the whole object graph has been written and meta collection is finished. Only in this way can we
ensure that a deserialization failure in meta share mode doesn't lose shared meta.
Type Def
Here we mainly describe the meta layout for schema evolution mode:
| 8 bytes meta header | variable bytes | variable bytes | variable bytes |
+-------------------------------+--------------------+-------------------+----------------+
| 7 bytes hash + 1 bytes header | current type meta | parent type meta | ... |
Type meta is encoded from parent type to leaf type; only types with serializable fields are encoded.
Meta header
Meta header is a 64-bit number encoded in little-endian order.
The lowest 4 bits, 0b0000~0b1110, record num classes. 0b1111 is reserved to indicate that Fury needs to
read more bytes for the length using Fury unsigned int encoding. If the current type doesn't have a parent type, or
the parent type doesn't have fields to serialize, or we're in a context which serializes fields of the current type
only, num classes will be 1.
The 5th bit indicates whether this type needs schema evolution.
The other 56 bits store the unique hash of flags + all layers type meta.
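As an illustration of the bit layout, the header can be packed and unpacked as below. Two points are assumptions not stated above: the 56-bit hash is placed in the upper 7 bytes (bits 8 to 63), and bits 5 to 7 are left as zero. The extended-length case (0b1111) is not covered, and the function names are hypothetical.

```python
import struct

NUM_CLASSES_MASK = 0b1111        # lowest 4 bits
SCHEMA_EVOLUTION_BIT = 1 << 4    # the 5th bit
HASH_SHIFT = 8                   # assumption: hash occupies bits 8..63


def pack_meta_header(num_classes: int, schema_evolution: bool, hash56: int) -> bytes:
    """Pack the 64-bit meta header and return it as 8 little-endian bytes."""
    if not 0 <= num_classes <= 0b1110:
        raise ValueError("extended num classes (0b1111) is not covered by this sketch")
    if not 0 <= hash56 < 1 << 56:
        raise ValueError("hash must fit in 56 bits")
    header = num_classes
    if schema_evolution:
        header |= SCHEMA_EVOLUTION_BIT
    header |= hash56 << HASH_SHIFT
    return struct.pack("<Q", header)


def unpack_meta_header(data: bytes):
    """Return (num_classes, schema_evolution, hash56) from the first 8 bytes."""
    (header,) = struct.unpack("<Q", data[:8])
    return (
        header & NUM_CLASSES_MASK,
        bool(header & SCHEMA_EVOLUTION_BIT),
        header >> HASH_SHIFT,
    )
```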
Single layer type meta
| unsigned varint | var uint | field info: variable bytes | variable bytes | ... |
+-----------------+----------+-------------------------------+-----------------+-----+
| num_fields | type id | header + type id + field name | next field info | ... |
num fields: encode num fields as an unsigned varint.
If the current type is schema consistent, num_fields will be 0 to flag it.
If the current type isn't schema consistent, num_fields will be the number of compatible fields. For example,
users can use tag ids to mark some fields as compatible fields in a schema consistent context. In such cases, schema
consistent fields are serialized first, followed by compatible fields. At deserialization,
Fury uses the field info of fields which aren't annotated with a tag id to deserialize schema consistent
fields, then uses the field info in the meta to deserialize compatible fields.
type id: the registered id for the current type, which will be written as an unsigned varint.
field info:
header (8 bits): 3 bits size + 2 bits field name encoding + polymorphism flag + nullability flag + ref tracking flag.
Users can use annotations to provide this info.
2 bits field name encoding: UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID.
If tag id is used, i.e. the field name is written as an unsigned varint tag id, the 2 bits encoding will be 11.
size of field name:
The 3 bits size: values 0~6 indicate lengths 1~7; the value 7 indicates that more bytes must be read, and
the encoding will encode size - 7 as a varint next.
If the encoding is TAG_ID, the num_bytes of the field name is used to store the tag id.
ref tracking: when set to 1, ref tracking will be enabled for this field.
nullability: when set to 1, this field can be null.
polymorphism: when set to 1, the actual type of the field will be the declared field type even if the type is
not final.
field name: If tag id is set, the tag id will be used instead. Otherwise the meta string encoding [length] and data
will be written.
type id:
For registered type-consistent classes, it will be the registered type id.
Otherwise it will be encoded as OBJECT_ID if the type isn't final and FINAL_OBJECT_ID if it's final. The
meta for such types is written separately instead of being inlined here, to reduce meta space cost if objects of
this type are serialized in the current object graph multiple times; the field value may also be null.
Field order is left as an implementation detail and is not exposed in the specification; deserialization needs to
re-sort fields based on the Fury field comparator. In this way, Fury can compute statistics for field names or types
and use a more compact encoding.
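The 8-bit field info header described above can be sketched as follows. Several details are assumptions made only for illustration: the bit positions (size in the top 3 bits, then the encoding, then the three flags in the order listed), the numeric values of the first three encodings (only TAG_ID = 0b11 is given above), and reading "size" in "size - 7" as the stored value, i.e. name length minus 1. `pack_field_header` is a hypothetical helper, and the TAG_ID case where the size bits carry the tag id is not handled.

```python
FIELD_NAME_ENCODINGS = {
    "UTF8": 0b00,                       # assumed value
    "ALL_TO_LOWER_SPECIAL": 0b01,       # assumed value
    "LOWER_UPPER_DIGIT_SPECIAL": 0b10,  # assumed value
    "TAG_ID": 0b11,                     # given in the text above
}


def pack_field_header(name_len: int, encoding: str, polymorphic: bool,
                      nullable: bool, track_ref: bool):
    """Return (header_byte, extra); extra is the varint payload or None."""
    if name_len < 1:
        raise ValueError("field name must have at least one byte")
    stored = name_len - 1            # assumption: lengths 1~7 map to 0~6
    size_bits = min(stored, 7)       # 7 means "read more bytes"
    header = (
        size_bits << 5
        | FIELD_NAME_ENCODINGS[encoding] << 3
        | int(polymorphic) << 2
        | int(nullable) << 1
        | int(track_ref)
    )
    # Assumption: the varint that follows carries stored - 7.
    extra = stored - 7 if size_bits == 7 else None
    return header, extra
```

For example, `pack_field_header(3, "UTF8", False, True, False)` yields the header byte `0b01000010` with no extra varint, while a 10-byte name sets the size bits to 7 and, under the assumption above, an extra value of 2.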
Other layers type meta
Same encoding algorithm as the previous layer.
Is your feature request related to a problem? Please describe
Feature Request
Is your feature request related to a problem? Please describe
No response
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
1556