Make serialized representation produced by FieldSerializer more compact for fields of non-final types

GoogleCodeExporter commented 8 years ago

Currently, FieldSerializer handles fields of non-final types in a following way:
- it uses ObjectField for such fields
- it dynamically obtains a serializer for a value of the field
- it uses writeClass() to explicitly write a class id before serializing value 
of the field

This approach works, but I think in many cases writing class id before all 
fields of non-final types could be eliminated and thus save space. For example, 
if a field is of type java.util.Map, it usually would use the same serializer 
(e.g. MapSerializer), even though the type is not final.

Here is a description of my proposal:
- For each field remember the valueClass even if it is non-final
- During serialization, obtain a serializer based on a real value of the field. 
Compare it with a serializer which corresponds to the remembered valueClass.
- If both serializers are the same, there is no need to output an explicit type 
id using writeClass(). Instead a special flag is written (e.g. a single byte 
value, which is not a valid class id. May be 0 can be used).
- If serializers differ, then a class id of a serializer for a real value is 
written as usual.
- The serializer for a real value of the field is invoked as usual.

- During deserialization, when reading a value of a field of non-final type, 
read the class id. If it is a special value (e.g. 0 proposed above), it 
indicates that serializer is the same as a statically known. If it is a valid 
class id, then a corresponding serializer is first looked up.
- Obtained serializer is used as usual.

Benefits:
- At most one byte is used for encoding types of fields which have non-final 
types, if field's value real type uses the same serializer as this field's 
declared non-final type. And this probably covers the majority of cases.
- The same encoding as before is used for types of fields which have non-final 
types, if field's value real type uses different serializer than this field's 
declared non-final type

Possible optimizations that result in an even smaller serialized 
representations:
- Since we only need a single boolean to indicate that field's value real type 
uses the same serializer as field's declared non-final type, we could use a 
trick here. We could output a byte (or long or a bitset in a general case) as a 
header before serialized representation of an object. Each bit "i" in this 
bitset represents the flag which indicates that field's value real type uses 
the same serializer as field's declared non-final type. If it is set to 1, then 
serializers are the same. Otherwise they are different. 
- During serialization, a header is computed using the rules described in the 
previous paragraph and output 
before the object. There is no need now to output a special flag (e.g. 0) 
before each field where field's value real type uses the same serializer as 
field's declared non-final type.
- During deserialization, the header is read and then read() method know when 
it needs to read a class id and where it can skip this step, because the 
(de)serializer is already (statically) known.

What do you think about this proposal?

-Leo

Original issue reported on code.google.com by romixlev on 21 Aug 2013 at 4:22

GoogleCodeExporter commented 8 years ago

It's a neat idea.

The largest savings comes from avoiding writing class name strings. This would 
be a good improvement. Using class name strings at all is a suboptimal path, 
but sometimes can't be avoided. The first byte DefaultClassResolver writes is 
either for null, to denote the next bytes are a class name, or a class ID. We 
could add another value to denote if the default serializer should be used.

This feature should be handled by the framework, I don't think it needs to 
affect serializers.

Using a bitset header to make serialization more efficient could be nice, 
though it complicates the serializer implementation.

Recently FieldSerializer's complexity has increased quite a lot. I think it 
should be made easier to follow if possible, especially before adding new, 
complex features. I would love to see functionality separated out of the 1000+ 
line class if it makes sense. Possibly there could be a utility class that 
deals with generics. Maybe another that deals with cached fields. Maybe it 
makes sense to have a separate cached field utility for unsafe. Maybe 
FieldSerializer has an instance of these classes and they are configured 
directly, or maybe multiple FieldSerializer-like classes is better. What do you 
think?

Original comment by nathan.s...@gmail.com on 22 Aug 2013 at 7:20

GoogleCodeExporter commented 8 years ago

> This feature should be handled by the framework, I don't think it needs to 
affect serializers.

It depends. If you consider those serializers which were registered using 
addDefaultSerializer or setDefaultSerializer as default ones, then yes, we 
don't need to affect serializers. But if we have something like an instance of 
a  map serializer with default keyType or valueType set explicitly for it (e.g. 
as a guess based on probing the first entry and/or by inspecting generic type 
parameters), then "default" means a serializer specific setting and I'm not 
sure that it can be handled completely by the framework without involving some 
logic inside a serializer. But this is just a feeling at the moment. I need to 
get my hands dirty with the implementation to better understand it.

> Using a bitset header to make serialization more efficient could be nice, 
though it complicates the 
> serializer implementation.

True. That's why I listed it a possible improvement, but not as a main 
embodiment of the idea. It may require "backpatching" of the header after all 
fields are processed, which could be difficult if the whole object does not fit 
into a buffer.

Regarding the complexity of the FieldSerializer: 
I totally agree that it is way too complex now. I could be a good idea to split 
it into parts, each handling its own feature (e.g. generics, cached fields, 
unsafe, etc). I'll see what can be done.

Original comment by romixlev on 22 Aug 2013 at 7:51

GoogleCodeExporter commented 8 years ago

I did a first step in re-factoring FieldSerializer. A lot of generics-related 
and unsafe-related logic is moved into dedicated classes. FieldSerializer is 
only about 620 lines of code now. All unit tests are still green. The code is 
already committed.

Original comment by romixlev on 22 Aug 2013 at 11:27

GoogleCodeExporter commented 8 years ago

I have some progress on a more compact representation - the simple version 
described in the proposal seems to work fine for FieldSerializer.

I'm looking now into a way how we handle references. Currently, we write one 
byte for each new object. In principle, it could be avoided. We always know 
when a new object starts and therefore we don't need a special marker for it. 
Instead, we could write a special marker followed by the number of the object 
only when we really refer to a previously seen object. 

Together, it would result in something like:
- Each object of a primitive class is written without any headers
- If we have an object whose real class is the same as its declared class, we 
just write a special tag to indicate this
- if we have an object whose real class is different from its declared class, 
we write a special tag indicating that a class id is following
- If we refer to a previously seen object, we write a special tag indicating 
that it is a reference and then the number of the object

When we deserialize, at any place where we expect an object we check the first 
byte which contains a tag:
- if it is a reference tag, then it is a reference to a previously seen object
- if it is a special tag used to indicate that a real class is the same as the 
expected/declared class, then we know that we read an object of the expected 
class
- if it is a class id tag, we read a real class of the object after it

The difference from the current approach is:
- We do not output a dedicated "reference tag" byte before each object
- We output a class id tag before each class id, which is written (one could 
say that we reuse the reference tag byte from a previous bullet for a different 
purpose)
- If class id is the same as expected/declared class, only a single byte tag is 
written, without writing a real class id
- We output a dedicated "reference tag" before each real reference inside 
object graph

As far as I can see, the old approach always produces a representation which is 
at least as long as a new proposed representation.

Original comment by romixlev on 21 Oct 2013 at 10:13

GoogleCodeExporter commented 8 years ago

When we deserialize, when we expect an object, if no tag was written how do we 
know if they byte we read is a tag or data? Are only primitives (and strings) 
written without a tag?

Most class IDs (0 to 127) are 1 byte. It seems with the proposal we write out 
fewer bytes in some cases and more bytes in others. Not writing bytes for 
primitives would likely be worthwhile.

Original comment by nathan.s...@gmail.com on 21 Oct 2013 at 12:42

GoogleCodeExporter commented 8 years ago

> When we deserialize, when we expect an object, if no tag was written how do 
we know if they byte 
> we read is a tag or data? Are only primitives (and strings) written without a 
tag?

Yes. This is the idea. Only fields of primitive types don't have a tag. But we 
know this statically, so we we know that we expect data at this place.

> Most class IDs (0 to 127) are 1 byte. It seems with the proposal we write out 
fewer bytes in some 
> cases and more bytes in others. 

I'd say that in case when references enabled, the new proposal is strictly 
better in most cases (because we don't write a reference byte before each 
object, but write a byte before each class id, i.e. we don't introduce any 
additional bytes). In case, where we don't use references, it could be that the 
old approach is better, because we don't write any reference tags at all and we 
don't write any additional tags for class ids.

> Not writing bytes for primitives would likely be worthwhile.

Sure. But we don't write them even now, or? I.e. we don't write any tags for 
ints, floats, etc. And we don't write any tags for fields declared using final 
classes anyway. So, which use-cases do you have in mind here?

Original comment by romixlev on 21 Oct 2013 at 2:52

GoogleCodeExporter commented 8 years ago

> So, which use-cases do you have in mind here?
Sorry, you're right. I was confused trying to figure out how the proposal is 
better. How exactly is the proposal better? What does writing a class ID tag 
buy us if we assume a class ID is 1 byte (which is common)?

Writing more bytes when references are disabled isn't great. So far that has 
been the most efficient mode, both in speed and size.

Original comment by nathan.s...@gmail.com on 21 Oct 2013 at 3:08

GoogleCodeExporter commented 8 years ago

As I mentioned, it brings some obvious advantages in case when references are 
enabled. But in case when they are disabled and we have only 1-byte class ids, 
it would not bring any benefits. 
So, we should probably use it only when references are enabled, if at all.

Or may be we can avoid the explicit byte for a class id tag . We can just treat 
class ids as a special kind of tags. The first few ids are reserved as special 
tags (e.g. REFERENCE tag, SAME AS DECLARED CLASS tag, etc). All others values 
are simply treated as class ids. This is similar to what we do now. The 
difference will be that we don't write reference tag byte before each object. 
We would write it only  when we really refer to an object. And we don't write 
an explicit class id, if the real class is the same as a declared class.  Now 
it seems to be at least as efficient as our current approach or better in case 
when references are enabled. I think it covers all cases now, but I need to 
think a bit more about it.

Original comment by romixlev on 21 Oct 2013 at 3:23

GoogleCodeExporter commented 8 years ago

The main improvement would be then that we use the same byte to indicate a 
reference tag and class ids, which is different from our current approach where 
we use separate bytes for it. It would also mean that we would need  to 
read/write references and class ids using a single method, because it would 
need to read/write a byte and treat it as a reference or class id depending on 
its value.

Original comment by romixlev on 21 Oct 2013 at 3:35

GoogleCodeExporter commented 8 years ago

> As I mentioned, it brings some obvious advantages in case when references are 
enabled.

It still isn't clear to me what the advantages are. You haven't explicitly 
stated them.

> The difference will be that we don't write reference tag byte before each 
object. We would write it only when we really refer to an object.

Not sure what you mean here, sorry. When references are on, we currently write 
a varint that both indicates null/not null and the reference ID.

Original comment by nathan.s...@gmail.com on 21 Oct 2013 at 3:39

GoogleCodeExporter commented 8 years ago

>> The difference will be that we don't write reference tag byte before each 
object. We would write it 
>> only when we really refer to an object.

> Not sure what you mean here, sorry. When references are on, we currently 
write a varint that both 
>indicates null/not null and the reference ID.

Currently, when references are on, we write a varint (single byte) before EACH 
object we see for the first time, just to mark it. We do it even if this object 
is never referenced later. The new proposal avoids doing it. It only writes a 
dedicated tag (single byte) and a reference ID (varint) every time a real 
reference is discovered. Since in most cases no real references are discovered 
in the object graph, we save a byte per object for almost all objects.

Original comment by romixlev on 21 Oct 2013 at 4:27

GoogleCodeExporter commented 8 years ago

[deleted comment]

GoogleCodeExporter commented 8 years ago

Got it, thanks. The reference ID we currently write can also mean null/not 
null, so writeObjectOrNull won't be able to avoid writing a byte before the 
object data when serializer.getAcceptsNull() is false. A number of serializers 
accept null, but FieldSerializer doesn't (for now, maybe in the future it uses 
a bitset for nulls), which means most objects need a null denoting byte. This 
limits the byte savings from the proposal. Still, I like the idea and if we can 
save bytes for serializers that accept null, that would be great. However, I 
worry about how it might impact the API, since a method like writeClass 
currently doesn't need to know about references. Having Kryo handle references 
relatively transparently (there is just the Kryo#reference() magic) is nice 
API-wise.

Original comment by nathan.s...@gmail.com on 21 Oct 2013 at 11:00

GoogleCodeExporter commented 8 years ago

Yet another improvement which I already implemented in my local branch is like 
this:
- If registration of classes is not required by a Kryo instance, we currently 
output a class name and assigned class ID, when we see this class for the first 
time and later we output 2 bytes, every time we see this class. First byte has 
a value of NAME (which indicates that it is a non-registered class) and the 
second byte (or more precisley varint) contains the id of this unregistered 
class. 

- What I did is to use positive and negative numbers as class ids. If class id 
is positive, then it is a usual registred class ID. But if we need to encode a 
non-registered class, we simply use a negative id.

- This approach is able to cover most of the typical cases using only 1 byte, 
independent of the fact if a given class is registered or not, which is an 
improvement.

- Small disadvantage: since it uses one bit for a sign, it can now encode in 
one byte only a half of registered classes, which could be encoded with the 
current approach, i.e. only 64 classes.

Original comment by romixlev on 31 Oct 2013 at 4:32

GoogleCodeExporter commented 8 years ago

Kryo was originally optimized for a networking protocol, so registering all 
classes, not using references, etc. Keeping the number of registered class IDs 
that can be representing with 1 byte was more important than not using 
registration, which is already not a fast path since we are writing class 
names. I could go either way on your change, though I lean a bit toward not 
optimizing when registration is disabled if it affects the faster paths.

Original comment by nathan.s...@gmail.com on 31 Oct 2013 at 8:51

Candyjing1024 / kryo

Make serialized representation produced by FieldSerializer more compact for fields of non-final types #127