More efficient serialization of element's types for collections and maps

GoogleCodeExporter commented 8 years ago

Currently, CollectionSerializer provides methods for setting the element class 
and a dedicated element serializer, but they must be invoked explicitly.

By default, CollectionSerializer writes type information for each element.

But in some cases it is possible to obtain the information about the type of 
elements dynamically using reflection.

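For example, the following code 
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ClassWithIntList {
  List<Integer> iList = new ArrayList<Integer>();
  Map<String, Long> iMap;
}

// Inside any method:
ClassWithIntList o = new ClassWithIntList();
Class<?> iClass = o.getClass();
// getGenericType() keeps the declared type arguments, unlike getType().
for (Field field : iClass.getDeclaredFields()) {
    System.out.println(field.getName() + " :" + field.getGenericType());
}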

would print:
iList :java.util.List<java.lang.Integer>
iMap :java.util.Map<java.lang.String, java.lang.Long>

Therefore, it is possible to derive the static types of collection elements for 
fields. With this information a more efficient encoding can be used: the type 
info is written only once if the static type of the elements is final or 
primitive, which is a very typical use case.

As the example shows, a similar approach can also be used for Map serialization.
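A minimal sketch of how the element types could be pulled out of a field's declared generic type, using only java.lang.reflect. The class GenericFieldTypes and its typeArguments helper are hypothetical names for illustration and are not part of Kryo.

```java
import java.lang.reflect.Field;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class ClassWithIntList {
    List<Integer> iList = new ArrayList<Integer>();
    Map<String, Long> iMap;
}

// Hypothetical helper, not a Kryo class.
class GenericFieldTypes {
    /** Returns the declared type arguments of a field, or an empty array if it is not parameterized. */
    static Type[] typeArguments(Class<?> owner, String fieldName) {
        try {
            Field field = owner.getDeclaredField(fieldName);
            Type generic = field.getGenericType();
            if (generic instanceof ParameterizedType) {
                return ((ParameterizedType) generic).getActualTypeArguments();
            }
            return new Type[0];
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // One type argument for the List field, two (key, value) for the Map field.
        for (String name : new String[] {"iList", "iMap"}) {
            System.out.println(name + " -> "
                + Arrays.toString(typeArguments(ClassWithIntList.class, name)));
        }
    }
}
```

For a Map field the same call yields the key type and the value type, in that order, so both could be treated the way a collection's single element type is.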

Original issue reported on code.google.com by romixlev on 6 Jun 2012 at 10:26

GoogleCodeExporter commented 8 years ago

Agreed, this is interesting.

Original comment by nathan.s...@gmail.com on 6 Jun 2012 at 12:09

Changed state: Accepted

GoogleCodeExporter commented 8 years ago

Note that in some cases there are no savings. E.g.:
class SomeClass {
   ArrayList<Integer> values;
}
If references=false...
The old way: The class ID (often 1 byte) must be written for each element, or 
zero for null.
The new way: A 0 or 1 byte must be written for each element to indicate null.

There are many cases where it is more efficient though. If references=true then 
the null byte is not needed and we save (usually) 1 byte per item in the list.
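The byte counts can be illustrated with a toy encoder. This is only a sketch under loud assumptions: it is not Kryo's wire format, the 1-byte IDs are hypothetical, and ints are written as fixed 4 bytes for simplicity. The third method models the situation where no per-element null byte is needed; how references would signal null is not modelled here.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

// Toy model only; names and IDs are hypothetical, not Kryo's format.
class ElementTypeEncodingSketch {
    static final byte NULL_ID = 0;     // hypothetical marker for null
    static final byte INTEGER_ID = 1;  // hypothetical 1-byte class ID for Integer

    /** "Old way": a class ID byte per element, with 0 standing for null. */
    static byte[] writeClassIdPerElement(List<Integer> values) {
        ByteBuffer out = ByteBuffer.allocate(values.size() * 5);
        for (Integer v : values) {
            if (v == null) {
                out.put(NULL_ID);
            } else {
                out.put(INTEGER_ID);
                out.putInt(v);
            }
        }
        return Arrays.copyOf(out.array(), out.position());
    }

    /** "New way": element type known from the field, but a 0/1 null marker per element. */
    static byte[] writeNullMarkerPerElement(List<Integer> values) {
        ByteBuffer out = ByteBuffer.allocate(values.size() * 5);
        for (Integer v : values) {
            if (v == null) {
                out.put((byte) 0);
            } else {
                out.put((byte) 1);
                out.putInt(v);
            }
        }
        return Arrays.copyOf(out.array(), out.position());
    }

    /** Element type known and no null marker needed: values only (rejects null elements). */
    static byte[] writeValuesOnly(List<Integer> values) {
        ByteBuffer out = ByteBuffer.allocate(values.size() * 4);
        for (Integer v : values) {
            out.putInt(v); // throws NullPointerException on a null element
        }
        return out.array();
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3);
        System.out.println("class ID per element:    " + writeClassIdPerElement(values).length);
        System.out.println("null marker per element: " + writeNullMarkerPerElement(values).length);
        System.out.println("values only:             " + writeValuesOnly(values).length);
    }
}
```

In this model the first two encodings always come out the same size, and the third is smaller by one byte per element.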

Original comment by nathan.s...@gmail.com on 6 Jun 2012 at 1:44

GoogleCodeExporter commented 8 years ago

An issue with implementing this is that the serializer registered with Kryo 
can't be used, since we need to configure the serializer for use with a 
specific field by setting the element type. Creating a new serializer is not a 
problem. However, currently the only time FieldSerializer caches the serializer 
for a field is when the field's type is final; otherwise the value of the field 
could be a subclass of the field's type. We don't want to create and configure 
a new serializer each time. Should we cache a serializer for the common case 
where the field value's concrete type matches the field's type? Should we keep 
a map or list of serializers for each field as we encounter values of various 
concrete types?
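The "map of serializers for each field" option could look roughly like the sketch below. PerFieldSerializerCache is a hypothetical class, and the type parameter S stands in for whatever configured serializer type would be cached; nothing here is Kryo's actual implementation, and the sketch is not thread-safe.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical per-field cache keyed by the concrete type of the field's value. */
class PerFieldSerializerCache<S> {
    private final Map<Class<?>, S> byConcreteType = new HashMap<>();
    private final Function<Class<?>, S> factory; // creates and configures a serializer
    private int created;

    PerFieldSerializerCache(Function<Class<?>, S> factory) {
        this.factory = factory;
    }

    /** Returns the cached serializer for this concrete type, creating it on first use. */
    S get(Class<?> concreteType) {
        return byConcreteType.computeIfAbsent(concreteType, type -> {
            created++;
            return factory.apply(type);
        });
    }

    int createdCount() {
        return created;
    }
}

class PerFieldSerializerCacheDemo {
    public static void main(String[] args) {
        // A String stands in for a configured serializer in this demo.
        PerFieldSerializerCache<String> cache =
            new PerFieldSerializerCache<>(type -> "serializer for " + type.getSimpleName());
        cache.get(ArrayList.class);
        cache.get(ArrayList.class);  // served from the cache
        cache.get(LinkedList.class); // a second concrete type for the same field
        System.out.println("serializers created: " + cache.createdCount());
    }
}
```

Caching only the case where the concrete type equals the declared type would be the same idea with a single slot instead of a map.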

Original comment by nathan.s...@gmail.com on 6 Jun 2012 at 2:01

GoogleCodeExporter commented 8 years ago

Momentarily ignoring how to cache serializers for polymorphism, there is a 
problem with creating a new serializer so it can be customized for a specific 
field. Serializers can be configured and registered with Kryo for a type. 
FieldSerializer needs to make a copy of a configured serializer, configure it 
with generic type information, and use that instead of the original serializer. 
There is currently no mechanism to copy a configured serializer. How would this 
work? A clone method for serializers would make implementing new serializers 
clunky.

I thought about setting generic type info on the original configured 
serializer, using it to serialize the field's value, then clearing the generic 
type info. This would work except that a serializer may be reentrant. E.g., an 
ArrayList<ArrayList<Integer>> could use the same CollectionSerializer for all 
the lists. The generic type info could be stored in a local variable at the 
start of serialization so reentrant calls can't corrupt the state, but this 
makes an odd API for serializers that want to access generic type info.
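The local-variable idea can be sketched with a toy serializer that only logs which element type it sees. ReentrantGenericsSketch, setNextGenerics and NestedHolder are hypothetical names for illustration, not Kryo classes or methods.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

class NestedHolder {
    ArrayList<ArrayList<Integer>> nested;
}

// Toy stand-in for a shared collection serializer; it records instead of writing bytes.
class ReentrantGenericsSketch {
    private Type[] nextGenerics;                 // applies to the next write() call only
    final List<String> log = new ArrayList<>();

    void setNextGenerics(Type[] generics) {
        nextGenerics = generics;
    }

    void write(Collection<?> values) {
        // Copy into a local and clear the field so nested calls cannot corrupt it.
        Type[] generics = nextGenerics;
        nextGenerics = null;
        Type elementType = (generics != null && generics.length == 1) ? generics[0] : null;
        log.add("elements: " + (elementType == null ? "unknown" : elementType.getTypeName()));
        for (Object value : values) {
            if (value instanceof Collection) {
                // Reentrant call on the same instance, configured from the local copy.
                if (elementType instanceof ParameterizedType) {
                    setNextGenerics(((ParameterizedType) elementType).getActualTypeArguments());
                }
                write((Collection<?>) value);
            }
        }
    }

    /** Reads the declared type arguments of a field; empty if it is not parameterized. */
    static Type[] declaredGenerics(Class<?> owner, String fieldName) {
        try {
            Type t = owner.getDeclaredField(fieldName).getGenericType();
            return t instanceof ParameterizedType
                ? ((ParameterizedType) t).getActualTypeArguments()
                : new Type[0];
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        ArrayList<ArrayList<Integer>> data = new ArrayList<>();
        data.add(new ArrayList<>(Arrays.asList(1, 2)));
        data.add(new ArrayList<>(Arrays.asList(3)));
        ReentrantGenericsSketch serializer = new ReentrantGenericsSketch();
        serializer.setNextGenerics(declaredGenerics(NestedHolder.class, "nested"));
        serializer.write(data);
        serializer.log.forEach(System.out::println);
    }
}
```

Because the outer call works from its local copy, the second inner list still receives the right element type even though the first inner call already consumed and cleared the shared field.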

Original comment by nathan.s...@gmail.com on 6 Jun 2012 at 4:15

GoogleCodeExporter commented 8 years ago

I added a Serializer#setGenerics(Type[]) method. This sets the generic types 
that the next call to Serializer read/write can use. It is a little odd to have 
this only apply to the next call, but this solves both needing to cache 
serializers and reentrant serializers. Most serializers won't need to worry 
about setGenerics anyway, and this feature makes for nice savings in some 
situations.

Note that there is only a benefit for generic types that are final. When not 
final, each item in the collection or map could be an instance of a subclass, 
so the type of each item must be written.
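Whether a declared element type is final can be checked with java.lang.reflect.Modifier. The sketch below shows only that check; ElementTypeFinality and canOmitPerElementType are hypothetical names, and this is not necessarily how Kryo decides.

```java
import java.lang.reflect.Modifier;

// Hypothetical helper, not a Kryo class.
class ElementTypeFinality {
    /** True if no subclass can exist, so every non-null element has exactly this type. */
    static boolean canOmitPerElementType(Class<?> declaredElementType) {
        return Modifier.isFinal(declaredElementType.getModifiers());
    }

    public static void main(String[] args) {
        System.out.println("Integer: " + canOmitPerElementType(Integer.class)); // final class
        System.out.println("String:  " + canOmitPerElementType(String.class));  // final class
        System.out.println("Number:  " + canOmitPerElementType(Number.class));  // abstract, has subclasses
        System.out.println("Object:  " + canOmitPerElementType(Object.class));  // anything may appear
    }
}
```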

Original comment by nathan.s...@gmail.com on 7 Jun 2012 at 6:16

Changed state: Fixed

GoogleCodeExporter commented 8 years ago

This issue was closed by revision r264.

Original comment by nathan.s...@gmail.com on 7 Jun 2012 at 6:18

gaob13 / kryo

More efficient serialization of element's types for collections and maps #68