datasalt / pangool

Tuple MapReduce for Hadoop: Hadoop API made easy
http://datasalt.github.io/pangool/
Apache License 2.0

Closing the loop of serialization (lists, sets, maps) #37

Open pereferrera opened 10 years ago

pereferrera commented 10 years ago

Something that has stayed on my mind is the possibility of closing the loop and giving Pangool all the convenient serialization features. The ones still missing are Lists and Maps (a Set being a particular case of a Map).

Currently it is possible to serialize them using Avro, but the integration code required doesn't look very nice. Pangool could add a wrapper that delegates the serialization to Avro to make this a little nicer, but then it wouldn't be possible to serialize Lists of arbitrary Objects.

While it is true that built-in serialization for everything was never Pangool's main goal, there is no reason not to implement new features that pay off, are easy to implement and fit the existing codebase.

What's more, looking at the current code, it doesn't seem difficult to add proper built-in serialization support for (typed) Lists or Maps. A custom FieldSerialization could be implemented that writes the list length first and then calls the delegate code in SimpleTupleSerializer to serialize the typed list values.

This would allow for arbitrarily typed lists, with the element type defined by a Pangool Field, so the method in Field would be something like:

public static Field createListField(String name, Field type)

It would therefore be possible to serialize lists of lists of lists, lists of Tuples, or anything else this recursion allows.
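To make the idea concrete, here is a minimal sketch of the "length first, then delegate per element" scheme, written against plain java.io rather than Pangool's actual classes. `ElementSerde`, `ListSerdeSketch`, `listOf` and `roundTrip` are hypothetical names used only for illustration; they are not part of Pangool, and the real implementation would go through FieldSerialization and SimpleTupleSerializer. A fixed 4-byte length is an assumption made for simplicity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a per-type serializer; NOT a Pangool interface.
interface ElementSerde<T> {
  void write(T value, DataOutput out) throws IOException;
  T read(DataInput in) throws IOException;
}

final class ListSerdeSketch {

  // Leaf serializer for ints, standing in for a primitive field type.
  static final ElementSerde<Integer> INT = new ElementSerde<Integer>() {
    public void write(Integer value, DataOutput out) throws IOException {
      out.writeInt(value);
    }
    public Integer read(DataInput in) throws IOException {
      return in.readInt();
    }
  };

  // Wraps any element serializer into a list serializer. Because the result
  // is itself an ElementSerde, it can be wrapped again (lists of lists).
  static <T> ElementSerde<List<T>> listOf(final ElementSerde<T> element) {
    return new ElementSerde<List<T>>() {
      public void write(List<T> list, DataOutput out) throws IOException {
        out.writeInt(list.size());      // list length first
        for (T item : list) {
          element.write(item, out);     // then delegate each element
        }
      }
      public List<T> read(DataInput in) throws IOException {
        int size = in.readInt();
        List<T> list = new ArrayList<T>(size);
        for (int i = 0; i < size; i++) {
          list.add(element.read(in));
        }
        return list;
      }
    };
  }

  // Serializes a value to bytes and reads it back with the same serializer.
  static <T> T roundTrip(ElementSerde<T> serde, T value) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      serde.write(value, new DataOutputStream(bytes));
      return serde.read(
          new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

With this shape, `ListSerdeSketch.listOf(ListSerdeSketch.listOf(ListSerdeSketch.INT))` handles a list of lists of ints, which mirrors how a `createListField(name, type)` whose `type` is itself a list Field would nest.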

Open questions would then be: