datasalt / pangool

Tuple MapReduce for Hadoop: Hadoop API made easy
http://datasalt.github.io/pangool/
Apache License 2.0

Thrift using different serialization protocols #19

Open ivanprado opened 11 years ago

ivanprado commented 11 years ago

Right now Pangool serializes Thrift objects using TBinaryProtocol. It could be interesting to support TCompactProtocol as well, since it uses less space. The idea is to make the choice of protocol configurable.
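To illustrate where the space saving comes from: TBinaryProtocol writes every i32 as 4 fixed bytes, while TCompactProtocol writes it as a zigzag-encoded varint. The sketch below reimplements just that integer encoding with the JDK only; it is not the Thrift library, and the class and method names are made up for the example.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: mimics how an i32 is laid out by the two protocols.
public class VarintSizeSketch {

    // Fixed-width, big-endian: always 4 bytes (TBinaryProtocol style).
    public static byte[] fixed32(int v) {
        return new byte[] {(byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v};
    }

    // ZigZag maps values of small magnitude (positive or negative) to small unsigned values.
    public static int zigZag(int v) {
        return (v << 1) ^ (v >> 31);
    }

    // Varint: 7 payload bits per byte, high bit set while more bytes follow (TCompactProtocol style).
    public static byte[] varint32(int v) {
        int n = zigZag(v);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((n & ~0x7F) != 0) {
            out.write((n & 0x7F) | 0x80);
            n >>>= 7;
        }
        out.write(n);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (int v : new int[] {1, -1, 300, 1_000_000, Integer.MAX_VALUE}) {
            System.out.println(v + ": fixed=" + fixed32(v).length
                    + " bytes, varint=" + varint32(v).length + " bytes");
        }
    }
}
```

Small values shrink to one or two bytes, while values near the int range limit take five, so the gain depends on the data.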

epalace commented 11 years ago

For map output this is easy: the protocol can be specified in the Configuration and read by ThriftSerialization. Sequence files containing Thrift objects in the key or the value are harder, because that can't be managed directly by SequenceFileInput/OutputFormat. New {Input/Output}Formats would have to be created, and the expected protocol would be specified via the Configuration or via the SequenceFile header. ThriftSerialization couldn't be used in this case: without object wrappers à la Avro (AvroKey, AvroValue), it can't tell whether it is dealing with an input, a map output or an output.
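A minimal sketch of the Configuration-based selection for the map-output case. Assumptions: `java.util.Properties` stands in for Hadoop's `Configuration`, and the key `pangool.thrift.protocol` is hypothetical (no such property exists in Pangool today). In real code the enum value would be mapped to `TBinaryProtocol.Factory` or `TCompactProtocol.Factory`.

```java
import java.util.Locale;
import java.util.Properties;

// Illustrative only: Properties plays the role of Hadoop's Configuration.
public class ProtocolConfigSketch {

    // Hypothetical key, not an existing Pangool or Hadoop property.
    public static final String PROTOCOL_KEY = "pangool.thrift.protocol";

    public enum ThriftProtocol { BINARY, COMPACT }

    // Defaults to BINARY so jobs that don't set the key keep the current behaviour.
    public static ThriftProtocol protocolFrom(Properties conf) {
        String name = conf.getProperty(PROTOCOL_KEY, "BINARY");
        // valueOf throws IllegalArgumentException for an unknown protocol name.
        return ThriftProtocol.valueOf(name.trim().toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println("default: " + protocolFrom(conf));
        conf.setProperty(PROTOCOL_KEY, "compact");
        System.out.println("configured: " + protocolFrom(conf));
    }
}
```

Defaulting to the binary protocol would keep existing jobs readable without any configuration change.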

ivanprado commented 11 years ago

It sounds reasonable.

Iván


ivanprado commented 11 years ago

That would be solved properly by implementing a custom field serializer for Thrift (http://pangool.net/userguide/custom_serialization.html). The metadata would store the format used to serialize this field, and this information would also be carried in the header of the TupleFile.
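A rough sketch of the header idea: the writer records the protocol name ahead of the payload, and the reader picks the protocol from the header instead of from the job Configuration. The layout below (a UTF string, a length, then the bytes) is invented for the example and is not the actual TupleFile format; the class and method names are also hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative only: a made-up file layout, not the real TupleFile header.
public class HeaderMetadataSketch {

    // Writes: protocol name (header metadata), payload length, payload bytes.
    public static byte[] write(String protocol, byte[] payload) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(protocol);
            out.writeInt(payload.length);
            out.write(payload);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads only the header, which is enough to choose the deserialization protocol.
    public static String readProtocol(byte[] file) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] file = write("COMPACT", new byte[] {1, 2, 3});
        System.out.println("protocol in header: " + readProtocol(file));
    }
}
```

Because the file describes itself, readers would not need to know in advance which protocol the writer used.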