IBMStreams / administration

Umbrella project for the IBMStreams organization. This project will be used for the management of the individual projects within the IBMStreams organization.

Parquet Toolkit #49

Closed apyasic closed 9 years ago

apyasic commented 10 years ago

Parquet is a columnar storage format for Hadoop. It is becoming increasingly popular because of its efficient compression and encoding schemes. See the Parquet home page for more details: http://parquet.io/

The Parquet toolkit will allow streaming applications to read and write data in Parquet format. The toolkit will be implemented in Java, and its initial version will contain a Sink operator.

mikespicer commented 10 years ago

+1 for a Parquet toolkit, but with format and parse operators rather than source/sink. On the whole, we are trying to separate the data format from the data transport by having separate format and parse operators that can be used with any source and sink, rather than combining the format step into the sink and the parse step into the source.
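To make the separation concrete, here is a minimal plain-Java sketch of the idea, not Streams operator code. The names `Formatter`, `BlobSink`, `CsvFormatter`, and `InMemorySink` are hypothetical and exist only for illustration; CSV stands in for the data format because it needs no external library.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class FormatTransportSketch {

    /** Hypothetical format step: turns a record into bytes and knows nothing about the destination. */
    interface Formatter<T> {
        byte[] format(T record);
    }

    /** Hypothetical transport step: accepts opaque blobs and knows nothing about their format. */
    interface BlobSink {
        void write(byte[] blob);
    }

    /** Illustrative formatter: one comma-separated line per record. */
    static final class CsvFormatter implements Formatter<String[]> {
        @Override
        public byte[] format(String[] fields) {
            return (String.join(",", fields) + "\n").getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Illustrative sink: collects blobs in memory instead of writing to a file system. */
    static final class InMemorySink implements BlobSink {
        private final ByteArrayOutputStream out = new ByteArrayOutputStream();

        @Override
        public void write(byte[] blob) {
            out.write(blob, 0, blob.length);
        }

        String contents() {
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    /** Wires any formatter to any sink; either side can be swapped independently. */
    static <T> void run(List<T> records, Formatter<T> formatter, BlobSink sink) {
        for (T record : records) {
            sink.write(formatter.format(record));
        }
    }

    /** Runs the pipeline on two sample records and returns what the sink received. */
    public static String demo() {
        List<String[]> records = Arrays.asList(
                new String[] {"a", "1"},
                new String[] {"b", "2"});
        InMemorySink sink = new InMemorySink();
        run(records, new CsvFormatter(), sink);
        return sink.contents();
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

Because the only contract between the two sides is a byte blob, this arrangement works only when the sink can accept blobs.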

hildrum commented 10 years ago

@mikespicer Our current HDFS source and sink operators don't work well with format and parse, as they cannot write blobs.

chanskw commented 10 years ago

+1... I also agree we should try to write a formatter and parser. For HDFSSource and HDFSSink, there are plans to support blobs.

hildrum commented 10 years ago

+1 on the toolkit. I would also prefer Formatter and Parser, if possible. A Format operator might be possible on the sink end.

However, from my quick look at Parquet, I'm not sure that's possible for the source, since the format is laid out so that which parts of the file you read depends on exactly what you want. So, depending on your output type, you may read only certain blocks.
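A toy Java sketch of that concern follows. `ToyColumnarFile` is a hypothetical, heavily simplified stand-in and does not reflect the real Parquet file layout or the parquet-mr API; it only shows that a reader which knows the requested columns touches fewer chunks than one handed the whole file as a blob.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ColumnProjectionSketch {

    /** Toy stand-in for a columnar file: each column's values live in a separate chunk. */
    static final class ToyColumnarFile {
        private final Map<String, int[]> chunks = new LinkedHashMap<>();
        private int chunksRead = 0;

        void addColumn(String name, int[] values) {
            chunks.put(name, values);
        }

        /** Reads one column chunk and counts the access. */
        int[] readChunk(String name) {
            int[] chunk = chunks.get(name);
            if (chunk == null) {
                throw new IllegalArgumentException("no such column: " + name);
            }
            chunksRead++;
            return chunk;
        }

        int chunksRead() {
            return chunksRead;
        }
    }

    /** Reads only the requested columns; all other chunks are never touched. */
    static Map<String, int[]> readProjected(ToyColumnarFile file, List<String> wanted) {
        Map<String, int[]> result = new LinkedHashMap<>();
        for (String column : wanted) {
            result.put(column, file.readChunk(column));
        }
        return result;
    }

    /** Builds a three-column file, requests one column, and reports how many chunks were read. */
    public static int demoChunksRead() {
        ToyColumnarFile file = new ToyColumnarFile();
        file.addColumn("id", new int[] {1, 2, 3});
        file.addColumn("price", new int[] {10, 20, 30});
        file.addColumn("qty", new int[] {5, 6, 7});

        readProjected(file, Arrays.asList("price"));
        return file.chunksRead();
    }

    public static void main(String[] args) {
        System.out.println("chunks read: " + demoChunksRead());
    }
}
```

In this model the set of chunks read is decided by the consumer's schema, which is why a generic parse step that receives the entire file as one blob would give up that selective-read benefit.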

apyasic commented 10 years ago

Mike, Kris and Samantha, thanks for your comments and proposals! The current ParquetSink operator implementation is based on the parquet.hadoop.ParquetWriter API, which writes records to a Parquet file; it was the quickest and easiest option. I will continue to investigate the parquet-mr project to figure out the best infrastructure for a Formatter and Parser implementation. I would propose keeping the Sink and Source operators as part of the toolkit too.

petenicholls commented 9 years ago

Created repository streamsx.parquet with Alex as committer. Closing issue.