Closed — apyasic closed this issue 9 years ago
+1 for a parquet toolkit, but with format and parse operators rather than source/sink. On the whole we are trying to separate the data format from the data transport by having separate format and parse operators which can be used with any source and sink, rather than combining the format in the sink and the parse in the source.
@mikespicer Our current HDFS source and sink operators don't work well with format and parse, as they cannot write blobs.
+1... also agree we should try to write a formatter and parser. There are plans to add blob support to HDFSSource and HDFSSink.
+1 on the toolkit. I would also prefer a Formatter and Parser if possible. A Format operator might be feasible on the sink end.
However, from my quick look at Parquet, I'm not sure that's possible for the source, since the format is structured so that which parts of the file you read depends on what exactly you want. So, depending on your output type, you may only read certain blocks.
Mike, Kris, and Samantha, thanks for your comments and proposals! The current ParquetSink operator implementation is based on the parquet.hadoop.ParquetWriter API, which writes records to a Parquet file; this was the quickest/easiest option. I will continue to investigate the parquet-mr project to figure out the best infrastructure for a Formatter and Parser implementation. I would propose to keep the Sink and Source operators as part of the toolkit too.
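For readers unfamiliar with the ParquetWriter approach mentioned above, here is a minimal, hypothetical sketch of writing records through the parquet-mr Avro bindings (using the modern org.apache.parquet coordinates; the file name, schema, and field names are illustrative and not taken from the toolkit):

```java
// Hypothetical sketch of a ParquetWriter-based sink: each incoming tuple
// would be converted to an Avro GenericRecord and handed to the writer.
// Requires the parquet-avro and hadoop-client dependencies on the classpath.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetSinkSketch {
    public static void main(String[] args) throws Exception {
        // Illustrative Avro schema for the tuples the sink would receive
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Tuple\",\"fields\":[" +
            "{\"name\":\"id\",\"type\":\"long\"}," +
            "{\"name\":\"msg\",\"type\":\"string\"}]}");

        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path("records.parquet"))
                     .withSchema(schema)
                     .withCompressionCodec(CompressionCodecName.SNAPPY)
                     .build()) {
            GenericRecord rec = new GenericData.Record(schema);
            rec.put("id", 1L);
            rec.put("msg", "hello");
            writer.write(rec); // one write() call per incoming tuple
        }
    }
}
```

A Formatter operator would differ mainly in emitting the encoded bytes as a blob downstream instead of writing to a file itself, which is why blob support in HDFSSink matters for that design.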
Created repository streamsx.parquet with Alex as committer. Closing issue.
Parquet is a columnar storage format for Hadoop. Parquet is becoming increasingly popular due to its very efficient compression and encoding schemes. See the Parquet home page for more details: http://parquet.io/
The Parquet toolkit will allow streaming applications to read and write data in Parquet format. The toolkit will be implemented in Java and will contain a Sink operator in its initial version.