IBMStreams / administration

Umbrella project for the IBMStreams organization. This project will be used for the management of the individual projects within the IBMStreams organization.

Proposed addition to the streamsx.hdfs #67

Closed hildrum closed 9 years ago

hildrum commented 9 years ago

I have a new reader for HDFS files that does parallel reads of text files, splittable compressed text files, and sequence files. It does this by making use of the InputFormat interface, so we should be able to add extensions as needed to handle anything extending InputFormat. I hope to add support for parquet files in the next week or so.

One complication is that we already have a streamsx.parquet repository containing a parquet writer. The way I plan to implement the parquet reader, it will be most natural to keep it in the same operator as the text and sequence file readers, which means there would be a parquet reader in the streamsx.hdfs toolkit and not in the streamsx.parquet toolkit.

I know the usual policy is to bring up additions to a specific repository in that repository and not in administration, but I wanted to give everyone a chance to comment if they don't think streamsx.hdfs is the right place.

mikespicer commented 9 years ago

In general, keeping the formatting and parsing in separate operators allows any source or sink to read/write any format. However, I think it's OK to handle the formatting/parsing in the source/sink if there is some significant performance gain.

I think that the parquet toolkit actually includes the source/sink rather than just the format and parse logic, which is unfortunate. Could you handle your needs with parquet parse/format operators in the parquet toolkit plus a convenience composite that uses those, or do you need to have them in the same operator for some reason?

hildrum commented 9 years ago

The operator I've prototyped is built using the InputFormat interface. The operator creates an InputFormat object, configures it, and then generates input splits. For each input split, it creates a RecordReader object, which can then be used to iterate over the records in the file. With these APIs, there isn't really a way to separate the format and parse logic from the reading and writing.

The nice thing about this approach is that it makes use of the same APIs used for input to map-reduce jobs, so we get the ability to handle many formats, plus parallelism, without much work on our end.
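For reference, the pattern described above can be sketched with the standard Hadoop mapreduce APIs. This is a minimal standalone illustration of the split-then-iterate flow (not the actual operator code); it assumes hadoop-client is on the classpath and uses a local `file://` path so no cluster is needed. The class and file names are illustrative.

```java
import java.io.FileWriter;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.TaskAttemptID;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl;

public class SplitReadSketch {
    public static void main(String[] args) throws Exception {
        // Write a small local file to read back (file:// works without a cluster).
        java.io.File f = java.io.File.createTempFile("demo", ".txt");
        try (FileWriter w = new FileWriter(f)) {
            w.write("alpha\nbeta\ngamma\n");
        }

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        TextInputFormat.addInputPath(job, new Path("file://" + f.getAbsolutePath()));

        // Step 1: create and configure the InputFormat, then generate splits.
        TextInputFormat fmt = new TextInputFormat();
        List<InputSplit> splits = fmt.getSplits(job);

        // Step 2: for each split, create a RecordReader and iterate its records.
        int records = 0;
        for (InputSplit split : splits) {
            TaskAttemptContext ctx =
                new TaskAttemptContextImpl(job.getConfiguration(), new TaskAttemptID());
            try (RecordReader<LongWritable, Text> rr = fmt.createRecordReader(split, ctx)) {
                rr.initialize(split, ctx);
                while (rr.nextKeyValue()) {
                    records++;
                }
            }
        }
        System.out.println("records=" + records); // prints records=3
    }
}
```

Swapping `TextInputFormat` for any other `InputFormat` implementation (sequence files, compressed text, and so on) leaves the loop unchanged, which is what makes the single-operator design attractive; the splits are also what a parallel region would distribute across channels.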

hildrum commented 9 years ago

There were no objections, so I'm adding it to streamsx.hdfs. Pull request: https://github.com/IBMStreams/streamsx.hdfs/pull/52

chanskw commented 9 years ago

Shouldn't the usual process be that we assume consensus only when we get enough +1 votes? Silence could mean that people just haven't had a chance to look at this, which is the case for me. :) And sorry about the delayed response.

I am uneasy that the parquet writer is in the parquet toolkit and then the reader is in the HDFS toolkit. It makes it very confusing for the end-user. Why do you want to contribute it to HDFS and not the Parquet toolkit? Is it because it can support more general formats than Parquet?

If we are adding generalized format support to the HDFS toolkit, and Parquet is on the roadmap, should we combine the Parquet toolkit into the HDFS toolkit?

The implication here is that: 1) we need to move the parquet writer to the HDFS toolkit, and 2) we need to refactor the writer code to follow the design of the other operators in the HDFS toolkit, i.e. making use of our HDFS client and handling connections in a consistent manner. In my understanding, using a different interface to read data should not affect how we connect to the system.

So, my proposal is this: a) if we are adding this new operator to the HDFS toolkit and eventually supporting Parquet, then we should sunset the Parquet toolkit and move that operator into the HDFS toolkit; b) we should have a consistent way of connecting to the HDFS system. Currently, the new writer and reader connect to HDFS differently from the rest of the HDFS toolkit.

Please vote.

cancilla commented 9 years ago

+1 on Samantha's comment. A lack of response does not imply overwhelming support.

I am concerned about adding new operators to the toolkit just to support different read/write formats and mechanisms. If this functionality is required, we need to make the effort to integrate it into the existing operators. If that's not possible, then at the very least we need to ensure that the external interfaces of any new operators are consistent with the existing operators.

hildrum commented 9 years ago

The process, as I understand it and as documented in the wiki, is that new toolkits are discussed here, but additions to existing toolkits are discussed in their respective repositories. Since this was an addition to an existing toolkit, if I were following the process strictly, I would not even have created this issue.

But I thought that this might need broader comment than the streamsx.hdfs readers, so I created this issue. Since this was just a heads up on the new operator and not a repository proposal, I didn't think a vote was necessary.

ddebrunner commented 9 years ago

+1 for @hildrum asking the question here and proceeding to commit.

+1 for discussing any changes to the new HDFS operator(s?) in streamsx.hdfs project

-0.25 on changes to the parquet toolkit; I don't really have enough background knowledge, but I think it's a fact of life that there will be specialized format reader/writer operators that align better with their data source (e.g. HDFS), while there is still a need for operators that handle the format generically. It may be better to let the operators/toolkits evolve for a while before rushing to merge and deprecate. Maybe there is a need for Parquet operators that work only on the data format, which wouldn't make sense in streamsx.hdfs.

apyasic commented 9 years ago

Dear community, sorry for the delay in responding. It's true that the Parquet format is currently supported only in Hadoop. Yet there are proposals to decouple Parquet from the Hadoop API so that Parquet files can be read outside of map-reduce or Hadoop. So I tend to agree with @ddebrunner's comment about giving the streamsx.parquet toolkit a chance to evolve for a while to see whether the separation above indeed happens. Another reason to keep the toolkit separate is that it is eventually intended to support the object model converters provided by the parquet-mr GitHub project. The converters are responsible for mapping between external object models and Parquet's internal data types (examples of object models include Avro, Thrift, Protocol Buffers, Pig, etc.).

hildrum commented 9 years ago

(a) Myself, I'm neutral on merging the two toolkits. But since @apyasic is the committer on streamsx.parquet, I put a heavy weight on @apyasic's opinion. So -1 on merging the toolkits from me at this time.
(b) is entirely a streamsx.hdfs issue, and so I think should be discussed there rather than here.

chanskw commented 9 years ago

Alex, thanks for the feedback. It's good to know that there are plans to decouple Parquet from Hadoop. From the initial discussions, it was not obvious that this was going to be the direction.

If that is the case, then +1, and I fully agree that we should have a parquet toolkit project and work on decoupling the reading/writing from the parquet parsing/formatting in the future.

In light of this, I propose that when we work on the Parquet reader, it should go into the Parquet toolkit and not the HDFS toolkit. We keep the reader and writer together, and we also work toward eventually having a reader and writer that are decoupled from HDFS.

The work on text files, splittable compressed files, and sequence files should stay in the HDFS toolkit.

apyasic commented 9 years ago

Thanks Kris and Samantha. We will work to integrate Parquet Reader developed by Kris into parquet toolkit.

hildrum commented 9 years ago

To clarify, there is no parquet reader yet. If the parquet reader gets built as I plan, putting it in the parquet toolkit would involve either duplicating code between streamsx.hdfs and streamsx.parquet or making streamsx.hdfs depend on streamsx.parquet.

But I don't think it's productive to engage in an extensive discussion over it at this point. I'll be sure to bring this up again if & when I have a parquet reader ready to commit, and we can make a decision then.

apyasic commented 9 years ago

Thanks for the clarification, Kris. We should decide where the ParquetReader should go before we develop it, and then work as a team to get it going. Otherwise, we will have to do extra work later to re-integrate it with the streamsx.parquet toolkit.

chanskw commented 9 years ago

I think we agreed to put this in the HDFS toolkit on a branch. Closing and will continue discussion of integrating this feature in the HDFS toolkit.