
ZuInnoTe / hadoopoffice

HadoopOffice - Analyze Office documents using the Hadoop ecosystem (Spark/Flink/Hive)

Apache License 2.0

63 stars, 31 forks

Investigate Support for Apache Beam #21

Open: jornfranke opened 7 years ago

jornfranke commented 7 years ago

Investigate support for Apache Beam (https://beam.apache.org/) and create examples and unit tests for using HadoopOffice with it.

Apache Beam supports writing Big Data jobs once and running them on multiple platforms (e.g. Flink, Spark, Apex, Google Cloud Dataflow).

jornfranke commented 7 years ago

It seems that we can use HadoopInputFormatIO to read: https://beam.apache.org/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.html

It seems that we can use HDFSFileSink to write: https://beam.apache.org/documentation/sdks/javadoc/0.6.0/org/apache/beam/sdk/io/hdfs/HDFSFileSink.html
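For illustration, a rough sketch of the read side. As far as I understand the HadoopInputFormatIO javadoc, the transform is driven by a Hadoop configuration that names the input format, key class and value class. The sketch below only assembles those properties as a plain map so that it runs without Beam or Hadoop on the classpath. The class `BeamHadoopOfficeConfig` and the method `excelReadProperties` are hypothetical names, and the HadoopOffice class name and the key/value types (`Text`, `ArrayWritable`) are assumptions that should be checked against the HadoopOffice release in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BeamHadoopOfficeConfig {

    // Property names that HadoopInputFormatIO is documented to read from the configuration.
    static final String INPUT_FORMAT_CLASS = "mapreduce.job.inputformat.class";
    static final String KEY_CLASS = "key.class";
    static final String VALUE_CLASS = "value.class";
    // Standard Hadoop FileInputFormat property for the input path.
    static final String INPUT_DIR = "mapreduce.input.fileinputformat.inputdir";

    /**
     * Collects the properties a Beam job would copy into a Hadoop Configuration
     * before handing it to HadoopInputFormatIO.
     * The class names below are assumptions, not verified against a specific release.
     */
    static Map<String, String> excelReadProperties(String inputDir) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put(INPUT_FORMAT_CLASS,
                "org.zuinnote.hadoop.office.format.mapreduce.ExcelFileInputFormat");
        props.put(KEY_CLASS, "org.apache.hadoop.io.Text");
        props.put(VALUE_CLASS, "org.apache.hadoop.io.ArrayWritable");
        props.put(INPUT_DIR, inputDir);
        return props;
    }

    public static void main(String[] args) {
        // Print the properties in insertion order for inspection.
        excelReadProperties("/tmp/input.xlsx")
                .forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

In an actual pipeline these entries would be set on an `org.apache.hadoop.conf.Configuration` and passed to `HadoopInputFormatIO.<Text, ArrayWritable>read().withConfiguration(conf)`. Whether `ArrayWritable` values can be encoded by Beam without a value translation or a custom coder is untested here and is part of what needs investigating.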

jornfranke commented 6 years ago

The new classes are FileBasedSource for reading: https://beam.apache.org/documentation/sdks/javadoc/2.1.0/org/apache/beam/sdk/io/FileBasedSource.html and FileBasedSink for writing: https://beam.apache.org/documentation/sdks/javadoc/2.1.0/org/apache/beam/sdk/io/FileBasedSink.html