cloudera-labs / envelope

Build configuration-driven ETL pipelines on Apache Spark
Apache License 2.0
158 stars 89 forks

Support Avro Format to FileSystemOutput #19

Closed ghost closed 6 years ago

ghost commented 6 years ago

Changed FileSystemOutput.java so that it can write files in Avro format, and updated the simple filesystem example so that it runs with the steps given in readme.md.

jeremybeard commented 6 years ago

Hi Prashanth,

For the time being we're not adding compile dependencies that are outside of the Cloudera stack. Envelope sticks to Cloudera-provided dependencies because Cloudera has already tested that they all work together for a common CDH version. If we add third-party dependencies then Envelope would need to take on that responsibility, which is too much work for us right now. For example, if we added this and then upgraded the Spark version, it would be hard to know whether the third-party spark-avro version we were using was still compatible. When Cloudera adds support for spark-avro to Spark 2.x then we'll add this dependency back in.

Another option for you, if you want to use Avro, is to build your own output: make your own jar that uses the dependency, add it to the Spark execution using --jars, and then provide the output class name as the 'type' of the output in the Envelope pipeline configuration.
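
As a rough sketch of that last step, the output section of a pipeline step might look something like the fragment below. The step name `write_avro`, the class name `com.example.envelope.AvroFileSystemOutput`, and the `path` key are hypothetical placeholders for illustration, not names that exist in Envelope; the only part taken from the description above is that `type` carries the class name of the custom output.

```
steps {
  write_avro {
    # Other settings for this step are omitted here.
    output {
      # Assumption: fully qualified name of the custom output class
      # packaged in your own jar (hypothetical name).
      type = com.example.envelope.AvroFileSystemOutput
      # Assumption: whatever settings your own class chooses to read.
      path = "hdfs:///tmp/example-output"
    }
  }
}
```

The jar that contains this class, along with the spark-avro dependency it needs, would then be passed to the Spark submit command with --jars so that the class can be found at runtime.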

ghost commented 6 years ago

Thanks for the reply! That works for me. Could you please change the example to use Parquet? That might be helpful. Even the Spark 1.6 applications used the Databricks libraries: https://www.cloudera.com/documentation/enterprise/5-7-x/topics/spark_avro.html#avro