Cargill / pipewrench

Data pipeline automation tool
Apache License 2.0

Full local docker execution of pipelines #15

Closed by afoerster 6 years ago

afoerster commented 6 years ago

Currently, pipelines need to be tested on a cluster. We should be able to test Kudu/HDFS/Impala inside a Docker container. This will speed up pipeline development and make it easier to verify bug fixes.

brockn commented 6 years ago

This is what StreamSets uses internally for the same use case: https://github.com/clusterdock/topology_cdh

SandishKumarHN commented 6 years ago

@brockn Which is better: using Docker directly or clusterdock's topology_cdh?

brockn commented 6 years ago

I'd look at using topology_cdh. As you know, it's a ton of work to set up a CDH cluster inside Docker yourself. While I think that is a useful experience and cool to have seen done, I'd go with topology_cdh in the future. It's heavily used within StreamSets and also at Cloudera.

afoerster commented 6 years ago

An advantage of using ClusterDock is that you don't have to maintain an image yourself, which is no small thing given how many dependencies are needed.

afoerster commented 6 years ago

Complete. Opening a new issue for the one pipeline that still needs a test: kudu-hdfs-parquet-sqoop.