coursera / dataduct

DataPipeline for humans.

Worker group support for dataduct steps #182

Closed cliu587 closed 8 years ago

cliu587 commented 8 years ago

Add support for worker groups, specified via configs.

Currently we do not allow a mixed workflow where some steps are run via worker groups while others are run via Data Pipeline instance/cluster management. This can be added if required.
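
To illustrate the restriction, here is a minimal sketch of an all-or-nothing check. It is not dataduct's actual implementation: the `worker_group` key, the step dict layout, and the function name `check_not_mixed` are all hypothetical, chosen only to show the rule.

```python
def check_not_mixed(steps):
    """Reject pipelines that mix worker-group steps with managed-resource steps.

    `steps` is a list of dicts with a "name" key. A step opts into a worker
    group through a hypothetical "worker_group" key (an assumption made for
    this sketch, not dataduct's confirmed config schema).

    Returns True if every step uses a worker group, False if none do.
    """
    with_group = [s["name"] for s in steps if s.get("worker_group")]
    without_group = [s["name"] for s in steps if not s.get("worker_group")]

    # Either list may be empty, but not both non-empty at once.
    if with_group and without_group:
        raise ValueError(
            "mixed workflow not supported: %s use worker groups, %s do not"
            % (with_group, without_group)
        )
    return bool(with_group)
```

Under these assumptions, a pipeline either sets a worker group on every step or on none of them, and anything in between fails fast at validation time.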

sb2nov commented 8 years ago

LGTM

Yeah, I agree that we can wait on mixed workflows until a bit later, once we have ironed out more parts of this.

My worry with people using worker groups in EMR is that intermediate results written to HDFS must be cleaned up correctly on failure. This already happens reliably for the staging directories, but for intermediate steps it is something people need to be more watchful of.