coursera / dataduct

DataPipeline for humans.

Allow setting stage to false for transform steps that don't use input or output nodes #179

Closed cliu587 closed 8 years ago

cliu587 commented 8 years ago

PTAL @sb2nov, @darinyu-coursera. Will land after the new load_reload_pk step is tested with this diff.

cliu587 commented 8 years ago

@sb2nov do you know why we set output_node=base_output_node at https://github.com/coursera/dataduct/pull/179/files#diff-59074e91ee415f9f629abf53692c99b4L114?

It seems self._output is potentially different from base_output_node, as per the computation in L103.

sb2nov commented 8 years ago

If we used self._output, it would create multiple staging directories. What we want instead is a single staging directory that gets mapped to multiple nodes based on subdirectories. That way the command doesn't need to figure out which staging directory maps to which output, and the setup is easier to manage.
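
To illustrate the idea, here is a minimal sketch of mapping several named outputs onto subdirectories of one shared staging directory. It is not dataduct's actual implementation: the function name `map_outputs_to_subdirs`, the bucket path, and the output names are all hypothetical and chosen only for illustration.

```python
import posixpath


def map_outputs_to_subdirs(base_output_dir, output_names):
    """Map each named output to a subdirectory of one shared staging directory.

    Hypothetical helper: it only shows the layout described above and is
    not part of dataduct.
    """
    return {
        name: posixpath.join(base_output_dir, name)
        for name in output_names
    }


# Assumed example values; the real paths depend on the pipeline config.
base = "s3://bucket/staging/step_1"
paths = map_outputs_to_subdirs(base, ["users", "orders"])

for name, path in sorted(paths.items()):
    print(name, path)
```

Because every output lives under the same parent, the command only needs to know the one staging directory and can write each result into the matching subdirectory.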

sb2nov commented 8 years ago

Let me know if you want more details on it.

sb2nov commented 8 years ago

LGTM though