d6t / d6tflow

Python library for building highly effective data science workflows
https://d6tflow.readthedocs.io/en/latest/
MIT License

Potential Typo in the Docs - Define Upstream Dependency Tasks #39

Open DOH-Manada opened 3 years ago

DOH-Manada commented 3 years ago

https://d6tflow.readthedocs.io/en/latest/tasks.html

The following code defines a single-output task and uses it as a dependency of other tasks. However, TaskSingleOutput1 and TaskSingleOutput2 are not defined anywhere on this page.

# quick save one output
class TaskSingleOutput(d6tflow.tasks.TaskPqPandas):

    def run(self):
        self.save(data_output)

# no dependency
class TaskSingleInput(d6tflow.tasks.TaskPqPandas):
    #[...]

# single dependency
@d6tflow.requires(TaskSingleOutput)
class TaskSingleInput(d6tflow.tasks.TaskPqPandas):
    #[...]

# multiple dependencies
@d6tflow.requires({'input1':TaskSingleOutput1, 'input2':TaskSingleOutput2})
class TaskMultipleInput(d6tflow.tasks.TaskPqPandas):
    #[...]

Also, the docs should make clear, with an example like the one below, where each key comes from: the parent (outer) keys are the ones defined in the dependency call, and the child (inner) keys are the labels listed in the upstream task's persist.

# multiple dependencies, single & multiple outputs
@d6tflow.requires({'input1':TaskSingleOutput, 'input2':TaskMultipleOutput})
class TaskMultipleInput(d6tflow.tasks.TaskPqPandas):
    def run(self):
        data = self.inputLoad(as_dict=True)
        data1a = data['input1'] # We reference the key defined in the dependency call
        data2a = data['input2']['output1']  # 'output1' is a persist label defined in TaskMultipleOutput
        data2b = data['input2']['output2']
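
To make the key relationship concrete, here is a rough plain-Python sketch that mimics the shape of the nested dictionary indexed above. It does not import d6tflow and is not how the library is implemented: load_inputs, PERSIST and REQUIRES are hypothetical stand-ins, strings stand in for dataframes, and the 'data' label for a single-output task is an assumption made for illustration.

```python
# Hypothetical stand-in for the persist labels of each upstream task.
# A single-output task is modelled here with the one-element list ['data'] (assumption).
PERSIST = {
    'TaskSingleOutput': ['data'],
    'TaskMultipleOutput': ['output1', 'output2'],
}

# Mirrors the dict passed to @d6tflow.requires in the example above:
# these are the parent (outer) keys.
REQUIRES = {'input1': 'TaskSingleOutput', 'input2': 'TaskMultipleOutput'}


def load_inputs(requires, persist):
    """Build a nested dict shaped like the one the example above indexes into."""
    loaded = {}
    for parent_key, task_name in requires.items():
        labels = persist[task_name]
        if len(labels) == 1:
            # Single output: the parent key maps straight to the data.
            loaded[parent_key] = f'{task_name}:{labels[0]}'
        else:
            # Multiple outputs: the parent key maps to a dict keyed by persist label.
            loaded[parent_key] = {label: f'{task_name}:{label}' for label in labels}
    return loaded


data = load_inputs(REQUIRES, PERSIST)
data1a = data['input1']             # parent key only (single output)
data2a = data['input2']['output1']  # parent key, then persist label
data2b = data['input2']['output2']
```

In this sketch, renaming a key in REQUIRES changes only the first index, while renaming a label in PERSIST changes only the second, which is the distinction the docs could spell out.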