Closed — OlivierBinette closed this issue 6 months ago
Note: this definition of a pipeline is for a series of data-transformation steps. This is different from sklearn, where pipelines are used to compose estimators.
Also, a robust implementation of my idea would require checking that the dependency graph is a DAG, maybe getting a topological ordering, and better managing step names and their uniqueness. It's not necessarily too hard to do, especially using Python's built-in `graphlib` package, but it's something to consider in terms of the complexity of the solution.
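For reference, the standard library's `graphlib` module (Python 3.9+) handles both concerns at once: `TopologicalSorter.static_order()` yields a valid ordering, and raises `CycleError` if the graph is not a DAG. A small sketch (the step names here are made up for illustration):

```python
from graphlib import TopologicalSorter, CycleError

# Each key maps to the set of steps it depends on.
deps = {
    "clean": {"load"},
    "features": {"clean"},
    "model": {"features", "clean"},
}

try:
    # static_order() yields dependencies before dependents,
    # and includes nodes (like "load") that only appear as values.
    order = list(TopologicalSorter(deps).static_order())
    print(order)
except CycleError as e:
    print("dependency graph is not a DAG:", e.args[1])
```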
This should happen, but I don't think it should be in mismo. Task orchestration is general enough that it should go in its own module. But mismo should be designed in a way that makes it easy to use FROM this orchestration framework.
FYI, I have my own implementation of something similar that wraps pytask; take a look:
Should there be the equivalent of sklearn pipelines?
To be honest, I'm not a huge fan of how sklearn pipelines are implemented. They're hard to use, and you can't have one step refer to previous steps in the pipeline. Here's how I implemented a basic pipeline prototype for a previous project:
We have steps, i.e. functions with a specified set of dependencies (here named `inputs`). There are two ways to execute steps: a `compute` function and a `cache` dictionary. When using the `compute` function, step dependencies will be computed recursively, using relevant content of the `cache` as inputs.

Note: currently, step names in any dependency graph should be unique. Ideally we'd create UUIDs for steps.
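The original snippet didn't survive in this thread, but the idea could be sketched roughly like this (the `Step` class and `step` decorator below are my own guesses at the shape of the prototype, not its actual code):

```python
class Step:
    """A function plus a specified set of dependencies (`inputs`)."""

    def __init__(self, func, inputs=()):
        self.func = func
        self.inputs = tuple(inputs)
        self.name = func.__name__  # NOTE: assumed unique within a graph

    def __call__(self, *args):
        # Direct execution: the caller supplies the inputs.
        return self.func(*args)

    def compute(self, cache):
        # Recursive execution: dependencies are computed first and
        # memoized in `cache` under each step's name.
        if self.name not in cache:
            args = [dep.compute(cache) for dep in self.inputs]
            cache[self.name] = self.func(*args)
        return cache[self.name]


def step(inputs=()):
    """Decorator attaching dependencies to a function."""
    def decorator(func):
        return Step(func, inputs)
    return decorator


@step()
def load():
    return [1, 2, 3]

@step(inputs=[load])
def total(xs):
    return sum(xs)

cache = {}
print(total.compute(cache))  # 6
print(sorted(cache))         # ['load', 'total']
```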
And we have Pipelines, i.e. series of steps. Steps that are explicitly listed in the pipeline are fed to callback functions, e.g. for persisting results or logging progress:
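Again, the code block is missing here, so the following is only a sketch of what such a `Pipeline` might look like; the minimal `Step` class is repeated so the snippet is self-contained, and the callback signature `(name, result)` is an assumption:

```python
class Step:
    """A function plus its dependencies; results are memoized by name."""

    def __init__(self, func, inputs=()):
        self.func, self.inputs, self.name = func, tuple(inputs), func.__name__

    def compute(self, cache):
        if self.name not in cache:
            cache[self.name] = self.func(*(d.compute(cache) for d in self.inputs))
        return cache[self.name]


class Pipeline:
    """A series of steps; each listed step's result is fed to the callbacks."""

    def __init__(self, steps, callbacks=()):
        self.steps = list(steps)
        self.callbacks = list(callbacks)

    def run(self, cache=None):
        cache = {} if cache is None else cache
        for s in self.steps:
            result = s.compute(cache)
            for callback in self.callbacks:
                callback(s.name, result)  # e.g. persist or log
        return cache


# Example: collect (name, result) pairs as each listed step completes.
def load():
    return [1, 2, 3]

def total(xs):
    return sum(xs)

load_step = Step(load)
total_step = Step(total, inputs=[load_step])

log = []
Pipeline([load_step, total_step],
         callbacks=[lambda name, r: log.append((name, r))]).run()
print(log)  # [('load', [1, 2, 3]), ('total', 6)]
```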
Example
Here are some steps:
You can use step functions directly:
Or you can compute steps using the dependency graph:
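The example code itself was lost in extraction; a hedged reconstruction of what it might have shown, covering both execution modes described above (all names and the `step` decorator are assumptions):

```python
class Step:
    def __init__(self, func, inputs=()):
        self.func, self.inputs, self.name = func, tuple(inputs), func.__name__

    def __call__(self, *args):
        return self.func(*args)

    def compute(self, cache):
        if self.name not in cache:
            cache[self.name] = self.func(*(d.compute(cache) for d in self.inputs))
        return cache[self.name]


def step(inputs=()):
    return lambda func: Step(func, inputs)


# Some steps:
@step()
def load():
    return [1, 2, 3, 4]

@step(inputs=[load])
def evens(xs):
    return [x for x in xs if x % 2 == 0]

@step(inputs=[evens])
def total(xs):
    return sum(xs)

# Using the step functions directly, supplying inputs by hand:
print(evens([1, 2, 3, 4]))   # [2, 4]

# Or computing through the dependency graph, memoizing in `cache`:
cache = {}
print(total.compute(cache))  # 6
print(sorted(cache))         # ['evens', 'load', 'total']
```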
Full implementation:
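The linked full implementation is also missing from this copy of the thread. As a stand-in, here is a self-contained sketch that puts the described pieces together, under my own assumptions about the design (a `Step` class, a `step` decorator, and a `Pipeline` whose callbacks see only explicitly listed steps):

```python
class Step:
    """A function plus its named dependencies (`inputs`)."""

    def __init__(self, func, inputs=()):
        self.func = func
        self.inputs = tuple(inputs)
        self.name = func.__name__  # NOTE: names must be unique per graph

    def __call__(self, *args):
        # Direct execution: the caller supplies all inputs.
        return self.func(*args)

    def compute(self, cache):
        # Graph execution: dependencies are resolved recursively and
        # memoized in `cache` under each step's name.
        if self.name not in cache:
            cache[self.name] = self.func(*(d.compute(cache) for d in self.inputs))
        return cache[self.name]


def step(inputs=()):
    """Decorator turning a function into a Step with the given dependencies."""
    def decorator(func):
        return Step(func, inputs)
    return decorator


class Pipeline:
    """A series of steps. Only explicitly listed steps are fed to the
    callbacks, e.g. for persisting results or logging progress."""

    def __init__(self, steps, callbacks=()):
        self.steps = list(steps)
        self.callbacks = list(callbacks)

    def run(self, cache=None):
        cache = {} if cache is None else cache
        for s in self.steps:
            result = s.compute(cache)
            for callback in self.callbacks:
                callback(s.name, result)
        return cache


# Demo: `load` is computed as a dependency but only `total` is listed,
# so only `total` reaches the callback.
@step()
def load():
    return [1, 2, 3]

@step(inputs=[load])
def total(xs):
    return sum(xs)

seen = []
results = Pipeline([total], callbacks=[lambda name, r: seen.append((name, r))]).run()
print(seen)     # [('total', 6)]
print(results)  # {'load': [1, 2, 3], 'total': 6}
```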