Controlling Data Flowing through Pipelines

TomAugspurger commented 6 years ago

There are a few types of estimators w.r.t. how much data they need to see before "being trained."

Stateless: No training is required.
Incremental: Can be trained on batches; additional examples update the learned parameters.
Full pass: Must see "all" the data before it is trained.

Given a pipeline with a mixture of these, we have a choice of how data should flow through the pipeline. For stateless and "full pass" estimators the choice is already made for us: a stateless step needs no training, and a full-pass step must see all the data before anything downstream can use it. With an incremental estimator, we can either train on the entire dataset before moving to the next estimator, or we can do things in blocks, passing each block through every step in turn. The "correct" choice will depend on a few things. In an out-of-core context, doing things blockwise will minimize IO, as each block can be loaded once and passed through the whole pipeline.
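To make the two data-flow options concrete, here is a minimal pure-Python sketch. Everything in it is hypothetical and only for illustration: `RunningMeanCenterer` and `RunningMaxAbs` are toy stand-ins for incremental estimators with a `partial_fit` method, and the loader just counts simulated block reads. It is not how dask-ml implements anything.

```python
from typing import Callable, Dict, Iterator, List

Block = List[float]


class RunningMeanCenterer:
    """Hypothetical incremental transformer: tracks a running mean and subtracts it."""

    def __init__(self) -> None:
        self.n = 0
        self.total = 0.0

    def partial_fit(self, block: Block) -> "RunningMeanCenterer":
        self.n += len(block)
        self.total += sum(block)
        return self

    def transform(self, block: Block) -> Block:
        mean = self.total / self.n
        return [x - mean for x in block]


class RunningMaxAbs:
    """Hypothetical incremental estimator: remembers the largest absolute value seen."""

    def __init__(self) -> None:
        self.max_abs = 0.0

    def partial_fit(self, block: Block) -> "RunningMaxAbs":
        self.max_abs = max([self.max_abs] + [abs(x) for x in block])
        return self


def make_loader(blocks: List[Block], counter: Dict[str, int]) -> Callable[[], Iterator[Block]]:
    """Return a function that yields blocks and counts each simulated read."""

    def load() -> Iterator[Block]:
        for block in blocks:
            counter["loads"] += 1
            yield block

    return load


def fit_stagewise(load, first, second) -> None:
    # Train the first step on the whole dataset, then re-read it for the second.
    for block in load():
        first.partial_fit(block)
    for block in load():
        second.partial_fit(first.transform(block))


def fit_blockwise(load, first, second) -> None:
    # Load each block once and push it through every step before the next read.
    for block in load():
        first.partial_fit(block)
        second.partial_fit(first.transform(block))


blocks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

stage_counter = {"loads": 0}
fit_stagewise(make_loader(blocks, stage_counter), RunningMeanCenterer(), RunningMaxAbs())
stage_loads = stage_counter["loads"]

block_counter = {"loads": 0}
fit_blockwise(make_loader(blocks, block_counter), RunningMeanCenterer(), RunningMaxAbs())
block_loads = block_counter["loads"]

print(stage_loads, block_loads)
```

The blockwise version reads each block once instead of once per step. The cost is that the second step is trained on output from a first step that has only seen part of the data, which is one of the things the "correct" choice depends on.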

mrocklin commented 6 years ago

We also discussed how best to combine this with hyperparameter search metaestimators. There is a tradeoff between combining multiple incremental-pass computations (which share I/O nicely) and not running too many of them at once (so that we can get a first result quickly). When there is more work than cores, it's not clear how to make this tradeoff well.

This may be more of a research question and thus beyond our scope starting out, but is something to keep in mind.
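A rough way to picture the I/O-sharing side of this tradeoff, as a pure-Python sketch: several candidate models are trained on each block as it is loaded. `OnlineMean`, `search`, and the `group_size` knob are all hypothetical names made up for this illustration, not part of any library.

```python
from typing import List, Tuple

Block = List[float]


class OnlineMean:
    """Hypothetical incremental model: a moving estimate with step size `lr`."""

    def __init__(self, lr: float) -> None:
        self.lr = lr
        self.value = 0.0

    def partial_fit(self, block: Block) -> "OnlineMean":
        for x in block:
            self.value += self.lr * (x - self.value)
        return self


def search(blocks: List[Block], learning_rates: List[float],
           group_size: int) -> Tuple[List[OnlineMean], int]:
    """Train one model per learning rate, `group_size` models per data pass.

    Returns the fitted models and the number of simulated block reads.
    """
    loads = 0
    fitted: List[OnlineMean] = []
    for start in range(0, len(learning_rates), group_size):
        group = [OnlineMean(lr) for lr in learning_rates[start:start + group_size]]
        for block in blocks:
            loads += 1  # one simulated read, shared by every model in the group
            for model in group:
                model.partial_fit(block)
        fitted.extend(group)  # this group is finished after one full pass
    return fitted, loads


blocks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
learning_rates = [0.5, 0.25, 0.125, 0.0625]

one_at_a_time, loads_single = search(blocks, learning_rates, group_size=1)
all_together, loads_shared = search(blocks, learning_rates, group_size=4)

print(loads_single, loads_shared)
```

With `group_size=1` the first model is done after the least compute per pass, but the data is read once per candidate. With `group_size=4` the data is read once in total, but no candidate finishes until every candidate has processed every block. The fitted models are the same either way; only the scheduling differs.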

stsievert commented 6 years ago

There's another, more basic advantage to implementing a pipeline in Dask: identical function calls on the same inputs do not need to be re-computed. This is advantageous in a hyperparameter search, especially when the initial steps are expensive (e.g., with a CountVectorizer which computes n-grams).
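A small sketch of that sharing using `dask.delayed`. The `vectorize` and `fit_score` functions are hypothetical stand-ins (a word counter instead of a real CountVectorizer, and a trivial "score"). The sketch assumes the expensive step is marked `pure=True`, so that identical calls get the same task key and are merged when the results are computed together.

```python
import dask
from dask import delayed

# Counts how many times the "expensive" step actually runs.
calls = {"vectorize": 0}


def vectorize(docs):
    """Hypothetical expensive first step: here it just counts words per document."""
    calls["vectorize"] += 1
    return [len(d.split()) for d in docs]


def fit_score(features, scale):
    """Hypothetical downstream step that depends on a hyperparameter `scale`."""
    return sum(features) * scale


docs = ["a b c", "d e"]

# pure=True gives identical calls a deterministic key, so they map to one task.
pure_vectorize = delayed(vectorize, pure=True)

# One small "pipeline" per hyperparameter value, each starting with the same step.
results = [delayed(fit_score)(pure_vectorize(docs), scale) for scale in (1, 2, 3)]

# Computing everything in one call lets Dask merge the shared task.
scores = dask.compute(*results, scheduler="synchronous")

print(scores, calls["vectorize"])
```

Without `pure=True`, each `delayed` call gets its own unique key by default, so the first step would run once per hyperparameter value. Computing the results in separate `compute` calls would also lose the sharing, since each call builds and runs its own graph.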

TomAugspurger commented 4 years ago

cc @mmccarty.

dask / dask-ml

Controlling Data Flowing through Pipelines #192