TomAugspurger opened this issue 6 years ago
We also discussed how best to combine this with hyperparameter search meta-estimators. There is a tradeoff between combining multiple incremental-pass computations, which share I/O nicely, and not launching too many at once, so that we can get a first result quickly. When there is more work than cores, it's not clear how to make this tradeoff well.
This may be more of a research question and thus beyond our scope starting out, but is something to keep in mind.
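To illustrate the shared-I/O side of that tradeoff, here is a minimal sketch (plain scikit-learn, with a random stand-in for the block loading): several hyperparameter candidates are trained incrementally against a single shared pass over the blocks.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

candidates = [SGDClassifier(alpha=a) for a in (1e-4, 1e-3, 1e-2)]

rng = np.random.default_rng(0)
for _ in range(10):                      # one shared pass over the data
    X = rng.normal(size=(64, 5))         # stand-in for a block loaded from disk
    y = (X[:, 0] > 0).astype(int)
    for clf in candidates:               # every candidate sees each block,
        clf.partial_fit(X, y, classes=[0, 1])  # so the I/O cost is paid once
```

The opposite extreme, fitting the candidates one after another, re-reads every block once per candidate but produces the first fitted model (and score) much sooner.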
There's another, more basic advantage to implementing a pipeline in Dask: shared intermediate results are computed only once, because identical tasks appear only once in the task graph. This is advantageous in a hyperparameter search, especially when the initial steps are expensive (e.g., a CountVectorizer, which computes n-grams).
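For example, here is a minimal sketch (hypothetical helpers, not dask-ml's API) using `dask.delayed`: because every candidate hangs off the same delayed node, the expensive vectorization is a single task in the graph and runs only once.

```python
from dask import delayed, compute
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

@delayed
def vectorize(texts):
    # Expensive initial step: build the n-gram vocabulary and transform.
    return CountVectorizer(ngram_range=(1, 2)).fit_transform(texts)

@delayed
def fit_and_score(X, y, C):
    clf = LogisticRegression(C=C).fit(X, y)
    return C, clf.score(X, y)

texts = ["the cat sat", "a dog ran", "cats chase dogs", "dogs chase cats"] * 5
y = [0, 1, 0, 1] * 5

X = vectorize(texts)  # one shared task, reused by every candidate below
results = [fit_and_score(X, y, C) for C in (0.1, 1.0, 10.0)]

print(compute(*results))  # the vectorizer runs once, not once per candidate
```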
cc @mmccarty.
There are a few types of estimators with respect to how much data they need to see before "being trained":

- Stateless estimators, which need no training at all (e.g., `HashingVectorizer`).
- Incremental estimators, which can be trained block by block via `partial_fit` (e.g., `SGDClassifier`).
- "Full pass" estimators, which must see the entire dataset in a single `fit` before they can transform or predict (most scikit-learn estimators).
Given a pipeline with a mixture of these, we have a choice of how data should flow through the pipeline. With stateless and "full pass" estimators it doesn't really matter; our choice is already made. With an incremental estimator, we can either train it on the entire dataset before moving to the next estimator, or pass the data through in blocks. The "correct" choice will depend on a few things. In an out-of-core context, doing things blockwise will minimize I/O, since each block can be loaded once and passed through the whole pipeline, as in the sketch below.
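To make the blockwise option concrete, here is a minimal sketch (not dask-ml's API; `iter_blocks` is a hypothetical stand-in for out-of-core block loading). Each block is loaded once, transformed by the stateless step, and fed to the incremental step via `partial_fit`.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

def iter_blocks(n_blocks=5, block_size=32):
    # Hypothetical stand-in for loading blocks from disk (or dask partitions).
    rng = np.random.default_rng(0)
    for _ in range(n_blocks):
        labels = rng.integers(0, 2, size=block_size)
        texts = [f"sample text {'spam' if l else 'ham'}" for l in labels]
        yield texts, labels

vectorizer = HashingVectorizer()  # stateless: transform needs no fit
clf = SGDClassifier()             # incremental: supports partial_fit

for texts, labels in iter_blocks():
    X = vectorizer.transform(texts)             # blockwise transform
    clf.partial_fit(X, labels, classes=[0, 1])  # blockwise training
```

Each raw block is read from storage exactly once; the alternative, training each stage on the full dataset before moving to the next, would re-read the data once per incremental stage.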