HDI-Project / ATM

Auto Tune Models - A multi-tenant, multi-data system for automated machine learning (model selection and tuning).
https://hdi-project.github.io/ATM/
MIT License

Separating repeated processing from classifier models #70

Open kkarrancsu opened 6 years ago

kkarrancsu commented 6 years ago

Between different runs of ATM, the outputs of all the steps of the pipeline are "static," except for the input and output of the classifier chosen by BTB. For example, if PCA is in the pipeline, then every time ATM/BTB chooses a new model to run, it recomputes the PCA on the same dataset. Unless I'm misunderstanding the flow of data, this seems inefficient. Although the current pipeline is pretty simple (scaling/PCA), people may want to add more computationally intensive elements to it.

We could split the pipeline in two: a "static" pipeline, whose outputs are stored to disk so they can be recalled between runs, and a "dynamic" pipeline, which is essentially the classifier plus any blocks that change based on the ATM/BTB model being run.
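To illustrate the split, here is a minimal sketch using scikit-learn. This is not ATM's actual code; the pipeline contents and hyperparameter values are placeholders. The point is that the static part is fit and applied once, while only the dynamic classifier part changes per run:

```python
# Hypothetical sketch: split a pipeline into a "static" preprocessing part
# (fit once per dataset) and a "dynamic" classifier part (swapped out on
# every ATM/BTB run). Names and steps are illustrative, not ATM's API.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Static part: scaling + PCA, independent of the chosen classifier.
static = Pipeline([("scale", StandardScaler()), ("pca", PCA(n_components=5))])
X_static = static.fit_transform(X)  # computed once, reusable across runs

# Dynamic part: only this changes between proposed classifiers.
# The C values stand in for hyperparameters a tuner like BTB would propose.
for C in (0.1, 1.0, 10.0):
    clf = LogisticRegression(C=C).fit(X_static, y)
    print(C, clf.score(X_static, y))
```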

If you think this is a good idea, how do we want to go about architecting it from a software perspective? One approach is to compute the static pipeline before the test_classifier method is run and save its output to the data directory where the train/test dataset is saved.
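That caching idea could look something like the following. This is only a sketch under assumptions: the file layout, the `load_or_compute_static` helper, and the pipeline steps are all hypothetical, not ATM's actual architecture.

```python
# Hypothetical sketch: compute the static pipeline output once and cache it
# on disk next to the dataset, so later classifier runs load it instead of
# recomputing. The helper name and cache layout are assumptions.
import os
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def load_or_compute_static(X, cache_path):
    """Return cached transformed data if present; otherwise compute and save it."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    static = Pipeline([("scale", StandardScaler()), ("pca", PCA(n_components=5))])
    X_t = static.fit_transform(X)
    np.save(cache_path, X_t)
    return X_t


X, y = make_classification(n_samples=100, n_features=20, random_state=0)
cache_dir = tempfile.mkdtemp()
cache = os.path.join(cache_dir, "static_features.npy")

first = load_or_compute_static(X, cache)   # computes and writes the cache
second = load_or_compute_static(X, cache)  # loads from disk instead
assert np.array_equal(first, second)
```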

bcyphers commented 6 years ago

Good points.

There actually was architecture for this in earlier versions of ATM: if PCA was part of the pipeline, it would be computed first, the intermediate data representation would be saved to disk, and the cached version would be loaded later on. I removed that function during a big refactor because it was complicating other parts of the code, and caching just the PCA didn't give us much speedup.

I do think it makes sense to do this down the line, but only if we add other static preprocessing steps (as you mentioned in #71). Until then, I think it's premature optimization to build in the caching infrastructure.

micahjsmith commented 5 years ago

This could be implemented as a feature of MLBlocks (#113) and should wait on the resolution of that issue.