EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

Possible speed up at large data sets #87

Closed zhangyingmath closed 6 years ago

zhangyingmath commented 8 years ago

Hi Randal,

We spoke briefly after the Data Philly Meetup at SIG on Feb 18, and I really like your talk. First of all, thank you for your nice talk!

During your talk, you mentioned that currently TPOT could be slow for large data sets, and we were speaking about a possible way to speed it up. Here is the rough idea:

Let's say you have a large set of data,

  1. As you start the pipeline/model selection process, begin with relatively simple pipelines/models, using only small, random subsets of the data for fitting and testing. This serves as a first, somewhat heuristic weed-out round.

Additional twist 1: because you are only looking at small subsets, doing something like "repeat the same process five times and average the results" (like five-fold validation, except that you don't need to be as careful about the folds -- any five random subsets of the data will do) still wouldn't get too expensive.

If there are any simple pipelines/models that look reasonably good, there are two possibilities to proceed:

i) try them on a larger subset/the entire data set. If they look really successful, we have a winner.

ii) if the pipelines/models are OK but still not satisfactory, then perhaps they suggest a certain pattern in the pipelines, or a certain type of model, that can be a jump start for more complicated models. You can then choose a pipeline/model with higher complexity along the same lines, choose a bigger subset size, and iterate the process.

Additional twist 2: as the subsets grow larger, the "multiple subsets, average out" process may become too expensive, so we may skip it.

Additional twist 3: in the model-fitting stage, suppose there is something like gradient descent; then, when you are looking at simple pipelines/models and small subsets, it may be OK to start with a relatively big step size. As the subsets grow, you may choose smaller step sizes, and so on.

  2. Now, if any of the simple pipelines/models don't look promising at a small subset size, we may choose to run a few more iterations before weeding them out completely, so that we don't prune them prematurely. In any case, finding a globally optimal pipeline/model is hard, so it may be OK to settle for a local optimum along the search tree.
  3. If the final pipeline/model is too complicated, trim it down. We may arrive at a simple model different from any of the simple ones we tried earlier.

So that's the rough sketch. Since I am not very familiar with TPOT, I may have said things that are untrue or that you have already done; it's just a rough thought. The idea is for TPOT to try out pipelines/models more like a human would: start with simple models and a few heuristic experiments; if we think we are on the right track, try more complicated hypotheses and more careful experiments to see if things improve; and if we missed the right path from the start but finally reached something that works yet looks horribly complicated, trim it down and look for its simplified essence.
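The staged weed-out described above might be sketched roughly as follows. This is illustrative pseudocode-style Python, not TPOT internals: `staged_search`, `score_fn`, and all parameter values are made-up names for the sake of the sketch.

```python
import random

def staged_search(candidates, data, score_fn,
                  start_frac=0.01, survivors_per_round=0.5, repeats=5):
    """Evaluate candidates on growing random subsets, pruning the weakest
    each round, per the staged weed-out idea."""
    frac = start_frac
    pool = list(candidates)
    while len(pool) > 1 and frac < 1.0:
        scores = {}
        for cand in pool:
            # Twist 1: average over several small random subsets while cheap;
            # Twist 2: skip the averaging once the subsets get large.
            n_reps = repeats if frac <= 0.1 else 1
            runs = []
            for _ in range(n_reps):
                subset = random.sample(data, max(1, int(frac * len(data))))
                runs.append(score_fn(cand, subset))
            scores[cand] = sum(runs) / len(runs)
        # Keep the better fraction and move on to a larger subset.
        keep = max(1, int(len(pool) * survivors_per_round))
        pool = sorted(pool, key=scores.get, reverse=True)[:keep]
        frac = min(1.0, frac * 4)
    return pool
```

The survivors of the final round would then be re-fit on the full data (point i above); the trimming-down step (point 3) is not modeled here.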

I haven't implemented any of this, and conceivably it might work well on long data, but I don't know how it would play out on wide data.

Thanks very much,

Ying Zhang

rhiever commented 8 years ago

Thank you for writing this up, Ying. This is definitely an idea that we'll have to experiment with.

jni commented 8 years ago

Would it make sense to train/evaluate each individual in the population with a different subset of a large dataset every time? I have a dataset of ~4M rows and TPOT appears to be kinda useless in this scenario... =)

Sorry if my question is naive; I don't have much experience with evolutionary algorithms.

rhiever commented 8 years ago

4M rows is definitely too large for TPOT right now until we figure out a method like the one discussed in this issue for training on subsets. It would even take a long time to train one model on 4M rows, so you can imagine that a tool that trains many pipelines would be very slow on 4M rows.

One option for you is to try to reduce the number of rows by removing duplicates and other data reduction methods. Otherwise any sort of model or pipeline optimization technique will not really be feasible for your data set unless you have a lot of parallelized computation power.
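As a trivial illustration of the row-reduction suggestion, exact-duplicate removal needs nothing beyond the standard library (`dedupe_rows` and the sample rows are made up for this sketch):

```python
# Remove exact duplicate rows while preserving first-seen order.
def dedupe_rows(rows):
    seen = set()
    out = []
    for row in rows:
        key = tuple(row)  # rows must be convertible to a hashable key
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [[1, "a"], [2, "b"], [1, "a"], [3, "c"], [2, "b"]]
print(dedupe_rows(rows))  # [[1, 'a'], [2, 'b'], [3, 'c']]
```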


jni commented 8 years ago

unless you have a lot of parallelized computation power

Which I do. ;) But I imagine that would require a lot of concerted software-engineering effort to get working. Nevertheless, I'm wondering whether there is a better strategy for running TPOT locally than the naive, dramatic subsampling I mentioned, such as varying the subsets used by individuals in the population.
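For what it's worth, the "different subset per individual" idea might look roughly like this (hypothetical names, not TPOT internals; the intent is that good individuals get re-scored on many different subsets across generations, averaging out subset noise):

```python
import random

def evaluate_population(population, data, score_fn, subset_size):
    """Score each individual on its own fresh random subset of the data,
    so no single evaluation ever touches the full dataset."""
    fitnesses = []
    for individual in population:
        subset = random.sample(data, subset_size)
        fitnesses.append(score_fn(individual, subset))
    return fitnesses
```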

rhiever commented 7 years ago

I bet we could hack this into TPOT for a quick test by passing TPOT's cv parameter a StratifiedShuffleSplit instead of KFold. If the StratifiedShuffleSplit's test set is a small sample (e.g. 10%) of the full training data, then that would achieve what we're looking for here.

rhiever commented 6 years ago

This feature was implemented in TPOT 0.8.