EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

Add progress bar #140

Closed · stared closed this issue 8 years ago

stared commented 8 years ago

Since it takes quite a lot of time to get results: is there any way to use it with a progress bar, e.g. tqdm? (For example, per generation; a .next() method would suffice.)

Or even better, a built-in progress bar as in PyMC3 (see, for example: https://github.com/markdregan/Bayesian-Modelling-in-Python/blob/master/Section%203.%20Hierarchical%20modelling.ipynb).
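
For illustration, a minimal sketch of the first option, assuming the hypothetical .next() step method mentioned above (not an existing TPOT API):

    from tqdm import tqdm
    from tpot import TPOT

    tpot = TPOT(generations=100)

    # Hypothetical: if TPOT exposed a single-generation step method,
    # wrapping the training loop in tqdm would be a one-liner
    for _ in tqdm(range(100)):
        tpot.next(features, classes)  # .next() is the suggestion, not real API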

rhiever commented 8 years ago

Interesting idea. We currently print the progress every generation of the algorithm, but perhaps we could also report progress within each generation as its population is evaluated. I certainly appreciate having a tqdm-like progress bar for long-running for loops.

stared commented 8 years ago

Thanks! Now I see that it prints progress per step (with verbosity=2, or with all options?). Yet, since my computer is relatively old, I thought it wasn't printing anything (only after a few minutes on the hill_valley_noisy_data example did I get the first step; I didn't know if it was running at all).

Or is the initialization particularly costly? (The next steps were much shorter)

rhiever commented 8 years ago

It only prints the progress per generation with verbosity=2. I was aiming for a "quiet" classifier by default.

> Or is the initialization particularly costly? (The next steps were much shorter)

The initialization shouldn't be much more expensive than every other generation. Perhaps what happened was your initial population had a complex pipeline in it that takes a long time to evaluate, and the selection procedure for the next TPOT generation (that aims to maximize accuracy but minimize pipeline complexity) threw that slow pipeline out because it was too complex. That's my best guess.

danthedaniel commented 8 years ago

I'm a big fan of this idea (and tqdm in general).

rhiever commented 8 years ago

We'll have to figure out how to get tqdm wrapped into the evolutionary algorithm. As we discussed yesterday, @teaearlgraycold, the GP algorithm is all in the eaSimple call. Here's the eaSimple code: https://github.com/DEAP/deap/blob/master/deap/algorithms.py#L162
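
For reference, the core of eaSimple is a plain generational loop; here is a condensed sketch of the linked DEAP source (stats, logbook, and hall-of-fame bookkeeping omitted):

    # Condensed sketch of deap.algorithms.eaSimple's main loop
    for gen in range(1, ngen + 1):
        # Select and vary the next generation
        offspring = toolbox.select(population, len(population))
        offspring = varAnd(offspring, toolbox, cxpb, mutpb)

        # Only individuals whose fitness was invalidated get (re-)evaluated,
        # so duplicates of already-scored pipelines are skipped
        invalid_ind = [ind for ind in offspring if not ind.fitness.valid]
        fitnesses = toolbox.map(toolbox.evaluate, invalid_ind)
        for ind, fit in zip(invalid_ind, fitnesses):
            ind.fitness.values = fit

        population[:] = offspring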

danthedaniel commented 8 years ago

Unless we're going to monkey-patch DEAP, I think we should treat the number of steps in fitting a classifier pipeline as the population count multiplied by the generation count, and then use tqdm's manual stepping like so:

from tqdm import tqdm

def fit(self, features, classes):
    # Upper bound on progress steps: every individual in every generation
    total = self.population_size * self.generations

    with tqdm(total=total) as self.pbar:
        ...  # continue with the remainder of the function

def _evaluate_individual(self, individual, training_testing_data):
    try:
        # Perform evaluation
        ...
    except Exception:
        # Catch-all: do not allow one crashing pipeline to crash TPOT;
        # instead, assign the crashing pipeline a poor fitness
        return 5000., 0.
    finally:
        self.pbar.update(1)  # One more pipeline evaluated

    # Return values etc.

That's assuming that the GP is scoring pipelines by using the _evaluate_individual() method during the different generations.
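
That is, assuming a toolbox registration roughly along these lines (a sketch; the exact line in tpot.py may differ):

    # Assumption: with a registration like this, eaSimple ends up calling
    # _evaluate_individual once per non-cached individual, so the finally
    # clause above ticks the bar exactly once per evaluation
    self._toolbox.register('evaluate', self._evaluate_individual,
                           training_testing_data=training_testing_data)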

rhiever commented 8 years ago

I like that idea. The only catch is that DEAP caches results for duplicate pipelines in the current generation, so the actual number of _evaluate_individual() calls will likely be less than population size × generations.

BTW: Probably better to just assign self.pbar in __init__.

... and of course all of this will only occur when verbosity >= 2.
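
A sketch of that verbosity gate, reusing the attribute names from the snippet above (where exactly the bar gets created is still an open question):

    # Gate everything on verbosity; pop_size * generations is an upper
    # bound, since cached pipelines are never re-evaluated
    if self.verbosity >= 2:
        self.pbar = tqdm(total=self.population_size * self.generations)

    # ... and inside _evaluate_individual's finally clause:
    if self.verbosity >= 2:
        self.pbar.update(1)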

danthedaniel commented 8 years ago

We could assign a None placeholder in __init__(), but tqdm is designed around the with block so that it can perform pre- and post-processing (initializing the bar, hiding it after the block of code finishes), so the bar itself shouldn't be created in __init__().

Is there some piece of code that DEAP touches once a generation finishes? I could have it bump the progress bar up to the correct level for the current generation whenever cached pipelines were skipped.
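
For reference, the pattern tqdm is built around (a generic sketch, not TPOT code; pipelines and evaluate are placeholders):

    from tqdm import tqdm

    pipelines = [...]  # placeholder: whatever iterable of work items

    # The context manager builds the bar on entry and closes/clears it on
    # exit, even if the body raises; that's the pre- and post-processing
    # mentioned above
    with tqdm(total=len(pipelines)) as pbar:
        for pipeline in pipelines:
            evaluate(pipeline)  # placeholder for the real evaluation call
            pbar.update(1)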

rhiever commented 8 years ago

That's unfortunate with respect to the with statement: that adds an entire indentation level to code that's already fairly indented.

> Is there some piece of code that DEAP touches once a generation finishes? I could have it bump the progress bar up to the correct level for the current generation whenever cached pipelines were skipped.

It would be super hacky and probably not good practice, but you could probably inject something here: https://github.com/rhiever/tpot/blob/master/tpot/tpot.py#L259

That line is used to calculate the statistics (min, mean, max) at the end of every generation. Maybe worth digging around the DEAP docs more.

danthedaniel commented 8 years ago

Looks like it might be a bit better to do it in the selection method, _combined_selection_operator. That occurs at the start of every generation.
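
A sketch of what that bump could look like; the per-generation counter and other names here are assumptions about TPOT's internals, not existing code:

    def _combined_selection_operator(self, individuals, k):
        # Runs at the start of each generation, so fast-forward the bar past
        # any cached pipelines that were skipped in the previous generation
        if self.pbar is not None:
            self._gens_completed += 1  # hypothetical per-generation counter
            expected = self._gens_completed * self.population_size
            if self.pbar.n < expected:
                self.pbar.update(expected - self.pbar.n)
        # ... existing selection logic continues here ...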