Closed 8 years ago
Interesting idea. We currently print the progress every generation of the algorithm, but perhaps we could also print the progress toward evaluating each generation's population along the way too. I certainly appreciate having a tqdm-like progress bar for long-running for loops.
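For context, the tqdm pattern being referenced is just wrapping an iterable so the loop reports its own progress (minimal sketch; `disable=True` is only here to keep the output quiet, in a terminal you would leave it enabled to see the live bar):

```python
from tqdm import tqdm

# Sum 0..999 while tqdm tracks the loop's progress.
total = 0
for i in tqdm(range(1000), disable=True):
    total += i

print(total)  # 499500
```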
Thanks! Now I see that it prints progress per step (with `verbosity=2` or all options?).
Yet, since my computer is relatively old, I thought it didn't (it was only after a few minutes on the `hill_valley_noisy_data` example that I got the first step; I didn't know if it was running at all). Or is the initialization particularly costly? (The next steps were much shorter.)
It only prints the progress per generation with `verbosity=2`. I was aiming for a "quiet" classifier by default.
> Or is the initialization particularly costly? (The next steps were much shorter.)
The initialization shouldn't be much more expensive than every other generation. Perhaps what happened was your initial population had a complex pipeline in it that takes a long time to evaluate, and the selection procedure for the next TPOT generation (that aims to maximize accuracy but minimize pipeline complexity) threw that slow pipeline out because it was too complex. That's my best guess.
I'm a big fan of this idea (and tqdm in general).
We'll have to figure out how to get tqdm wrapped into the evolutionary algorithm. As we discussed yesterday, @teaearlgraycold, the GP algorithm is all in the `eaSimple` call. Here's the `eaSimple` code: https://github.com/DEAP/deap/blob/master/deap/algorithms.py#L162
Unless we're going to monkey-patch over DEAP, I think we should consider the number of steps in the process of fitting a classifier pipeline to be equal to the population count multiplied by the generation count, and then we can use the manual approach to tqdm's stepping like so:
```python
def fit(self, features, classes):
    total = self.population_size * self.generations
    with tqdm(total=total) as self.pbar:
        # Continue with remainder of function
        ...

def _evaluate_individual(self, individual, training_testing_data):
    try:
        # Perform evaluation
        ...
    except Exception:
        # Catch-all: do not allow one pipeline that crashes to cause TPOT to crash
        # Instead, assign the crashing pipeline a poor fitness
        return 5000., 0.
    finally:
        self.pbar.update(1)  # One more pipeline evaluated
    # Return values etc.
```
That's assuming that the GP is scoring pipelines by using the `_evaluate_individual()` method during the different generations.
I like that idea. The only catch is that DEAP caches results for duplicate pipelines in the current generation, so it's likely that the actual number of `_evaluate_individual()` calls will be less than `population_size` × `generations`.
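The caching effect can be illustrated with a toy memo table (the names here are illustrative; DEAP's actual cache works differently, but the consequence for the progress bar is the same):

```python
# Duplicate pipelines in a generation are looked up in a cache, so the
# evaluator runs fewer times than population_size * generations.
evaluations = 0
cache = {}

def evaluate(pipeline):
    global evaluations
    if pipeline in cache:              # duplicate: reuse the cached fitness
        return cache[pipeline]
    evaluations += 1                   # only unique pipelines cost an evaluation
    cache[pipeline] = (len(pipeline), 1.0)
    return cache[pipeline]

population = ["a", "b", "a", "c", "b"]  # 5 individuals, only 3 unique
for individual in population:
    evaluate(individual)

print(evaluations)  # 3, not 5 -- a per-evaluation bar would fall behind
```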
BTW: Probably better to just assign `self.pbar` in `__init__`.

... and of course all of this will only occur when `verbosity >= 2`.
We could initialize it to `None` in the `__init__()` method, but tqdm is made to work with the `with` block so that it can add pre- and post-processing (initializing the bar, hiding it after the block of code finishes), so it shouldn't be initialized in `__init__()`.
Is there some piece of code that DEAP touches once a generation finishes? I could have it bump the progress bar up to the correct level for the current generation if cached pipelines were executed.
That's unfortunate w.r.t. the `with` statement. That's an entire indentation level on code that's already fairly indented.
> Is there some piece of code that DEAP touches once a generation finishes? I could have it bump the progress bar up to the correct level for the current generation if cached pipelines were executed.
It would be super hacky and probably not good practice, but you could probably inject something here: https://github.com/rhiever/tpot/blob/master/tpot/tpot.py#L259
That line is used to calculate the statistics (min, mean, max) at the end of every generation. Maybe worth digging around the DEAP docs more.
Looks like it might be a bit better to do it in the selection method, `_combined_selection_operator`. That occurs at the start of every generation.
Since it takes quite a lot of time to get the results: is there any way to use it with a progress bar, e.g. tqdm? (For example, per generation; a `.next()` method would suffice.) Or even better, a built-in progress bar as in PyMC3 (for an example: https://github.com/markdregan/Bayesian-Modelling-in-Python/blob/master/Section%203.%20Hierarchical%20modelling.ipynb).