Open rhiever opened 8 years ago
Should we choose a couple/few data sets to test on, to try and create a more robust analysis? Which ones might be most appropriate? MNIST, wine, breast cancer?
If you run this script, you'll have access to a whole bunch of 'em. Take your pick. :-)
I think just one data set is fine to start with though, as a proof of concept.
Ping. Run into any issues with this?
I've created my own version of the eaSimple algorithm where we can dig into the individual/ensemble statistics, but I've had difficulty in exposing and aggregating each individual pipeline's guesses. But in the last day or so I've broken through that, and have started getting numbers. I think I'm gonna spin up a cheap AWS instance and just run a ton of tests.
I don't think you'll need to roll your own version of eaSimple. Here's some code from another project I ran where you can store the population in the log and then do post-analysis on the population in the log.
import copy
import numpy as np
from deap import algorithms, tools

# pop, toolbox, and hof are assumed to be defined earlier in the script
# Track each individual's two fitness values (first as an int, second rounded)
stats = tools.Statistics(lambda ind: (int(ind.fitness.values[0]), round(ind.fitness.values[1], 2)))
stats.register("Minimum", np.min, axis=0)
stats.register("Maximum", np.max, axis=0)
# This should store a copy of pop every generation
stats.register("Population", lambda x: copy.deepcopy(pop))
# Use normal TPOT settings, of course -- not these settings
pop, log = algorithms.eaSimple(pop, toolbox, cxpb=0., mutpb=0.5, ngen=1000,
                               stats=stats, halloffame=hof, verbose=False)
Let me know if that works. Alternatively, you can modify the HOF to store the top 100 best pipelines discovered so far, and change
stats.register("Population", lambda x: copy.deepcopy(pop))
to
stats.register("HOF", lambda x: copy.deepcopy(hof))
and that will only change the analysis slightly -- using the best 100 pipelines ever as the ensemble instead of the pipelines currently in the population.
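As a side note on why the copy.deepcopy call matters: the small stand-alone sketch below uses no DEAP at all and assumes that the evolutionary loop overwrites the contents of the population list in place. Under that assumption, storing a bare reference each generation leaves every log entry pointing at the final population, while a deep copy keeps a real per-generation snapshot. The toy population and the "add one to every gene" update are made up purely for illustration.

```python
import copy

population = [[0, 0], [1, 1]]  # stand-in for a list of individuals
by_reference, by_copy = [], []

for generation in range(3):
    # Assumed behaviour: the loop overwrites the list's contents in place.
    population[:] = [[gene + 1 for gene in ind] for ind in population]
    by_reference.append(population)            # same list object every time
    by_copy.append(copy.deepcopy(population))  # independent snapshot

# Every "by_reference" entry shows the last generation only,
# while "by_copy" keeps the history of all three generations.
print(by_reference[0], by_copy[0])
```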
Interesting, I had convinced myself that the Statistics object wouldn't be able to give us access to the population directly. I'll test this out; it shouldn't change my analysis after the fact that much.
Got it working with the statistics object, thanks for the tips. I'm gonna spin up these tests.
Great! :+1:
Some of the shorter tests are wrapping up and I think I have enough for some preliminary results -- I'll try to clean things up and link them here in the next couple of days.
Let me know if you want to schedule another video chat. I'm excited to hear how this turned out!
Hey so I've cleaned up some of my data and made it available here. I've been trying to come up with useful visualizations, and figured it'd be more productive to share it in the meantime.
What does each of the new columns represent? I'm looking at the data this morning.
Alright, so I took the same ideas from the consensus operators that we tried and applied them here. Each individual / population is evaluated on the test dataset.
acc* – each individual's guess is weighted according to their individual accuracy.
uni* – each individual's guess has the same weight.
_max_class – the class that has the highest weight (or in the uni case, the highest frequency) is the ensemble's guess for that test instance.
_mean_class – the class with the mean weight / frequency is the ensemble's guess.
_median_class – the class with the median weight / frequency is the ensemble's guess.
_min_class – the class with the minimum weight / frequency is the ensemble's guess.
*_threshold_class – the first class that passes a certain threshold in percentage of weight is the ensemble's guess.
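To make the acc/uni distinction and the max and threshold rules concrete, here is a rough sketch for a single test instance. It is not the code that produced the data; the function names, the 0.5 default threshold, and the choice to scan classes in sorted label order for "first class" are all assumptions made for illustration.

```python
from collections import defaultdict

def ensemble_vote(predictions, accuracies=None):
    """Return each class's share of the total vote weight for one test instance.

    predictions: one class label per pipeline.
    accuracies:  optional per-pipeline accuracies (assumed positive). When given,
                 each guess is weighted by it ("acc"); otherwise every guess
                 counts equally ("uni").
    """
    if accuracies is None:
        accuracies = [1.0] * len(predictions)
    weights = defaultdict(float)
    for label, acc in zip(predictions, accuracies):
        weights[label] += acc
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}

def max_class(shares):
    # Class with the largest share; ties go to the smallest label.
    return max(sorted(shares), key=lambda label: shares[label])

def threshold_class(shares, threshold=0.5):
    # First class (in sorted label order) whose share reaches the threshold.
    for label in sorted(shares):
        if shares[label] >= threshold:
            return label
    return None  # no class reached the threshold

guesses = [1, 1, 0]          # three pipelines vote on one instance
accs = [0.4, 0.4, 0.9]       # made-up individual accuracies
uni_guess = max_class(ensemble_vote(guesses))        # two votes beat one
acc_guess = max_class(ensemble_vote(guesses, accs))  # 0.9 outweighs 0.4 + 0.4
```

With these made-up numbers the uniform vote picks class 1 and the accuracy-weighted vote picks class 0, which is the kind of disagreement the acc and uni columns are meant to expose.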
I'm getting a lot of variance in a few of the columns, so I may run some more trials.
What benchmarks are you running it on? It looks like the classification accuracy for many of the runs is fairly high, so there probably isn't much room for ensembles to improve. What about a harder data set, e.g., GAMETES-hard? Maybe we should just run a large benchmark on the HPCC?
You're right that the data I was using was perhaps too easy: my testing code used the sklearn digits dataset rather than MNIST! This is embarrassing to say the least. On the bright side, at least these tests suggest that the operators are somewhat robust in the smaller-data, slightly-longer, slightly-bigger population 'regime'.
In the interest of time, how about I run the same tests on random samples from GAMETES-hard and MNIST proper to see if there's promise, and in the meantime we can prep for a larger HPCC benchmark? I can run my tests in a more parallel manner so it's not a week turnaround.
Sounds good to me. Want to send in a PR on this branch?
I'm currently finishing up some other TPOT benchmarks -- shouldn't take more than the weekend -- but I can slate this benchmark for the next batch.
One of the common arguments against population-based optimization methods is that they are significantly slower than methods that work with one (or a few) solutions at a time. I think one smart way to turn that argument on its head would be to see if creating an ensemble out of the TPOT population would be useful.
An initial exploration could be to run TPOT as normal, and collect additional statistics about the performance of the population as an ensemble. This could be done with a very "hacky" version of TPOT; no need to engineer it before we prove this idea's efficacy.
Basically, for every generation:
1) Store the classifications of every individual
2) Use various ensemble methods to combine their classifications into a single classification (min, max, threshold, majority, weighted based on performance on training set)
3) Plot the effectiveness of all of these population ensemble methods over time
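A minimal sketch of steps 1 and 2, using only majority voting and hand-written toy predictions in place of real TPOT pipelines. The nested-list layout of `history` and both function names are assumptions for illustration; the resulting per-generation accuracies are the series that step 3 would plot.

```python
from collections import Counter

def majority_vote(guesses):
    # Most common label among the guesses; ties go to the smallest label.
    counts = Counter(guesses)
    return max(sorted(counts), key=lambda label: counts[label])

def ensemble_accuracy_per_generation(history, y_true):
    """history[g][i][j] is pipeline i's guess for test instance j in generation g."""
    curve = []
    for generation in history:
        # Transpose so each tuple holds all pipelines' guesses for one instance.
        per_instance = zip(*generation)
        ensemble = [majority_vote(guesses) for guesses in per_instance]
        correct = sum(p == t for p, t in zip(ensemble, y_true))
        curve.append(correct / len(y_true))
    return curve

y_true = [0, 1, 1, 0]
history = [
    [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]],  # generation 0: three pipelines
    [[0, 1, 1, 0], [0, 1, 0, 0], [1, 1, 1, 0]],  # generation 1: three pipelines
]
curve = ensemble_accuracy_per_generation(history, y_true)
print(curve)  # one ensemble accuracy per generation
```

Swapping `majority_vote` for a weighted or threshold rule gives the other ensemble variants, so several curves can be compared on the same axes.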
What to look for: