EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

TPOT2 and the future of TPOT development -- From the Devs #1322

Open perib opened 1 year ago

perib commented 1 year ago

Since the release of TPOT in 2016, we and others have experimented with several ideas and improvements to the algorithm. However, due to the structure of TPOT's codebase, it has been difficult to merge all these features under one package. TPOT's code can be challenging to parse and modify. The result is a fragmented development space with different features and ideas existing in isolation on different forks. Due to this, it is hard to conduct research with TPOT.

We have decided to refactor the code base to make future research and development easier. The main goals of the refactor are to improve modularity, extendability, and maintainability. We want it to be easier for users to experiment with the algorithm and to contribute to the project.

Current Status of TPOT2

TPOT2 is in Alpha and mostly has feature parity with the original TPOT1.

Currently, the user-facing TPOTClassifier and TPOTRegressor classes are reasonably stable and unlikely to see many changes. One benefit of the simplified API is that we can update the algorithm under the hood without drastically changing the user experience.

We are still working on ensuring the backend meets our modularity, flexibility, maintainability, and extendability goals. There may be changes to the underlying code as we improve the software engineering (feedback is welcome!).

Differences between TPOT1 and TPOT2 - Porting your code

From the user's perspective, using TPOT1 and TPOT2 is very similar. We recommend you take a look at the TPOT2 Tutorials folder for Jupyter notebooks with examples.

**Estimators** Both wrap the AutoML algorithm within a scikit-learn estimator, though the parameters may differ slightly, and we encourage users to read the documentation. TPOT1 provides the TPOTClassifier and TPOTRegressor classes. These are also present in TPOT2, though they have fewer parameters (for example, in TPOT2 the user does not need to provide the number of generations or the population size). The goal for these classes in TPOT2 is to reduce the number of decisions and parameters, abstracting away the evolutionary algorithm and simplifying the experience for users. Currently, TPOT2 simply uses default values for the removed parameters, but in the future we will look into implementing a meta-learner similar to Auto-Sklearn. (If users want to tune all of the parameters manually, they are currently available in the TPOTEstimator class.) The configuration dictionaries also have a different structure to make them compatible with Optuna.
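To illustrate the Optuna-compatible structure, a configuration entry might be written as a function that draws each hyperparameter from an Optuna-style trial object. The function and hyperparameter names below are hypothetical, not TPOT2's actual schema; a tiny stand-in trial class is included so the sketch runs without Optuna installed.

```python
# Hypothetical search-space function in the Optuna style: instead of a nested
# dict of value grids (TPOT1), each hyperparameter is requested from a trial.
def rf_search_space(trial):
    return {
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "max_features": trial.suggest_float("max_features", 0.1, 1.0),
        "bootstrap": trial.suggest_categorical("bootstrap", [True, False]),
    }

# Minimal stand-in mirroring the parts of Optuna's trial API used above,
# so the sketch is self-contained.
class FixedTrial:
    def __init__(self, values):
        self.values = values
    def suggest_int(self, name, low, high):
        return self.values[name]
    def suggest_float(self, name, low, high):
        return self.values[name]
    def suggest_categorical(self, name, choices):
        return self.values[name]

params = rf_search_space(
    FixedTrial({"n_estimators": 100, "max_features": 0.5, "bootstrap": True})
)
print(params)
```

The benefit of this style is that the same function works for random sampling, grid-like sweeps, or any Optuna sampler, because the sampling strategy lives in the trial object rather than in the configuration itself.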

**Results** The outputs in TPOT2 have been simplified to be more user-friendly. The `fitted_pipeline_` attribute still points to the fitted pipeline chosen by the algorithm. The `evaluated_individuals` and `pareto_front` attributes now return an organized Pandas DataFrame containing all of the evaluated pipelines, their scores, and other metadata.
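With everything in one table, ordinary Pandas operations replace ad-hoc log parsing. A rough sketch of the kind of query this enables is below; the column names are illustrative, not the exact schema.

```python
import pandas as pd

# Toy stand-in for the DataFrame of evaluated pipelines (illustrative columns).
evaluated = pd.DataFrame({
    "pipeline": ["LogisticRegression", "RandomForest", "GaussianNB"],
    "roc_auc_score": [0.91, 0.94, 0.88],
    "complexity": [12, 340, 6],
    "generation": [0, 1, 0],
})

# Rank every evaluated pipeline by score and inspect the best one.
best = evaluated.sort_values("roc_auc_score", ascending=False).iloc[0]
print(best["pipeline"])  # RandomForest
```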

**GraphPipeline** The last major difference is that TPOT2 now supports graph-based pipelines. To do this, we implemented our own graph estimator class that mirrors the scikit-learn Pipeline class.
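The core idea of a graph pipeline is that steps form a DAG rather than a chain: each node receives the column-stacked outputs of its parents (or the raw input if it has none). The toy class below is a minimal sketch of that idea, not TPOT2's actual GraphPipeline API.

```python
import numpy as np

class Standardize:
    """Toy transformer: z-score each column."""
    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0) + 1e-9
        return self
    def transform(self, X):
        return (X - self.mean_) / self.scale_

class Identity:
    def fit(self, X):
        return self
    def transform(self, X):
        return X

class ToyGraphPipeline:
    """Evaluate a DAG of transformers: each node sees its parents' outputs
    stacked column-wise, or the raw input if it has no parents."""
    def __init__(self, nodes, parents, root):
        self.nodes = nodes        # name -> transformer
        self.parents = parents    # name -> list of parent node names
        self.root = root          # name of the final node

    def _eval(self, name, X, cache, fitting):
        if name in cache:
            return cache[name]
        ps = self.parents.get(name, [])
        Xin = X if not ps else np.hstack(
            [self._eval(p, X, cache, fitting) for p in ps]
        )
        node = self.nodes[name]
        if fitting:
            node.fit(Xin)
        cache[name] = node.transform(Xin)
        return cache[name]

    def fit_transform(self, X):
        return self._eval(self.root, X, {}, fitting=True)

    def transform(self, X):
        return self._eval(self.root, X, {}, fitting=False)

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
gp = ToyGraphPipeline(
    nodes={"scale": Standardize(), "raw": Identity(), "merge": Identity()},
    parents={"merge": ["scale", "raw"]},
    root="merge",
)
Z = gp.fit_transform(X)
print(Z.shape)  # (3, 4): scaled features stacked next to the raw ones
```

This shape (a branching node that merges two feature views) is exactly what a linear scikit-learn Pipeline cannot express, which is why a dedicated graph estimator is needed.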

**Bug fixes**
- TPOT1 has a bug in which it cannot terminate some pipelines after the time-out, causing it to run endlessly: #876 #645 #905 #508 #1214 #1200 #1107 #875 #797 #780
- More flexible pipeline definitions allow FSS to be included only in leaf nodes, preventing undefined behavior when they are set in inner nodes, without restricting the pipeline to a linear shape: #1250
- No more duplication as a result of stacking estimators: #1242
- Better Dask handling: #779 #304

**Other issues resolved**
- Better logs and attributes for accessing all evaluated individuals: #1318 #1229 #982 #800 #780 #337
- Parameter for encoding of categorical/ordinal columns: #1237
- Support for the memory parameter in Dask: #1228 #961
- TPOT2 can account for cases where the number of samples for a class is smaller than the number of CV folds: #1220
- More flexible pipeline search space definitions, including a preprocessing step: #1190 #1182 #479
- Support for custom, user-defined multi-objective functions, including a complexity function that estimates the number of learned parameters: #1045 #783
- Resume a TPOT2 run from a checkpoint: #977
- Stop at the first condition met: #504

**Planned features**
- Meta-learning: #1254
- Covariate adjustment: #1311 #1209
- Callbacks: #678
- Better ensembling support: #479 #105
- Better visualizations: #337
- Better/custom initializations: #59

What does this mean for TPOT1?

We will not be developing new features for TPOT1. We may fix minor bugs and dependency issues as they arise to maintain compatibility for continuing users. However, going forward, our primary focus will be on developing TPOT2.

You can find the TPOT2 repository here: https://github.com/EpistasisLab/tpot2/tree/main

Thank you for your interest in TPOT

We would love any feedback from the community! Let us know what you would like to see. Feel free to open an issue on the new page if you have any questions, suggestions, contributions, or bugs to report.