EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

TPOT freezes at 0% #1008

ghost closed 4 years ago

ghost commented 4 years ago

We have a large dataset (>500K points) with more than 20 features and two classes. The data is properly scaled and NaNs are imputed. The settings we are running are:

pipelineOptimizer = TPOTClassifier(generations=1, population_size=1, verbosity=3, config_dict=tpot_config, scoring='balanced_accuracy', early_stop=10, max_time_mins=100, n_jobs=1)

The config contains only one random forest configuration. We are running it on both Windows and Linux machines, each with sklearn version 0.22.1 and TPOT version 0.10.2.
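For readers following along, a config restricted to a single random forest might look like the sketch below. The actual tpot_config used here was not posted, so every hyperparameter value shown is an assumption chosen only for illustration. TPOT config dictionaries map an importable operator path to a dictionary of hyperparameter names and lists of candidate values.

```python
# Hypothetical single-estimator config; the real tpot_config was not shared
# in this thread, so all values below are illustrative assumptions.
tpot_config = {
    "sklearn.ensemble.RandomForestClassifier": {
        "n_estimators": [100],      # assumed value
        "criterion": ["gini"],      # assumed value
        "max_features": ["sqrt"],   # assumed value
    }
}

# With one candidate per hyperparameter, the search space has one point.
n_candidates = 1
for values in tpot_config["sklearn.ensemble.RandomForestClassifier"].values():
    n_candidates *= len(values)
print(n_candidates)
```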

Current result

1 operators have been imported by TPOT.
Skipped pipeline #1 due to time out. Continuing to the next pipeline.
Traceback (most recent call last):
File "C:\anaconda\lib\site-packages\tpot\base.py", line 752, in fit
per_generation_function=self._check_periodic_pipeline
File "C:\anaconda\lib\site-packages\tpot\gp_deap.py", line 236, in eaMuPlusLambda
per_generation_function(gen)
File "C:\anaconda\lib\site-packages\tpot\base.py", line 1039, in _check_periodic_pipeline
self._update_top_pipeline()
File "C:\anaconda\lib\site-packages\tpot\base.py", line 835, in _update_top_pipeline
raise RuntimeError('There was an error in the TPOT optimization '
RuntimeError: There was an error in the TPOT optimization process. This could be because the data was not formatted properly, or because data for a regression problem was provided to the TPOTClassifier object. Please make sure you passed the data to TPOT correctly.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "mlTpot.py", line 201, in <module>
callTpotForSupervisedLearning()#nodeProp,malnodeObserved,timeSeriesFeaturesFiles)
File "mlTpot.py", line 154, in callTpotForSupervisedLearning
pipelineOptimizer.fit(xtrain,ytrain)
File "C:\anaconda\lib\site-packages\tpot\base.py", line 784, in fit
raise e
File "C:\anaconda\lib\site-packages\tpot\base.py", line 775, in fit
self._update_top_pipeline()
File "C:\anaconda\lib\site-packages\tpot\base.py", line 835, in _update_top_pipeline
raise RuntimeError('There was an error in the TPOT optimization '
RuntimeError: There was an error in the TPOT optimization process. This could be because the data was not formatted properly, or because data for a regression problem was provided to the TPOTClassifier object. Please make sure you passed the data to TPOT correctly.

weixuanfu commented 4 years ago

The issue is that max_eval_time_mins is too small for evaluating a pipeline on a dataset this large, and random forest may be very time-consuming with a large n_estimators. Please increase max_eval_time_mins to 20. If that does not work or is too slow, please use subsample to randomly down-sample the dataset.
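A minimal sketch of that suggestion, reusing the settings from the original report and adding max_eval_time_mins and, optionally, subsample. The helper names suggested_settings and build_optimizer are hypothetical, and the subsample fraction in the usage line is an assumed value, not a recommendation from this thread. TPOT is imported lazily so the settings can be inspected without it installed.

```python
def suggested_settings(tpot_config, max_eval_time_mins=20, subsample=None):
    """Return TPOTClassifier keyword arguments based on the reported call.

    Hypothetical helper: it only assembles arguments, it does not run TPOT.
    """
    settings = dict(
        generations=1,
        population_size=1,
        verbosity=3,
        config_dict=tpot_config,
        scoring="balanced_accuracy",
        early_stop=10,
        max_time_mins=100,
        max_eval_time_mins=max_eval_time_mins,  # per-pipeline time budget
        n_jobs=1,
    )
    if subsample is not None:
        # subsample is the fraction of training rows used during optimization
        if not 0.0 < subsample <= 1.0:
            raise ValueError("subsample must be in (0.0, 1.0]")
        settings["subsample"] = subsample
    return settings


def build_optimizer(settings):
    """Construct the optimizer; requires TPOT to be installed."""
    from tpot import TPOTClassifier
    return TPOTClassifier(**settings)


# Assumed fraction for illustration only.
settings = suggested_settings(tpot_config={}, subsample=0.2)
print(settings["max_eval_time_mins"], settings["subsample"])
```

Calling build_optimizer(settings).fit(xtrain, ytrain) would then run the search with the larger per-pipeline budget.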

ghost commented 4 years ago

Thanks for your response. We tried the option you suggested. Optimization now reaches 100%, but sometimes it gives the same error as reported above, while other times it reports 50% balanced_accuracy. When running a default RF outside TPOT, we get balanced_accuracy > 87%. We are not sure where the issue is or how to solve it.
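As context for the 50% figure: balanced accuracy is the mean of per-class recall, so on a two-class problem a model that always predicts one class scores exactly 0.5. The sketch below shows this on made-up labels. It is only one possible reading of the number and does not establish what TPOT did in this run.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)


# Made-up imbalanced labels: 90 negatives, 10 positives.
y_true = [0] * 90 + [1] * 10
always_majority = [0] * 100  # constant prediction

score = balanced_accuracy(y_true, always_majority)
print(score)  # 0.5: recall is 1.0 for class 0 and 0.0 for class 1
```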

weixuanfu commented 4 years ago

Since there is only one estimator (random forest) in the config_dict, you may try template="Classifier" to avoid complicated stacking pipelines (more than one estimator) and save computational time when evaluating each pipeline.
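A sketch of how the template argument could be added to the reported call. The other arguments are copied from the original post plus the max_eval_time_mins value suggested earlier in this thread. The names TEMPLATE_SETTINGS and build_single_step_optimizer are hypothetical, and TPOT is imported lazily inside the function.

```python
# Settings from the original report, with template="Classifier" added so
# each candidate pipeline is a single classifier step rather than a stack.
TEMPLATE_SETTINGS = dict(
    generations=1,
    population_size=1,
    verbosity=3,
    scoring="balanced_accuracy",
    early_stop=10,
    max_time_mins=100,
    max_eval_time_mins=20,
    n_jobs=1,
    template="Classifier",
)


def build_single_step_optimizer(tpot_config):
    """Construct the optimizer; requires TPOT to be installed."""
    from tpot import TPOTClassifier
    return TPOTClassifier(config_dict=tpot_config, **TEMPLATE_SETTINGS)


print(TEMPLATE_SETTINGS["template"])
```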

ghost commented 4 years ago

Thanks for the help.