EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

pipeline not fitted even though fit was called #308

Closed geoHeil closed 8 years ago

geoHeil commented 8 years ago

pipeline does not seem to be fitted even though fit was called

Context of the issue

Even though fit is executed, I get the following error when I try to obtain the best result:

tpot = TPOTClassifier(verbosity=2, max_time_mins=10)
tpot.fit(X_train, y_train)
tpot.export('tpot_pipe.py')
print(tpot.score(X_test, y_test))

The error

ValueError                                Traceback (most recent call last)
<ipython-input-10-7d10df09096d> in <module>()
----> 1 tpot.export('tpot_pipe.py')
      2 #print(tpot.score(X_test, y_test))

/usr/local/lib/python3.5/site-packages/tpot/base.py in export(self, output_file_name)
    471         """
    472         if self._optimized_pipeline is None:
--> 473             raise ValueError('A pipeline has not yet been optimized. Please call fit() first.')
    474 
    475         with open(output_file_name, 'w') as output_file:

ValueError: A pipeline has not yet been optimized. Please call fit() first.
weixuanfu commented 8 years ago

Could you please post the stdout from the fit() call? The best pipeline should be printed out if fit() finished normally. Please also let us know which platform this code ran on. More details will help us find the bug causing this issue.

weixuanfu commented 8 years ago

Also, please test the code above again without max_time_mins=10. This parameter overrides the generations parameter and kills the fit() process after 10 minutes. If the process does not find a best pipeline within the time limit, no fitted pipeline will be exported. I think that may be the cause of this issue.

Update: the code below can be used to reproduce the issue. I think we need to add a friendly warning about using this parameter when running time-consuming jobs. Sorry for the confusion.

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from tpot import TPOTClassifier
X, y = make_classification(n_samples=200, n_features=100,
                           n_informative=2, n_redundant=10,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(verbosity=2, max_time_mins=1)
tpot.fit(X_train, y_train)
tpot.export('tpot_pipe.py')
print(tpot.score(X_test, y_test))
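
If you do keep a hard time limit, one way to avoid the traceback from the original report is to guard the export call. This is just a sketch that relies only on the ValueError shown in the traceback above (export() raising when no pipeline has been optimized), not a guaranteed API:

try:
    tpot.export('tpot_pipe.py')
    print(tpot.score(X_test, y_test))
except ValueError:
    # fit() hit the time limit before any pipeline finished evaluating,
    # so there is nothing to export yet; refit with a larger max_time_mins.
    print('No optimized pipeline yet -- increase max_time_mins and refit.')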
geoHeil commented 8 years ago

When I remove the option, it runs longer:

2016-11-08 18:05:18,713 INFO -- MainProcess connectionpool.py:214 -- Starting new HTTP connection (1): update_checker.bryceboe.com
Optimization Progress:   0%|          | 7/10100 [10:51<201:40:59, 71.94s/pipeline]
Timeout during evaluation of pipeline #7. Skipping to the next pipeline.
Optimization Progress:   0%|          | 14/10100 [14:08<88:17:27, 31.51s/pipeline]
Timeout during evaluation of pipeline #14. Skipping to the next pipeline.
Optimization Progress:   0%|          | 16/10100 [17:45<182:17:02, 65.08s/pipeline]
Timeout during evaluation of pipeline #16. Skipping to the next pipeline.
Optimization Progress:   0%|          | 19/10100 [25:09<374:49:26, 133.85s/pipeline]
Timeout during evaluation of pipeline #19. Skipping to the next pipeline.
Optimization Progress:   0%|          | 22/10100 [25:23<136:50:45, 48.88s/pipeline]

I hope it works. How long does such a simulation usually run? 10100 seems to be a lot.

weixuanfu commented 8 years ago

The default settings in TPOT for the generation number and population size are population_size=100 and generations=100. So, counting generation 0, the number of pipeline evaluations is 100 * (100 + 1) = 10100. I think the dataset you are using must be very large, since many pipelines were skipped due to the time limit for evaluating a single pipeline (max_eval_time_mins=5 by default).
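
For reference, the 10100 figure in the progress bar comes from that arithmetic (a rough sketch of the bookkeeping with the default parameters, not the exact internals):

population_size = 100   # TPOT default
generations = 100       # TPOT default
# generation 0 plus each of the 100 following generations evaluates one population
total_evaluations = population_size * (generations + 1)
print(total_evaluations)  # 10100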

To estimate the speed of the search on your dataset, I suggest reducing the generation number to ~10 and increasing max_eval_time_mins for your dataset by passing generations=10, max_eval_time_mins=10 to TPOTClassifier. I also suggest running this time-consuming process on a Linux platform. A strange bug related to #300 was just found on macOS on a MacBook Pro, and we will fix it in the next version of TPOT. The current version of TPOT is stable on Linux (and should also be fine on Windows).
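
Concretely, something along these lines (the values are just starting points for gauging runtime on your data, not recommended defaults):

tpot = TPOTClassifier(verbosity=2, generations=10, max_eval_time_mins=10)
tpot.fit(X_train, y_train)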

geoHeil commented 8 years ago

good point regarding linux - the python process just crashed on my macbook :(

BasmaRG commented 3 years ago

Thanks, the suggestion above works for me.