Open rjstange opened 4 years ago
Skipping pipelines is fine because TPOT should avoid evaluating duplicated pipelines.
About the feature_names error: I think it may come from xgboost. You could convert the pandas.DataFrame into a numpy.ndarray (e.g. df.values) as input X (related to issue #738).
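The conversion suggested above can be sketched like this (the frame below is a tiny hypothetical stand-in for the real training data; the point is that df.values drops the column labels that xgboost is complaining about, so it falls back to its generic f0, f1, ... names consistently):

```python
import pandas as pd

# Hypothetical frame standing in for the real training data
df = pd.DataFrame({
    "acousticness": [0.503, 0.001, 0.007],
    "danceability": [0.768, 0.326, 0.476],
    "mode_Minor":   [0, 1, 0],
})

# df.values is a plain numpy.ndarray with no column labels, so xgboost
# sees the same anonymous feature names at fit and predict time
X = df.values

print(type(X).__name__)
print(X.shape)
```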
The joblib error message seems to indicate that TimeoutException was not working as expected. I am not sure how this happened. Could you please try your code without Jupyter notebook as a test?
BTW, for this large dataset, I suggest using the "TPOT light" configuration or customizing a TPOT configuration without PolynomialFeatures.
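A minimal sketch of that suggestion, assuming TPOT's documented config_dict parameter (the checkpoint path is illustrative; the cache directory is the one from the log below):

```python
from tpot import TPOTClassifier

# "TPOT light" restricts the search to fast, simple operators,
# which helps on large datasets like this one
tpot = TPOTClassifier(
    config_dict="TPOT light",
    verbosity=3,
    memory="D:/TPOT-Cache",                          # cache dir from the log
    periodic_checkpoint_folder="tpot_checkpoints",   # illustrative path
)
# tpot.fit(X, y)
```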
Thank you. I customized my TPOT config to exclude PolynomialFeatures and XGBoost, since I already know that XGBoost does not perform as well on my data set; hopefully this will speed up the process. Maybe it will also fix the joblib errors. I'll post an update after this runs for a while.
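One way to exclude those operators is to copy TPOT's built-in classifier config and delete the unwanted entries (a sketch assuming TPOT's documented classifier_config_dict, not the exact code used here):

```python
from copy import deepcopy

from tpot import TPOTClassifier
from tpot.config import classifier_config_dict

# Start from TPOT's default classifier search space
custom_config = deepcopy(classifier_config_dict)

# Drop the operators to exclude; keys follow TPOT's "module.ClassName" convention
custom_config.pop("sklearn.preprocessing.PolynomialFeatures", None)
custom_config.pop("xgboost.XGBClassifier", None)

tpot = TPOTClassifier(config_dict=custom_config, verbosity=3)
# tpot.fit(X, y)
```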
It has currently gone through over 400 pipelines. With verbosity set to 3 and both a memory cache and a periodic checkpoint folder set, I am seeing this pattern in the Jupyter notebook output box:
_pre_test decorator: _random_mutation_operator: num_test=0 feature_names mismatch: ['acousticness', 'danceability', 'duration_ms', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'valence', 'mode_Minor', 'time_signature_0/4', 'time_signature_1/4', 'time_signature_3/4', 'time_signature_4/4', 'time_signature_5/4', 'key_A', 'key_A#', 'key_B', 'key_C', 'key_C#', 'key_D', 'key_D#', 'key_E', 'key_F', 'key_F#', 'key_G', 'key_G#'] ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27'] expected key_F#, duration_ms, danceability, key_B, key_C#, key_G#, key_F, instrumentalness, key_D, energy, acousticness, valence, time_signature_1/4, liveness, key_D#, key_A#, tempo, key_C, key_A, loudness, time_signature_0/4, time_signature_3/4, mode_Minor, time_signature_5/4, speechiness, key_G, key_E, time_signature_4/4 in input data training data did not have the following fields: f25, f16, f24, f7, f21, f5, f19, f13, f6, f4, f2, f0, f15, f11, f18, f23, f10, f8, f9, f14, f27, f1, f17, f20, f12, f3, f26, f22. 
_pre_test decorator: _random_mutation_operator: num_test=0 Expected n_neighbors <= n_samples, but n_samples = 50, n_neighbors = 89.
_pre_test decorator: _random_mutation_operator: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required by StandardScaler.
_pre_test decorator: _random_mutation_operator: num_test=0 Input X must be non-negative.
Pipeline encountered that has previously been evaluated during the optimization process. Using the score from the previous evaluation.
Skipped pipeline #309 due to time out. Continuing to the next pipeline.
Skipped pipeline #312 due to time out. Continuing to the next pipeline.
Skipped pipeline #318 due to time out. Continuing to the next pipeline.
Skipped pipeline #320 due to time out. Continuing to the next pipeline.
Skipped pipeline #322 due to time out. Continuing to the next pipeline.
Skipped pipeline #325 due to time out. Continuing to the next pipeline.
Skipped pipeline #330 due to time out. Continuing to the next pipeline.
Skipped pipeline #340 due to time out. Continuing to the next pipeline.
Skipped pipeline #345 due to time out. Continuing to the next pipeline.
Skipped pipeline #350 due to time out. Continuing to the next pipeline.
Skipped pipeline #356 due to time out. Continuing to the next pipeline.
Skipped pipeline #362 due to time out. Continuing to the next pipeline.
Skipped pipeline #373 due to time out. Continuing to the next pipeline.
Skipped pipeline #380 due to time out. Continuing to the next pipeline.
Skipped pipeline #385 due to time out. Continuing to the next pipeline.
Skipped pipeline #389 due to time out. Continuing to the next pipeline.
Skipped pipeline #393 due to time out. Continuing to the next pipeline.
Skipped pipeline #398 due to time out. Continuing to the next pipeline.
And in the Jupyter notebook's command line I get this joblib error on a regular basis:
WARNING:root:[MemorizedFunc(func=<function _fit_transform_one at 0x000001C3A12F30D0>, location=D:/TPOT-Cache/joblib)]: Exception while loading results for _fit_transform_one(PolynomialFeatures(degree=2, include_bias=False, interaction_only=False, order='C'),
        acousticness  danceability  duration_ms  energy  instrumentalness  ...  key_E  key_F  key_F#  key_G  key_G#
18499        0.50300         0.768       215854  0.6040          0.000000  ...      0      0       0      0       0
121197       0.00105         0.326       208507  0.9260          0.000000  ...      0      0       0      0       0
4181         0.00677         0.476       309677  0.7980          0.091400  ...      0      0       0      0       0
283240       0.06390         0.652       188707  0.8590          0.000000  ...      0      0       0      0       0
163835       0.91700         0.273       111595  0.3660          0.210000  ...      0      0       0      0       0
...              ...           ...          ...     ...               ...  ...    ...    ...     ...    ...     ...
159826       0.68800         0.370       146680  0.3220          0.000010  ...      0      0       0      0       0
153946       0.99100         0.459       187373  0.0698          0.103000  ...      0      0       1      0       0
216231       0.00280         0.658       267551  0.8970          0.000002  ...      0      0       0      0       0
271592       0.81300         0.692       177453  0.2010          0.000000  ...      0      0       0      0       0
265367       0.90500         0.158       277059  0.0521          0.869000  ...      0      0       0      0       0

[187139 rows x 28 columns],
18499     2
121197    2
4181      2
283240    3
163835    1
         ..
159826    0
153946    3
216231    3
271592    0
265367    1
Name: popularity, Length: 187139, dtype: int64, None, message_clsname='Pipeline', message=None)
Traceback (most recent call last):
  File "C:\Users\rjstr\Anaconda3\lib\site-packages\joblib\memory.py", line 516, in _cached_call
    verbose=self._verbose)
  File "C:\Users\rjstr\Anaconda3\lib\site-packages\joblib\_store_backends.py", line 171, in load_item
    item = numpy_pickle.load(f)
  File "C:\Users\rjstr\Anaconda3\lib\site-packages\joblib\numpy_pickle.py", line 588, in load
    obj = _unpickle(fobj)
  File "C:\Users\rjstr\Anaconda3\lib\site-packages\joblib\numpy_pickle.py", line 526, in _unpickle
    obj = unpickler.load()
  File "C:\Users\rjstr\Anaconda3\lib\pickle.py", line 1085, in load
    dispatch[key[0]](self)
  File "C:\Users\rjstr\Anaconda3\lib\site-packages\joblib\numpy_pickle.py", line 352, in load_build
    self.stack.append(array_wrapper.read(self))
  File "C:\Users\rjstr\Anaconda3\lib\site-packages\joblib\numpy_pickle.py", line 195, in read
    array = self.read_array(unpickler)
  File "C:\Users\rjstr\Anaconda3\lib\site-packages\joblib\numpy_pickle.py", line 146, in read_array
    read_size, "array data")
  File "C:\Users\rjstr\Anaconda3\lib\site-packages\joblib\numpy_pickle_utils.py", line 235, in _read_bytes
    r = fp.read(size - len(data))
stopit.utils.TimeoutException
Sorry for the huge error dump. My data has 27 input features, 10 of them continuous and the rest binary. The outcome variable is ordinal encoded, with classes 0, 1, 2, and 3 and a class balance of 3: 76795, 2: 74703, 1: 70657, 0: 70251. I did enough cleaning to confirm there are no erroneous values or nulls.
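For reference, the kind of sanity checks described above can be done with plain pandas (the series below is synthetic, mirroring the 0-3 ordinal target; the real counts are the ones reported above):

```python
import pandas as pd

# Synthetic stand-in mirroring the described target: ordinal classes 0-3
y = pd.Series([3] * 5 + [2] * 4 + [1] * 4 + [0] * 3, name="popularity")

# Class balance, analogous to the counts reported above
counts = y.value_counts()
print(counts.to_dict())

# Confirm there are no nulls or out-of-range codes
assert y.isnull().sum() == 0
assert y.between(0, 3).all()
```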
Thank you for your help!