EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

The CV splitting leads to ValueError #1128

Closed Zeroto521 closed 4 years ago

Zeroto521 commented 4 years ago

Context of the issue & Process to reproduce the issue

My target y is a pd.Series of shape (n,).

Its value_counts() output is shown below.

5.0     319
6.0     266
4.0     252
7.0     217
3.0     210
2.0     159
8.0     127
1.0     106
9.0      66
10.0     16
11.0      1
12.0      1
Name: Label, dtype: int64

Then I called tpot.fit:

from tpot import TPOTClassifier

tpot = TPOTClassifier()
tpot.fit(X, y)

This raised the following error:

Traceback (most recent call last):

  File "C:\Users\admin\Documents\XX\scripts\model.py", line 31, in <module>
    tpot.fit(X, y)

  File "C:\Users\admin\miniconda3\envs\data\lib\site-packages\tpot\base.py", line 645, in fit
    self._init_pretest(features, target)

  File "C:\Users\admin\miniconda3\envs\data\lib\site-packages\tpot\tpot.py", line 59, in _init_pretest
    stratify=target

  File "C:\Users\admin\miniconda3\envs\data\lib\site-packages\sklearn\model_selection\_split.py", line 2152, in train_test_split
    train, test = next(cv.split(X=arrays[0], y=stratify))

  File "C:\Users\admin\miniconda3\envs\data\lib\site-packages\sklearn\model_selection\_split.py", line 1341, in split
    for train, test in self._iter_indices(X, y, groups):

  File "C:\Users\admin\miniconda3\envs\data\lib\site-packages\sklearn\model_selection\_split.py", line 1668, in _iter_indices
    raise ValueError("The least populated class in y has only 1"

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
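The same error can be reproduced without TPOT by calling scikit-learn's train_test_split with stratify on labels that contain a single-member class. The snippet below is a minimal sketch under that assumption: the labels are synthetic (not the data from this report), train_size=50 is an arbitrary choice, and stratified_split_error is a hypothetical helper name.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def stratified_split_error(y, train_size=50):
    """Return the ValueError message from a stratified split, or None if it succeeds."""
    X = np.zeros((len(y), 1))  # dummy features; only the labels matter here
    try:
        train_test_split(X, y, train_size=train_size, stratify=y, random_state=0)
    except ValueError as exc:
        return str(exc)
    return None


# Synthetic labels: class 2 has a single member, like 11.0 and 12.0 above.
y_rare = np.array([0] * 60 + [1] * 39 + [2])
# Every class has plenty of members, so stratification is possible.
y_ok = np.array([0] * 50 + [1] * 50)

message_rare = stratified_split_error(y_rare)
message_ok = stratified_split_error(y_ok)
print(message_rare)
print(message_ok)
```

The first call should report the "least populated class" error, while the second should return None.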

Possible fix

I looked into the source code and found a possible fix.

https://github.com/EpistasisLab/tpot/blob/219f8c5abe43996abb2c19d6a1767083304a23d3/tpot/tpot.py#L53-L60

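The stratify=target argument appears to be the cause.

From the value_counts output above, classes 11.0 and 12.0 each appear only once, and a stratified split needs at least two members in every class.

After deleting or commenting out that parameter, fit runs without error.

TPOT should handle this case.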
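One way to handle this case on the caller's side, as an alternative to dropping the parameter entirely, is to stratify only when every class has at least two members. The following is a rough sketch and not TPOT's code: stratify_or_none is a hypothetical helper, and the label list is rebuilt from a few of the counts shown above.

```python
from collections import Counter


def stratify_or_none(target, min_members=2):
    """Return target if every class has at least min_members samples, else None."""
    counts = Counter(target)
    if counts and min(counts.values()) >= min_members:
        return target
    return None


# A subset of the counts from the report: 11.0 and 12.0 appear once each.
labels = [5.0] * 319 + [10.0] * 16 + [11.0] + [12.0]
# Same labels without the two single-member classes.
labels_without_rare = [5.0] * 319 + [10.0] * 16

print(stratify_or_none(labels) is None)               # single-member classes present
print(stratify_or_none(labels_without_rare) is None)  # all classes have >= 2 members
```

The return value can be passed as stratify= to train_test_split, where None means the split is not stratified.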

weixuanfu commented 4 years ago

Thank you for reporting this issue.

Since there is already code (see here) that ensures the sample contains at least one example from each class, I think it is safe to delete that parameter to avoid this issue.
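For readers who do not follow the link, a safeguard of this kind might look roughly like the sketch below. It is only an illustration of the idea and is not copied from TPOT: ensure_all_classes and its arguments are hypothetical names, and it assumes the sample has at least as many rows as there are classes.

```python
import numpy as np


def ensure_all_classes(sample_X, sample_y, X, y):
    """Return copies of the sample that contain at least one row of every class in y."""
    X, y = np.asarray(X), np.asarray(y)
    sample_X, sample_y = np.array(sample_X), np.array(sample_y)  # np.array copies
    classes, first_idx = np.unique(y, return_index=True)
    if not np.array_equal(classes, np.unique(sample_y)):
        k = len(first_idx)
        if k > len(sample_y):
            raise ValueError("sample is smaller than the number of classes")
        # Overwrite the first k rows with the first occurrence of each class.
        sample_X[:k] = X[first_idx]
        sample_y[:k] = y[first_idx]
    return sample_X, sample_y


X_full = np.arange(6).reshape(-1, 1)
y_full = np.array([0, 0, 0, 1, 1, 2])

# A non-stratified sample that happened to pick only class 0.
fixed_X, fixed_y = ensure_all_classes(X_full[:3], y_full[:3], X_full, y_full)
print(fixed_y)
```

With a step like this after the split, the sample still covers every class even when the split itself is not stratified.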

weixuanfu commented 4 years ago

I fixed it via PR #1129, which will be merged into the development branch soon. The fix will be included in the next release of TPOT later this month.

To test the development branch, you can install the patched TPOT into your environment via:

pip install --upgrade --no-deps --force-reinstall git+https://github.com/EpistasisLab/tpot.git@development

Zeroto521 commented 4 years ago

I will try this later.

Zeroto521 commented 4 years ago

It has been working without problems over the past few days.