EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0
9.65k stars · 1.56k forks

Cross validation parameter bug in TPOTRegressor using iterable object #647

Closed. miteshyadav closed this issue 6 years ago.

miteshyadav commented 6 years ago

Cross validation parameter bug in TPOTRegressor using iterable object while creating a customized validation set

Context of the issue

When using an iterable cv object, the fit function throws an error without specifying any details. The same iterable object works fine with GridSearchCV. Also, when a cv iterable is used, the train/validation split should not have to be passed explicitly by the user, because it is already specified in the iterable object, and the algorithm should generate scores based on its values. The error appears to arise from a data formatting problem; however, if I replace the cv iterable with an integer and run it again, it works fine.

Process to reproduce the issue

  1. User creates TPOT instance
  2. User calls TPOT TPOTRegressor function with iterable object
  3. TPOT crashes with an error

[screenshot of the error traceback]

weixuanfu commented 6 years ago

I tested TPOTRegressor with a KFold iterator but could not reproduce this issue. Please check the test code below. Could you please provide more details about this issue?

from tpot import TPOTRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split, KFold

housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25)
cv = KFold(n_splits=3)
cv_iter = list(cv.split(X_train, y_train))
tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2, cv=cv_iter)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
miteshyadav commented 6 years ago

I have created my own custom validation set, unlike your example, which uses train_test_split.

import numpy as np
import pandas as pd

# empty frames to hold the actual train/validation rows, plus the cv index list
train_cv_df = pd.DataFrame(index=final_df.index, columns=final_df.columns)
test_cv_df = pd.DataFrame(index=final_df.index, columns=final_df.columns)
iter_cv = []

tr_temp = []
val_temp = []

for idx, row in final_df.iterrows():
    print(pd.DataFrame(row).T)

    # close the current split whenever the month changes
    try:
        if final_df.at[idx, 'month'] != final_df.at[idx - 1, 'month']:
            iter_cv.append((np.array(tr_temp), np.array(val_temp)))
            tr_temp = []
            val_temp = []
    except Exception as e:
        print(e)

    # days 1-14 go to the training set, the remaining days to validation
    if final_df.at[idx, 'day'] in list(range(1, 15)):
        train_cv_df = train_cv_df.append(pd.DataFrame(row).T, ignore_index=True)
        tr_temp.append(idx)
    else:
        test_cv_df = test_cv_df.append(pd.DataFrame(row).T, ignore_index=True)
        val_temp.append(idx)

X_train = train_cv_df.iloc[:, train_cv_df.columns != 'count'].values
Y_train = train_cv_df.iloc[:, 6].values
X_test = test_cv_df.iloc[:, test_cv_df.columns != 'count'].values
Y_test = test_cv_df.iloc[:, 6].values

iter_cv is an iterable: a list of tuples containing the indices of the train and validation/test sets. This iterable works fine with GridSearchCV, but when used with TPOTRegressor it throws the aforementioned error while running the fit method. Also, when cv is replaced with a numeric constant, the program runs fine.
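To illustrate the contract being relied on here, a minimal self-contained sketch of passing such a list of (train_indices, validation_indices) tuples to GridSearchCV. The synthetic dataset and Ridge estimator are stand-ins, not the bike-rental data from this thread:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=60, n_features=4, random_state=0)

# an iterable cv: a list of (train_indices, validation_indices) tuples,
# analogous to iter_cv in this thread
iter_cv = [
    (np.arange(0, 40), np.arange(40, 60)),
    (np.arange(20, 60), np.arange(0, 20)),
]

gs = GridSearchCV(Ridge(), {'alpha': [0.1, 1.0]}, cv=iter_cv)
gs.fit(X, y)
```

Any estimator-selection API that accepts a cv iterable of this shape should behave the same way, which is why the error only appearing in TPOT is surprising.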

weixuanfu commented 6 years ago

Could you try using iter_cv with cross_val_score instead?

TPOT should not change the input cv; it just passes it to cross_val_score.

Or could you please provide a minimal demo with an example dataset here so we can reproduce this issue?
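A minimal sketch of the suggested check, using a synthetic dataset (the KFold splits and LinearRegression here are illustrative stand-ins, not the data from this thread):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=90, n_features=5, random_state=0)

# materialize the splits into a list of (train, validation) index tuples
iter_cv = list(KFold(n_splits=3).split(X, y))

# cross_val_score accepts the same iterable cv that TPOT is given,
# returning one score per split
scores = cross_val_score(LinearRegression(), X, y, cv=iter_cv)
```

If this runs cleanly but TPOT fails on the same iter_cv, the problem lies in how the indices relate to the arrays passed to tpot.fit().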

miteshyadav commented 6 years ago

iter_cv works fine with cross_val_score. [screenshot]

Demo example:

The 'final_df' dataframe contains the pre-processed values, as shown in the figure below. [screenshot of final_df]

I have written a function that creates an iterable object (iter_cv) by looping through the dataframe and creating train/test splits such that the training data includes all days 1-14 and the validation set includes the remaining days. During the loop I also create train_cv_df and test_cv_df, dataframes that store the actual values of the split. These will later be fed to the TPOTRegressor.

for idx, row in final_df.iterrows():
    print(pd.DataFrame(row).T)

    # when the month changes, store the accumulated rows and indices as a split
    try:
        if final_df.at[idx, 'month'] != final_df.at[idx - 1, 'month']:
            train_cv_df = train_cv_df.append(final_df.ix[tr_temp])
            test_cv_df = test_cv_df.append(final_df.ix[val_temp])
            iter_cv.append((np.array(tr_temp), np.array(val_temp)))
            tr_temp = []
            val_temp = []
    except Exception as e:
        print(e)

    # days 1-14 go to training, the rest to validation
    if final_df.at[idx, 'day'] in list(range(1, 15)):
        tr_temp.append(idx)
    else:
        val_temp.append(idx)

print(iter_cv)

The train/validation sets are then converted to numeric arrays:

X_train = train_cv_df.iloc[:, train_cv_df.columns != 'count'].values
Y_train = train_cv_df.iloc[:, 6].values
X_test = test_cv_df.iloc[:, test_cv_df.columns != 'count'].values
Y_test = test_cv_df.iloc[:, 6].values

Finally, TPOTRegressor is used to run the AutoML search:

from tpot import TPOTRegressor

tpot = TPOTRegressor(generations=10, population_size=50, verbosity=2, n_jobs=-1, cv=iter_cv)
print('aaaaaaaaaaaaaaaaa')
tpot.fit(X_train, Y_train)
print('bbbbbbbbbbbbbbbb')
print(tpot.score(X_test, Y_test))
print('ccccccccccccccccccc')
tpot.export('tpot_bike_rental.py')

This throws the following error: [screenshot of the traceback]

When the cv parameter is given a numerical value instead of iter_cv, the code runs fine. I have also used the iter_cv iterable with cross_val_score and GridSearchCV, and it works fine with both.

weixuanfu commented 6 years ago

Hmm, that is weird. Could you upload a tsv or csv file of final_df here? I don't think I can reproduce this issue with standard benchmark datasets.

miteshyadav commented 6 years ago

Couldn't export it as a .csv, so I converted it to .xlsx. PFA final_df.xlsx

weixuanfu commented 6 years ago

I checked the code and the dataset. There are many NAs in train_cv_df and test_cv_df, so I dropped them before passing the data to TPOTRegressor. iter_cv has 23 splits and its largest index is 10320, but the total row count of train_X (after dropping the NaNs) is 7950, so I think the largest index is out of bounds. If you use only the first 15 splits of iter_cv, in which no index is out of bounds (tpot = TPOTRegressor(generations=10, population_size=50, verbosity=3, n_jobs=-1, cv=iter_cv[:15])), then TPOT works fine.
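This diagnosis can be sketched with a small hypothetical helper (splits_within_bounds is not part of TPOT or sklearn) that filters out any split whose indices exceed the number of rows actually fed to fit():

```python
import numpy as np

def splits_within_bounds(cv_iter, n_samples):
    """Keep only the (train, validation) splits whose indices all fit the data."""
    return [
        (tr, va) for tr, va in cv_iter
        if max(tr.max(), va.max()) < n_samples
    ]

# illustrative splits: the second one references an index beyond a 7950-row X_train
iter_cv = [
    (np.array([0, 1, 2]), np.array([3, 4])),
    (np.array([0, 1]), np.array([10320])),
]
valid = splits_within_bounds(iter_cv, 7950)
```

Running this against the real iter_cv would show which of the 23 splits survive after the NaN rows are dropped.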

miteshyadav commented 6 years ago

Yes, I dropped the NAs too; I forgot to mention that in the code. I will have to check the index values, because final_df has 10777 rows in total, which does not match 10320, and get back to you. Anyway, thank you so much for your help; I appreciate it.

miteshyadav commented 6 years ago

I have updated the code, and now there is no discrepancy between the indices of iter_cv and those of train_cv_df and test_cv_df.

train_cv_df = pd.DataFrame(columns=final_df.columns)
test_cv_df = pd.DataFrame(columns=final_df.columns)
iter_cv = []

tr_temp = []
val_temp = []

for idx, row in final_df.iterrows():
    print(pd.DataFrame(row).T)

    try:
        # when the month changes, store the accumulated rows and indices as a split
        if final_df.at[idx, 'month'] != final_df.at[idx - 1, 'month']:
            train_cv_df = train_cv_df.append(final_df.ix[tr_temp])
            test_cv_df = test_cv_df.append(final_df.ix[val_temp])
            iter_cv.append((np.array(tr_temp), np.array(val_temp)))
            tr_temp = []
            val_temp = []
        # handle the final row so the last partial month is not dropped
        elif idx == final_df.shape[0] - 1:
            print(idx)
            if final_df.at[idx, 'day'] in list(range(1, 15)):
                tr_temp.append(idx)
            else:
                val_temp.append(idx)
            train_cv_df = train_cv_df.append(final_df.ix[tr_temp])
            test_cv_df = test_cv_df.append(final_df.ix[val_temp])
            iter_cv.append((np.array(tr_temp), np.array(val_temp)))
    except Exception as e:
        print(e)

    # days 1-14 go to training, the rest to validation
    if final_df.at[idx, 'day'] in list(range(1, 15)):
        tr_temp.append(idx)
    else:
        val_temp.append(idx)

After deleting null values, X_train comes to 7950 rows and Y_test to 2827 rows, which together add up to 10777, the maximum index in iter_cv and the row count of final_df. The problem still persists. I am pretty sure there is no mismatch in the index values.

weixuanfu commented 6 years ago

In your code tpot.fit(X_train, Y_train), X_train with 7950 rows was fed to TPOT, but the maximum index in iter_cv is 10777. So I still think it is an index mismatch issue. I think fitting on the whole final_df with iter_cv will work.

miteshyadav commented 6 years ago

The number of rows in X_train is 7950, but its index values are different, and they fall within the maximum range of iter_cv, which is 10777. Also, the maximum value of iter_cv, i.e. 10777, is correct because that is the size of the final_df dataframe. [screenshot]

Taking the example of the Boston dataset you provided: the numbers of rows in X_train and X_test are 379 and 127 respectively, and they encompass subsets of the index values based on the split. Also, the train/validation index arrays are shorter than these row counts because 75% was allocated to train and the remainder to test. [screenshot]

weixuanfu commented 6 years ago

OK, I understand the issue now. I think reindexing the train_cv_df pandas dataframe and iter_cv may help, since so far tpot.fit() does not fully support pandas dataframe indexes as input for cv. We will add this support later.

Also, since your cv is specified in iter_cv and all the indexes in the list match the training subset drawn from final_df, you can use tpot.fit(final_df.iloc[:, final_df.columns != 'count'].values, final_df.iloc[:, 6].values) instead, so that CV in TPOT still uses only training-set samples for pipeline evaluation. After tpot.fit(), however, you need to refit tpot.fitted_pipeline_ on your training set with tpot.fitted_pipeline_.fit(X_train, Y_train).

miteshyadav commented 6 years ago

That is a good idea. I tried it and it works fine. However, I have one concern: wouldn't it be fallacious to choose the regressor by training it on the whole dataset rather than splitting it into a train and a validation set? Or does TPOT evaluate the best regressor based on the train/test splits given by iter_cv?

weixuanfu commented 6 years ago

Yes, TPOT evaluates the best regressor based on the train/test splits given by iter_cv.

miteshyadav commented 6 years ago

But using the .fit() function we are training on the whole dataset, including the validation set. Wouldn't that be fallacious?

weixuanfu commented 6 years ago

The alternative way is just a workaround for the indexes in your case. The best way is to reindex the training set to 0-7949.
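A hedged sketch of that reindexing step (reindex_splits is a hypothetical helper, not a TPOT or pandas API): map the surviving dataframe index labels to 0-based positions, so that every split index falls within the array passed to fit():

```python
import numpy as np

def reindex_splits(iter_cv, index_labels):
    """Map arbitrary index labels in each split to 0-based positional indices."""
    pos = {label: i for i, label in enumerate(index_labels)}
    return [
        (np.array([pos[l] for l in tr]), np.array([pos[l] for l in va]))
        for tr, va in iter_cv
    ]

# illustrative labels, e.g. the dataframe index that survived dropna()
labels = [3, 7, 12, 20]
iter_cv = [(np.array([3, 7]), np.array([12, 20]))]
fixed = reindex_splits(iter_cv, labels)
```

After this remapping, the largest index in each split equals at most len(labels) - 1, so it can never exceed the row count of the arrays given to tpot.fit().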

miteshyadav commented 6 years ago

Thank you once again for bearing with me; I appreciate it. I am closing this issue now.

weixuanfu commented 6 years ago

No problem. I think this is a good issue that we need to fix in an updated version of TPOT.