Closed miteshyadav closed 6 years ago
I tested TPOTRegressor with a KFold iterator but could not reproduce this issue. Please check the test code below. Could you provide more details about this issue?

```python
from tpot import TPOTRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split, KFold

housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25)
cv = KFold(n_splits=3)
cv_iter = list(cv.split(X_train, y_train))
tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2, cv=cv_iter)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
```
I have created my own custom validation set, unlike the example you mentioned, which uses train_test_split.
```python
train_cv_df = pd.DataFrame(index=final_df.index, columns=final_df.columns)
test_cv_df = pd.DataFrame(index=final_df.index, columns=final_df.columns)
iter_cv = []
tr_temp = []
val_temp = []

for idx, row in final_df.iterrows():
    try:
        # start a new split whenever the month changes
        if final_df.at[idx, 'month'] != final_df.at[idx - 1, 'month']:
            iter_cv.append((np.array(tr_temp), np.array(val_temp)))
            tr_temp = []
            val_temp = []
    except Exception as e:
        print(e)
    # days 1-14 go to the training side, the rest to validation
    if final_df.at[idx, 'day'] in list(range(1, 15)):
        train_cv_df = train_cv_df.append(pd.DataFrame(row).T, ignore_index=True)
        tr_temp.append(idx)
    else:
        test_cv_df = test_cv_df.append(pd.DataFrame(row).T, ignore_index=True)
        val_temp.append(idx)

X_train = train_cv_df.iloc[:, train_cv_df.columns != 'count'].values
Y_train = train_cv_df.iloc[:, 6].values
X_test = test_cv_df.iloc[:, test_cv_df.columns != 'count'].values
Y_test = test_cv_df.iloc[:, 6].values
```
`iter_cv` is an iterable: a list of tuples containing the indices of the train and validation/test sets. This iterable works fine with GridSearchCV, but when passed to TPOTRegressor it throws the aforementioned error during the fit method. Also, when cv is replaced with an integer constant, the program runs fine.
Could you try to use the `iter_cv` with `cross_val_score` instead? TPOT should not change the input cv; it just passes it to `cross_val_score`. Or could you please provide a minimal demo with an example dataset here to let us reproduce this issue?
The `iter_cv` works fine with `cross_val_score`.
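For reference, a minimal, self-contained sketch of this kind of custom CV iterable with `cross_val_score` — the dataset and splits here are synthetic stand-ins, not the bike-rental data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.01 * rng.rand(100)

# A custom CV iterable is just a list of (train_indices, test_indices)
# tuples of positional indices into X.
iter_cv = [
    (np.arange(0, 60), np.arange(60, 100)),
    (np.arange(40, 100), np.arange(0, 40)),
]

scores = cross_val_score(LinearRegression(), X, y, cv=iter_cv)
print(len(scores))  # one score per custom split -> 2
```

Since the splits are handed over verbatim, any estimator-agnostic wrapper that forwards cv to `cross_val_score` should behave the same way.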
Demo example:
The `final_df` dataframe contains the pre-processed values, as shown in the figure below.
I wrote a function that creates an iterable object (`iter_cv`) by looping through the dataframe and creating train/validation splits such that the training data includes days 1-14 of each month and the validation set includes the remaining days. During the loop I also build `train_cv_df` and `test_cv_df`, dataframes that store the actual rows of each split. These will later be fed to the TPOTRegressor.
```python
for idx, row in final_df.iterrows():
    try:
        # start a new split whenever the month changes
        if final_df.at[idx, 'month'] != final_df.at[idx - 1, 'month']:
            train_cv_df = train_cv_df.append(final_df.ix[tr_temp])
            test_cv_df = test_cv_df.append(final_df.ix[val_temp])
            iter_cv.append((np.array(tr_temp), np.array(val_temp)))
            tr_temp = []
            val_temp = []
    except Exception as e:
        print(e)
    # days 1-14 go to the training side, the rest to validation
    if final_df.at[idx, 'day'] in list(range(1, 15)):
        tr_temp.append(idx)
    else:
        val_temp.append(idx)

print(iter_cv)
```
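The month/day split logic above could also be sketched more compactly with `groupby` — a hypothetical rewrite on a toy frame, not the actual `final_df`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for final_df: two months, days 1..28 each
df = pd.DataFrame({
    'month': np.repeat([1, 2], 28),
    'day': np.tile(np.arange(1, 29), 2),
})

# One (train, validation) split per month: days 1-14 train, the rest validate
iter_cv = []
for _, month_df in df.groupby('month', sort=False):
    tr = month_df.loc[month_df['day'] <= 14].index.to_numpy()
    val = month_df.loc[month_df['day'] > 14].index.to_numpy()
    iter_cv.append((tr, val))

print(len(iter_cv))        # one split per month -> 2
print(len(iter_cv[0][0]))  # 14 training rows in month 1
```

This avoids the `idx - 1` lookup and the try/except around the first row entirely.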
The train/validation sets are then converted to NumPy arrays:
```python
X_train = train_cv_df.iloc[:, train_cv_df.columns != 'count'].values
Y_train = train_cv_df.iloc[:, 6].values
X_test = test_cv_df.iloc[:, test_cv_df.columns != 'count'].values
Y_test = test_cv_df.iloc[:, 6].values
```
Finally, TPOTRegressor is used to run the AutoML search:

```python
from tpot import TPOTRegressor

tpot = TPOTRegressor(generations=10, population_size=50, verbosity=2, n_jobs=-1, cv=iter_cv)
tpot.fit(X_train, Y_train)
print(tpot.score(X_test, Y_test))
tpot.export('tpot_bike_rental.py')
```
This throws the following error:
When the cv parameter is given a numeric value instead of iter_cv, the code runs fine. I have also used the iter_cv iterable with cross_val_score and GridSearchCV, and it works fine with both of them.
Hmm, it is weird. Could you upload a tsv or csv file of `final_df` here? I don't think I can reproduce this issue with the benchmark datasets.
Couldn't export it as a .csv, so I converted it into .xlsx. PFA final_df.xlsx
I checked the code and the dataset. I found that there are many NAs in `train_cv_df` and `test_cv_df`, so I dropped those NaNs and then fed the data into TPOTRegressor. `iter_cv` has 23 splits and its largest index is 10320, but the total number of rows in X_train (after dropping the NaNs) is 7950, so I think the largest index is out of bounds. But if you use only the first 15 splits of `iter_cv` (`tpot = TPOTRegressor(generations=10, population_size=50, verbosity=3, n_jobs=-1, cv=iter_cv[:15])`), in which no index is out of bounds, then TPOT works fine.
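One way to catch this kind of mismatch before the fit call is to validate the split indices against the row count — a hypothetical helper (`check_cv_indices` is not part of TPOT or scikit-learn):

```python
import numpy as np

def check_cv_indices(iter_cv, n_rows):
    """Return only the splits whose indices all fall within n_rows.

    TPOT, like cross_val_score, indexes X positionally, so any index
    >= n_rows in a custom cv list will fail at evaluation time.
    """
    valid = []
    for i, (tr, val) in enumerate(iter_cv):
        max_idx = max(np.max(tr), np.max(val))
        if max_idx < n_rows:
            valid.append((tr, val))
        else:
            print('split %d has out-of-range index %d (n_rows=%d)'
                  % (i, max_idx, n_rows))
    return valid

# Example: the second split points past the end of a 10-row matrix
iter_cv = [(np.array([0, 1, 2]), np.array([3, 4])),
           (np.array([5, 6]), np.array([11, 12]))]
print(len(check_cv_indices(iter_cv, 10)))  # -> 1
```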
Yes, I dropped the NAs too; forgot to mention that in the code. I think I will have to check the index values, because final_df has 10777 rows in total, which does not match the 10320, and get back to you. Anyway, thank you so much for your help, appreciate it.
I have updated the code and now there is no discrepancy between the indices of `iter_cv` and those of `train_cv_df` and `test_cv_df`.
```python
train_cv_df = pd.DataFrame(columns=final_df.columns)
test_cv_df = pd.DataFrame(columns=final_df.columns)
iter_cv = []
tr_temp = []
val_temp = []

for idx, row in final_df.iterrows():
    try:
        # start a new split whenever the month changes
        if final_df.at[idx, 'month'] != final_df.at[idx - 1, 'month']:
            train_cv_df = train_cv_df.append(final_df.ix[tr_temp])
            test_cv_df = test_cv_df.append(final_df.ix[val_temp])
            iter_cv.append((np.array(tr_temp), np.array(val_temp)))
            tr_temp = []
            val_temp = []
        # flush the final partial month at the last row
        elif idx == final_df.shape[0] - 1:
            print(idx)
            if final_df.at[idx, 'day'] in list(range(1, 15)):
                tr_temp.append(idx)
            else:
                val_temp.append(idx)
            train_cv_df = train_cv_df.append(final_df.ix[tr_temp])
            test_cv_df = test_cv_df.append(final_df.ix[val_temp])
            iter_cv.append((np.array(tr_temp), np.array(val_temp)))
    except Exception as e:
        print(e)
    # days 1-14 go to the training side, the rest to validation
    if final_df.at[idx, 'day'] in list(range(1, 15)):
        tr_temp.append(idx)
    else:
        val_temp.append(idx)
```
After deleting null values, X_train has 7950 rows and Y_test has 2827 rows, which together add up to 10777 — the maximum index in iter_cv and the row count of final_df. The problem still persists. I am pretty sure there is no mismatch in the index values.
In your code `tpot.fit(X_train, Y_train)`, an X_train with 7950 rows was fed to TPOT, but the maximum index in `iter_cv` is 10777. So I still believe it is an index mismatch issue. I think fitting on the whole `final_df` with `iter_cv` will work.
The number of rows in X_train is 7950, but the index values are different and fall within the maximum range of `iter_cv`, which is 10777. Also, the maximum value of `iter_cv`, i.e. 10777, is correct because that is the size of the `final_df` dataframe.
If you take the Boston dataset example you provided, the numbers of rows in X_train and X_test are 379 and 127 respectively, and they cover a subset of index values based on the split. Also, the train/validation index lists are shorter than these values because 75% has been allocated to train and the remainder to test.
OK, I understand the issue now. I think reindexing the `train_cv_df` dataframe and `iter_cv` may help, since so far `tpot.fit()` does not fully support pandas dataframe indices as input for cv. We will add this support later.
Also, since your cv is specified in `iter_cv` and all indices in the list refer to the training subset of `final_df`, you can use `tpot.fit(final_df.iloc[:, final_df.columns != 'count'].values, final_df.iloc[:, 6].values)` instead, so that CV in TPOT still uses only training-set samples for pipeline evaluation. But after `tpot.fit()`, you need to refit `tpot.fitted_pipeline_` on your training set with `tpot.fitted_pipeline_.fit(X_train, Y_train)`.
That is a good idea. I tried it and it works fine. However, I have one concern: wouldn't it be fallacious to choose the regressor by training it on the whole dataset rather than dividing it into a training part and a validation part? Or does TPOT evaluate the best regressor based on the train/validation splits given by iter_cv?
Yes, TPOT evaluates the best regressor based on the train/validation splits given by `iter_cv`.
But using the `.fit()` function we are training on the whole dataset, including the validation set. Wouldn't that be fallacious?
The alternative way is just a workaround for these indices in your case. The best way is to reindex the training set to 0-7949 in this case.
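A sketch of that reindexing, under the assumption that `iter_cv` holds the original `final_df` labels and `dropna()` left gaps in the index — the toy frame and the `orig_idx` column here are illustrative, not from the actual code:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training set: dropna() left gaps in the index
df = pd.DataFrame({'day': [1, 2, 16, 1, 2, 16]},
                  index=[0, 2, 5, 7, 9, 10])

# Reset to a clean positional 0..n-1 index, remembering the old labels
df = df.reset_index().rename(columns={'index': 'orig_idx'})

# Remap the old labels inside iter_cv to the new positional indices
orig_to_pos = {orig: pos for pos, orig in enumerate(df['orig_idx'])}
iter_cv_labels = [(np.array([0, 2, 7]), np.array([5, 10]))]  # old-label splits
iter_cv = [(np.array([orig_to_pos[i] for i in tr]),
            np.array([orig_to_pos[i] for i in val]))
           for tr, val in iter_cv_labels]

print(iter_cv[0][0].tolist(), iter_cv[0][1].tolist())  # [0, 1, 3] [2, 5]
```

After this remap, every index in `iter_cv` is a valid row position in the array passed to `fit`, which is what positional CV indexing expects.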
Thank you once again for bearing with me; appreciate it. I am closing this issue now.
No problem. I think it is a good issue that we need to fix in an updated version of TPOT.
Cross validation parameter bug in TPOTRegressor using iterable object while creating a customized validation set
Context of the issue
When using the iterable object, the fit function throws an error without specifying its details. When I use the same iterable object with the GridSearchCV method, it works fine. Also, when using a CV iterable, the train/validation splitting shouldn't be passed explicitly by the user, because it has already been specified in the iterable object, and the algorithm should generate the scores based on the iterable's values. The error suggests it arose from a data formatting problem. However, if I change cv to an integer and run it again, it works fine.
Process to reproduce the issue
Run the TPOTRegressor fit function with an iterable object as the cv parameter.