dreamquark-ai / tabnet

PyTorch implementation of the TabNet paper: https://arxiv.org/pdf/1908.07442.pdf
https://dreamquark-ai.github.io/tabnet/
MIT License

RandomizedSearchCV with RepeatedStratifiedKFold #308

Closed FernandoAMDM closed 3 years ago

FernandoAMDM commented 3 years ago

Describe the bug

When I try to run TabNetClassifier with RandomizedSearchCV (from scikit-learn), I get the following error:

"The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()"

Is there any way to run repeated stratified cross-validation to search for hyperparameters, or do I have to create a separate validation set (which is not ideal in my case)?


```python
import torch
from pytorch_tabnet.tab_model import TabNetClassifier
from scipy.stats import randint, uniform, loguniform
from sklearn.model_selection import RepeatedStratifiedKFold, RandomizedSearchCV
from sklearn.metrics import roc_auc_score, f1_score, accuracy_score

# optimizer_fn expects the optimizer classes themselves (callables), not strings
tabnet_params = {
    "n_steps": randint(3, 10),
    "gamma": uniform(1, 2),
    "n_independent": randint(1, 5),
    "n_shared": randint(1, 5),
    "momentum": loguniform(0.01, 0.4),
    "optimizer_fn": [torch.optim.Adam, torch.optim.Adadelta, torch.optim.RMSprop, torch.optim.LBFGS],
}

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3)

tabnet = RandomizedSearchCV(
    TabNetClassifier(seed=20, verbose=0),
    tabnet_params,
    cv=cv,
    scoring="f1",
    verbose=10,
    n_iter=100,
    n_jobs=3,
)
tabnet.fit(x_train, y_train, patience=30)
```


Error message:


```
ValueError                                Traceback (most recent call last)
<ipython-input> in <module>
      1 tabnet = RandomizedSearchCV(TabNetClassifier(seed=20, verbose=0), tabnet_params, cv=cv, scoring='f1', verbose=10, n_iter=100, n_jobs=3)
----> 2 tabnet.fit(x_train, y_train, patience=30)

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     61         extra_args = len(args) - len(all_args)
     62         if extra_args <= 0:
---> 63             return f(*args, **kwargs)
     64
     65         # extra_args > 0

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
    878             refit_start_time = time.time()
    879             if y is not None:
--> 880                 self.best_estimator_.fit(X, y, **fit_params)
    881             else:
    882                 self.best_estimator_.fit(X, **fit_params)

C:\ProgramData\Anaconda3\lib\site-packages\pytorch_tabnet\abstract_model.py in fit(self, X_train, y_train, eval_set, eval_name, eval_metric, loss_fn, weights, max_epochs, patience, batch_size, virtual_batch_size, num_workers, drop_last, callbacks, pin_memory, from_unsupervised)
    183         check_array(X_train)
    184
--> 185         self.update_fit_params(
    186             X_train,
    187             y_train,

C:\ProgramData\Anaconda3\lib\site-packages\pytorch_tabnet\tab_model.py in update_fit_params(self, X_train, y_train, eval_set, weights)
     50         weights,
     51     ):
---> 52         output_dim, train_labels = infer_output_dim(y_train)
     53         for X, y in eval_set:
     54             check_output_dim(train_labels, y)

C:\ProgramData\Anaconda3\lib\site-packages\pytorch_tabnet\multiclass_utils.py in infer_output_dim(y_train)
    370     Sorted list of initial classes
    371     """
--> 372     check_unique_type(y_train)
    373     train_labels = unique_labels(y_train)
    374     output_dim = len(train_labels)

C:\ProgramData\Anaconda3\lib\site-packages\pytorch_tabnet\multiclass_utils.py in check_unique_type(y)
    347
    348 def check_unique_type(y):
--> 349     target_types = pd.Series(y).map(type).unique()
    350     if len(target_types) != 1:
    351         raise TypeError(

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    266         name = ibase.maybe_extract_name(name, data, type(self))
    267
--> 268         if is_empty_data(data) and dtype is None:
    269             # gh-17261
    270             warnings.warn(

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\construction.py in is_empty_data(data)
    626     is_none = data is None
    627     is_list_like_without_dtype = is_list_like(data) and not hasattr(data, "dtype")
--> 628     is_simple_empty = is_list_like_without_dtype and not data
    629     return is_none or is_simple_empty
    630

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
   1440     @final
   1441     def __nonzero__(self):
-> 1442         raise ValueError(
   1443             f"The truth value of a {type(self).__name__} is ambiguous. "
   1444             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
```
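The failing frames (check_unique_type calling pd.Series(y).map(type)) suggest that y_train is being passed as a pandas DataFrame, while pytorch-tabnet expects plain numpy arrays. A minimal sketch of one possible workaround, assuming x_train and y_train are pandas objects (not confirmed above), is to hand the underlying arrays to fit:

```python
# Sketch only: assumes x_train / y_train are a pandas DataFrame / Series.
X = x_train.values            # 2-D numpy array of features
y = y_train.values.ravel()    # 1-D numpy array of class labels

tabnet.fit(X, y, patience=30)
```

Whether this is enough to make the whole RandomizedSearchCV run work depends on the rest of the pipeline, so treat it as a starting point rather than a confirmed fix.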
Optimox commented 3 years ago

I think your issue is actually a duplicate of this one #80.

Please read that issue and feel free to reopen this one if it is not a duplicate.

FernandoAMDM commented 3 years ago

Thanks @Optimox, I had read #80 before opening this issue. I could make Hartorn's method work with a single fold, but not with multiple folds or with repeated splits. His k-fold CV happens after the hyperparameters have been chosen, whereas I want to use it to choose them. So I would like the issue to be reopened, in case anyone has tips about this problem.
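One way to use repeated stratified CV for the hyperparameter search itself, without going through RandomizedSearchCV, is to sample parameter sets with ParameterSampler and score each one across the folds by hand. A rough sketch, assuming X and y are already numpy arrays, the target is binary (so f1_score applies directly), and the ranges mirror the ones above; the n_iter of 20 and the shortened optimizer list are just illustrative choices:

```python
import numpy as np
import torch
from pytorch_tabnet.tab_model import TabNetClassifier
from scipy.stats import randint, uniform, loguniform
from sklearn.model_selection import RepeatedStratifiedKFold, ParameterSampler
from sklearn.metrics import f1_score

param_distributions = {
    "n_steps": randint(3, 10),
    "gamma": uniform(1, 2),
    "n_independent": randint(1, 5),
    "n_shared": randint(1, 5),
    "momentum": loguniform(0.01, 0.4),
    "optimizer_fn": [torch.optim.Adam, torch.optim.Adadelta, torch.optim.RMSprop],
}

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=20)
sampler = ParameterSampler(param_distributions, n_iter=20, random_state=20)

best_score, best_params = -np.inf, None
for params in sampler:
    fold_scores = []
    for train_idx, valid_idx in cv.split(X, y):
        # one TabNet fit per fold, using the held-out fold for early stopping
        clf = TabNetClassifier(seed=20, verbose=0, **params)
        clf.fit(
            X[train_idx], y[train_idx],
            eval_set=[(X[valid_idx], y[valid_idx])],
            patience=30,
        )
        preds = clf.predict(X[valid_idx])
        fold_scores.append(f1_score(y[valid_idx], preds))
    mean_score = np.mean(fold_scores)
    if mean_score > best_score:
        best_score, best_params = mean_score, params

print(best_score, best_params)
```

The outer loop over sampled parameter sets is embarrassingly parallel, so it can be split across processes if the individual fits are too slow.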

Optimox commented 3 years ago

Hmm sure I'll leave it open.

I'm not sure I understand why it is not related, but someone else might come along to help!