dreamquark-ai / tabnet

PyTorch implementation of TabNet paper: https://arxiv.org/pdf/1908.07442.pdf
https://dreamquark-ai.github.io/tabnet/
MIT License

Different classification variables in the test set and train set #529

Closed labxpub closed 6 months ago

labxpub commented 6 months ago

Describe the bug

```
File ~/anaconda3/envs/deepl/lib/python3.11/site-packages/pytorch_tabnet/abstract_model.py:258, in TabModel.fit(self, X_train, y_train, eval_set, eval_name, eval_metric, loss_fn, weights, max_epochs, patience, batch_size, virtual_batch_size, num_workers, drop_last, callbacks, pin_memory, from_unsupervised, warm_start, augmentations, compute_importance)
    253 for epoch_idx in range(self.max_epochs):
    254
    255     # Call method on_epoch_begin for all callbacks
    256     self._callback_container.on_epoch_begin(epoch_idx)
--> 258     self._train_epoch(train_dataloader)
    260     # Apply predict epoch to all eval sets
    261     for eval_name, valid_dataloader in zip(eval_names, valid_dataloaders):

File ~/anaconda3/envs/deepl/lib/python3.11/site-packages/pytorch_tabnet/abstract_model.py:489, in TabModel._train_epoch(self, train_loader)
    486 for batch_idx, (X, y) in enumerate(train_loader):
    487     self._callback_container.on_batch_begin(batch_idx)
--> 489     batch_logs = self._train_batch(X, y)
    491     self._callback_container.on_batch_end(batch_idx, batch_logs)
    493     epoch_logs = {"lr": self._optimizer.param_groups[-1]["lr"]}
...
   2231     # remove once script supports set_grad_enabled
   2232     _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2233 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

IndexError: index out of range in self
```

What is the current behavior?

Hi, I am using TabNetClassifier, which is a good tool, but I'm running into IndexError: index out of range in self. I browsed through the previous issues and learnt that it might be because the categorical variables in the test set contain categories that do not appear in the train set. The solution I've come up with so far is to compute cat_dims over the whole dataset, which obviously includes both train and test, but it doesn't seem to work yet. Because my dataset is relatively small, it's inevitable that some categories appear in one split and not in the other. I wonder if you have any suggestions for fixing this?
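For reference, a minimal reproduction of the same error with plain PyTorch (independent of pytorch_tabnet), assuming the embedding table was sized from the training data only and the test set contains a larger integer code:

```python
import torch
import torch.nn as nn

# Embedding sized for the 3 categories (0, 1, 2) seen during training
emb = nn.Embedding(num_embeddings=3, embedding_dim=4)

emb(torch.tensor([0, 1, 2]))  # works
emb(torch.tensor([3]))        # raises IndexError: index out of range in self
```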


Optimox commented 6 months ago

Hello,

This is not a bug. Even if you got rid of the error during training and inference by setting your embedding sizes to a large value, that would not solve your problem, only silence it.

You can't expect any model to predict something meaningful for an integer-encoded category it has never seen. You would simply generate a random representation and make predictions out of noise: garbage in, garbage out.

You need to decide in your pipeline what happens with a "new" or "unknown" category. There are plenty of options to pick from: replace any new category with the most frequent one, create a "rare values" category during training and map new categories to it, and many more. That's something you need to handle in your own pipeline; it is not taken care of by the library, because it's important to understand and keep control over this.
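For example, here is a minimal sketch of the "rare values" option (this is not something the library does for you; train_df, test_df and the column names are just placeholders):

```python
import pandas as pd

def fit_category_maps(train_df, cat_cols):
    """Build per-column mappings from category value to integer code.
    Index 0 is reserved for unknown / unseen values."""
    maps = {}
    for col in cat_cols:
        cats = sorted(train_df[col].astype(str).unique())
        maps[col] = {c: i + 1 for i, c in enumerate(cats)}  # 0 = unknown
    return maps

def encode(df, cat_cols, maps):
    """Encode categorical columns; anything not seen during training becomes 0."""
    out = df.copy()
    for col in cat_cols:
        out[col] = out[col].astype(str).map(maps[col]).fillna(0).astype(int)
    return out

cat_cols = ["city", "product"]            # hypothetical categorical columns
maps = fit_category_maps(train_df, cat_cols)
train_enc = encode(train_df, cat_cols, maps)
test_enc = encode(test_df, cat_cols, maps)  # unseen categories map to 0

# cat_dims must include the extra "unknown" slot for each column
cat_dims = [len(maps[c]) + 1 for c in cat_cols]
```

This way the embedding table is always large enough for every code the model can receive, and all unseen categories share a single learned "unknown" representation instead of producing an out-of-range index.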