TabNet overfits (help wanted, not a bug)

dreamquark-ai / tabnet

PyTorch implementation of TabNet paper : https://arxiv.org/pdf/1908.07442.pdf

https://dreamquark-ai.github.io/tabnet/

MIT License


TabNet overfits (help wanted, not a bug) #522

Closed micheusch closed 9 months ago

micheusch commented 11 months ago

Model overfits severely, feature importance limited to fewer than 10 features

What is the current behavior? I'm solving a binary classification problem on a dozen rolling-window monthly snapshots. My dataset has 70k rows and 100+ features. With Random Forest or Gradient Boosting, feature importance is spread over a large number of features and stays consistent: the boxplot of feature importance shows a limited range of variability. With TabNet, each monthly model assigns non-zero importance to only a small number of features (often fewer than 10), and those features vary wildly from month to month, which I assume comes down to overfitting.

I tried a few options to reduce what could lead to over-fitting:

set n_d=n_a=8
also tried n_steps=2
augmentations using ClassificationSMOTE

but none of these seemed to help.

Expected behavior: the features used are more consistent from month to month, and a larger number of features is considered. Is there anything obvious I may have done wrong that would explain this behaviour? Many thanks!

Optimox commented 11 months ago

There is not enough information to conclude that you are overfitting: what are your train and validation scores? The fact that the features change from month to month simply shows that your data shifts over time, not that it is unreasonable to rely on different features.

You can also set lambda_sparse to 0 to limit sparsity.

You can also limit number of epochs to avoid overfitting.
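
The train-versus-validation comparison asked for above can be sketched in a few lines. This is a minimal illustration, not part of pytorch-tabnet: the function name `generalization_gap` and the AUC histories are hypothetical, and it assumes one score per epoch where higher is better.

```python
def generalization_gap(train_scores, valid_scores):
    """Return (best_epoch, train_score, valid_score, gap) at the best validation epoch.

    Assumes one score per epoch and that higher is better (e.g. AUC).
    """
    if not train_scores or len(train_scores) != len(valid_scores):
        raise ValueError("need two non-empty histories of equal length")
    # Epoch (0-based) with the highest validation score; ties go to the earliest.
    best_epoch = max(range(len(valid_scores)), key=valid_scores.__getitem__)
    train = train_scores[best_epoch]
    valid = valid_scores[best_epoch]
    return best_epoch, train, valid, train - valid


# Hypothetical per-epoch AUC histories, made up for illustration only.
train_auc = [0.70, 0.78, 0.85, 0.91, 0.95]
valid_auc = [0.69, 0.75, 0.79, 0.80, 0.78]

epoch, train, valid, gap = generalization_gap(train_auc, valid_auc)
print(f"best epoch={epoch}, train={train:.2f}, valid={valid:.2f}, gap={gap:.2f}")
```

A large gap at the best validation epoch, or a validation score that degrades while the train score keeps improving, would point towards overfitting; a small gap would not.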

micheusch commented 11 months ago

thanks, I will investigate further and let you know!

micheusch commented 11 months ago

Fair enough, after plotting the loss curves it doesn't look like overfitting per se. See below for n_d=n_a=8, n_steps=2, lambda_sparse=e-6

The model is retrained monthly, and while there may be data shifts over time, they shouldn't be too dramatic, as only 2% of the time period changes from one month to the next.

There could be variations from the random customer base used for the validation set, but again this is only 10% of the data. It's not clear why this would be so much more erratic for TabNet than for XGBoost. The boxplot below shows feature importance variation across 12 consecutive models.

Looking at a few of the most unstable features, feature importance changes look a bit erratic.

Compare this to the same boxplot for XGBoost, where features are picked up much more consistently.

Any thoughts on what could be causing these variations? Many thanks!
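
One way to put a number on the month-to-month variation described above, rather than reading it off boxplots, is a per-feature summary across the monthly models. The sketch below is an assumption about how such a comparison could be done: the function name `importance_stability`, the feature names, and the importance values are all made up for illustration.

```python
from statistics import mean, pstdev


def importance_stability(monthly_importances):
    """Summarise how stable each feature's importance is across models.

    monthly_importances: list of dicts {feature_name: importance}, one per month.
    Returns {feature_name: (mean, coefficient_of_variation, share_of_months_nonzero)}.
    """
    features = sorted({name for month in monthly_importances for name in month})
    summary = {}
    for name in features:
        # A feature missing from a month is treated as having zero importance.
        values = [month.get(name, 0.0) for month in monthly_importances]
        avg = mean(values)
        cv = pstdev(values) / avg if avg > 0 else float("nan")
        nonzero_share = sum(v > 0 for v in values) / len(values)
        summary[name] = (avg, cv, nonzero_share)
    return summary


# Made-up importances for four monthly models, for illustration only.
months = [
    {"f1": 0.5, "f2": 0.5, "f3": 0.0},
    {"f1": 0.5, "f2": 0.0, "f3": 0.5},
    {"f1": 0.5, "f2": 0.5, "f3": 0.0},
    {"f1": 0.5, "f2": 0.0, "f3": 0.5},
]

summary = importance_stability(months)
for name, (avg, cv, share) in summary.items():
    print(f"{name}: mean={avg:.2f}, cv={cv:.2f}, nonzero in {share:.0%} of months")
```

Running the same summary on the TabNet and XGBoost importances would make the two sets of models directly comparable, feature by feature.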

Optimox commented 11 months ago

Are you computing feature importance on the training set or the monthly validation set?

Do you have better predictive scores with XGBoost or TabNet?

Are you sure that all your TabNet models converge correctly before the end of training?

micheusch commented 11 months ago

Hi again @Optimox,

Feature importances are computed on the training set, as per here.

I have slightly better performance with XGBoost on most months.

I had max_epochs=100, patience=60. Now that I've increased it to max_epochs=200, I have a mix of runs ending between 120 and 200 epochs, with a few getting all the way to 200, although the loss trajectory looks very flat.
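
Whether a loss trajectory is "very flat" can be checked numerically instead of by eye. The sketch below is only one possible convention: the helper `relative_improvement`, the 20-epoch window, and both loss histories are hypothetical, and it assumes positive loss values where lower is better.

```python
def relative_improvement(losses, window=20):
    """Relative drop in the best loss achieved during the last `window` epochs.

    Values near 0 mean the run had plateaued before the final window;
    larger values mean the model was still improving when training stopped.
    """
    if len(losses) <= window:
        raise ValueError("history must be longer than the window")
    best_before = min(losses[:-window])  # best loss before the final window
    best_overall = min(losses)           # best loss over the whole run
    return (best_before - best_overall) / abs(best_before)


# Two synthetic 200-epoch validation-loss histories, for illustration only.
plateaued = [0.5 + 0.5 * 0.9 ** epoch for epoch in range(200)]
still_improving = [1.0 - 0.004 * epoch for epoch in range(200)]

flat_gain = relative_improvement(plateaued)
late_gain = relative_improvement(still_improving)
print(f"plateaued run: {flat_gain:.2e}, still improving run: {late_gain:.2f}")
```

Applied to each monthly run, a value that is not close to zero would suggest that run stopped before converging.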

Many thanks again!

Optimox commented 11 months ago

Are you using a learning rate scheduler?

IMO, the best way to train neural networks is to tweak the learning rate and number of epochs so that you don't need to use early stopping anymore. With a good decay and number of epochs your model should reach its best validation score at the last epoch (or very close to best score). I do not know the size of your dataset here, but early stopping could be one explanation for the large differences you see on a monthly basis.
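
To make the "good decay and number of epochs" idea concrete, here is a minimal sketch of a step decay of the kind PyTorch's StepLR applies, where the learning rate is multiplied by gamma every step_size epochs. The helper names and the starting and final learning rates are hypothetical; the sketch assumes the scheduler is stepped once per epoch.

```python
def step_lr(initial_lr, gamma, step_size, epoch):
    """Learning rate at a given 0-based epoch under a step decay."""
    return initial_lr * gamma ** (epoch // step_size)


def gamma_for_target(initial_lr, final_lr, step_size, max_epochs):
    """Gamma such that the last epoch runs at `final_lr`."""
    n_decays = (max_epochs - 1) // step_size  # decays applied by the last epoch
    if n_decays == 0:
        raise ValueError("step_size is too large for any decay to happen")
    return (final_lr / initial_lr) ** (1 / n_decays)


# Hypothetical values: start at 2e-2 and finish near 1e-4 after 200 epochs.
initial_lr, final_lr, step_size, max_epochs = 2e-2, 1e-4, 20, 200
gamma = gamma_for_target(initial_lr, final_lr, step_size, max_epochs)

for epoch in (0, 50, 100, 150, 199):
    print(epoch, f"{step_lr(initial_lr, gamma, step_size, epoch):.6f}")
```

In pytorch-tabnet, the resulting step_size and gamma would presumably be passed through the scheduler_params argument alongside scheduler_fn; the right starting and final learning rates depend on the dataset and need to be tuned.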

micheusch commented 11 months ago

Oh, interesting, thank you very much, will give it a shot

micheusch commented 10 months ago

Hi again. I'm also getting this behaviour when using a StepLR scheduler and training for 200 epochs without early stopping.

Optimox commented 10 months ago

Then I don't know. Are you sure you are always feeding your features in the same order and correctly matching the features to their importances?
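
The ordering check suggested above could look like the following sketch. It assumes the importances come back as a positional array (for instance the model's feature_importances_) and that the column list used for each month's training is available. The helper names, column names, and values are hypothetical.

```python
def named_importances(feature_names, importances):
    """Pair each importance with its feature name, refusing ambiguous input."""
    if len(feature_names) != len(importances):
        raise ValueError("feature_names and importances differ in length")
    if len(set(feature_names)) != len(feature_names):
        raise ValueError("duplicate feature names")
    return dict(zip(feature_names, importances))


def order_mismatches(reference_names, other_names):
    """Return the positions at which two column orders disagree."""
    if len(reference_names) != len(other_names):
        raise ValueError("column lists differ in length")
    return [i for i, (a, b) in enumerate(zip(reference_names, other_names)) if a != b]


# Hypothetical column orders and importances for two monthly models.
january_cols = ["age", "balance", "tenure"]
february_cols = ["age", "tenure", "balance"]  # same features, different order
january_imp = [0.2, 0.7, 0.1]
february_imp = [0.2, 0.1, 0.7]

mismatches = order_mismatches(january_cols, february_cols)
jan = named_importances(january_cols, january_imp)
feb = named_importances(february_cols, february_imp)
print("positions with a different column:", mismatches)
print("same importances once matched by name:", jan == feb)
```

Comparing importances by name rather than by position rules out the case where a reordered column list makes stable importances look erratic.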