dreamquark-ai / tabnet

PyTorch implementation of TabNet paper : https://arxiv.org/pdf/1908.07442.pdf
https://dreamquark-ai.github.io/tabnet/
MIT License

Struggling to get model to fit - Help Wanted #530

Closed chadbreece closed 11 months ago

chadbreece commented 11 months ago

Describe the bug: Can't get TabNetRegressor to overfit or stabilize.

What is the current behavior? The dataset is 4M rows and ~150 features with a 60/20/20 train/val/test split. I have tried increasing the model complexity, but I still cannot get the model to work as intended, or even to overfit intentionally.

Here is the train/val eval graph: [image]

And here is the same data run through XGBoost (note that the first 20 boosting rounds were removed to highlight that the model overfits the train set): [image]

The hyperparameters I used for this TabNet run are below. They use OneCycleLR and were found via a Hyperopt run (which didn't help much):

- cat_emb_dim: [8, 4, 1]
- gamma: 1.5
- learning_rate: 0.0076031329945881925 (whether this is big or small I see this behavior)
- mask_type: 'sparsemax'
- max_lr: 0.15911336214731228 (whether this is big or small I see this behavior)
- n_d_a: 8
- n_steps: 3.0
- pct_start: 0.15
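For reference, a minimal sketch of how these hyperparameters could be wired into `TabNetRegressor` with OneCycleLR. The categorical indices/cardinalities, epoch count, and steps-per-epoch below are placeholders, not values from this issue; `n_d_a` is read as n_d and n_a tied together:

```python
import torch
from pytorch_tabnet.tab_model import TabNetRegressor

# Hypothetical categorical metadata -- replace with the real column indices/cardinalities.
cat_idxs = [0, 1, 2]
cat_dims = [100, 20, 5]

model = TabNetRegressor(
    n_d=8, n_a=8,                      # 'n_d_a': 8, presumably n_d and n_a tied in the search space
    n_steps=3,
    gamma=1.5,
    cat_idxs=cat_idxs,
    cat_dims=cat_dims,
    cat_emb_dim=[8, 4, 1],
    mask_type="sparsemax",
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=0.0076),  # 'learning_rate'
    scheduler_fn=torch.optim.lr_scheduler.OneCycleLR,
    scheduler_params=dict(
        is_batch_level=True,           # step the scheduler every batch (supported in recent pytorch-tabnet versions)
        max_lr=0.159,
        pct_start=0.15,
        epochs=100,                    # placeholder: must match max_epochs passed to fit()
        steps_per_epoch=1000,          # placeholder: roughly ceil(n_train / batch_size)
    ),
)
```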

I've read through other tickets here but haven't seen anyone else struggling to get the model to overfit; it's usually the opposite. Any advice is appreciated.

If the current behavior is a bug, please provide the steps to reproduce: N/A

Expected behavior: Overfitting.

Screenshots: See above.

Other relevant information:
poetry version:
python version: 3.7
Operating System:
Additional tools:

Additional context: N/A

chadbreece commented 11 months ago

Also, do you have any pointers for hyperparameters while scaling up? The full dataset I am trying to run is ~40M rows; I'm trying to tune the hyperparameters on this 10% sample before applying them to the full dataset.

Optimox commented 11 months ago

Your train/val plot looks suspicious to me: it is strange that train and valid have the exact same scores at every epoch.

Have you tried a learning rate decay?
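In case it helps, a sketch of what a simple learning rate decay could look like with pytorch-tabnet's scheduler hooks; the step size and decay factor are illustrative, not recommendations:

```python
import torch
from pytorch_tabnet.tab_model import TabNetRegressor

model = TabNetRegressor(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    scheduler_fn=torch.optim.lr_scheduler.StepLR,
    scheduler_params=dict(step_size=10, gamma=0.9),  # multiply the lr by 0.9 every 10 epochs
)
```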

chadbreece commented 11 months ago

I'm using OneCycleLR right now; I'm open to suggestions (scheduler or lr values) and can follow up with results here.

Also, I have been using log-cosh loss as my objective and MASE as my eval metric. My regression target is heavily right-skewed, so I recently tried RMSLE, but that didn't change the dynamic you see above.
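A minimal sketch of how a log-cosh `loss_fn` and a custom MASE eval metric could be passed to pytorch-tabnet. The MASE scaling constant below is a hypothetical placeholder; it would normally be the MAE of a naive forecast computed on the training set:

```python
import numpy as np
import torch
from pytorch_tabnet.metrics import Metric

def log_cosh_loss(y_pred, y_true):
    # numerically stable form: log(cosh(x)) = x + softplus(-2x) - log(2)
    x = y_pred - y_true
    return torch.mean(x + torch.nn.functional.softplus(-2.0 * x) - np.log(2.0))

MASE_SCALE = 1.0  # hypothetical placeholder: replace with the naive-forecast MAE on the train set

class MASE(Metric):
    def __init__(self):
        self._name = "mase"
        self._maximize = False

    def __call__(self, y_true, y_score):
        # MAE scaled by the (precomputed) naive-forecast MAE
        return float(np.mean(np.abs(y_true - y_score)) / MASE_SCALE)

# model.fit(X_train, y_train,
#           eval_set=[(X_train, y_train), (X_valid, y_valid)],
#           eval_metric=[MASE],
#           loss_fn=log_cosh_loss)
```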

Here is an example where I trained with log-cosh loss as my objective and used MAE as my eval metric (ignore the legend); this is a bit better but still pretty volatile. [image]

chadbreece commented 11 months ago

I tried reducing the number of epochs and pct_start in OneCycleLR and got the following: [image]

Much more stable, but I'm still not seeing the training MASE get much better than validation.

More experimenting yielded more of the same. XGBoost is often able to get below 0.4 MASE, but I can't seem to get TabNet below ~0.45. [image]

Optimox commented 11 months ago

A large batch size often plays the role of a regularization method because of the batch norm used during training. At the cost of a longer training time, you can try significantly lowering batch_size and virtual_batch_size (e.g. to 64). I'm not sure you'll get better validation performance, but you should be able to see some overfitting.
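A sketch of what that suggestion could look like in the fit call; the array names and epoch counts are placeholders:

```python
model.fit(
    X_train, y_train,                  # placeholder arrays
    eval_set=[(X_train, y_train), (X_valid, y_valid)],
    eval_name=["train", "valid"],
    eval_metric=["mae"],
    max_epochs=100,
    patience=20,
    batch_size=64,                     # much smaller than the default 1024
    virtual_batch_size=64,             # ghost batch norm chunk size, kept equal to batch_size here
)
```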

chadbreece commented 11 months ago

Doing this, I was still unable to get the model to overfit... Are there any other hyperparameters I should be looking to change to help with this?

Optimox commented 11 months ago

Larger n_d and n_a, and a larger number of steps: greater model capacity should enable overfitting.
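For example, a higher-capacity configuration along those lines might look like the following; the values are illustrative, not tuned:

```python
from pytorch_tabnet.tab_model import TabNetRegressor

model = TabNetRegressor(
    n_d=64, n_a=64,  # much wider decision/attention dimensions than the default 8
    n_steps=7,       # more sequential attention steps than the default 3
    gamma=1.5,
)
```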