dreamquark-ai / tabnet

PyTorch implementation of the TabNet paper: https://arxiv.org/pdf/1908.07442.pdf
https://dreamquark-ai.github.io/tabnet/
MIT License

Severe overfitting #525

Closed 979-Ryan closed 11 months ago

979-Ryan commented 1 year ago

Feature request

I tested the TabNet model on a dataset with about 3k feature columns, about 1500k training samples, and 300k validation samples. The model always overfits and triggers early stopping. I have set n_steps=3, n_a=n_d=16 or 32, gamma=1.5 or 1.8, lambda_sparse = 0, 1e-3, 1e-2, 5e-2, batch_size = 1024 48, virtual_batch_size = 128 48. My loss function is MSE and my evaluation metric is the Pearson correlation coefficient. During TabNet training, as the training loss decreases, the validation loss initially declines but then fluctuates and rises. How can I resolve this? Any other advice on regularization?

979-Ryan commented 1 year ago

My learning rate strategy is optimizer_params = dict(lr=1e-1), scheduler_params = dict(T_0=100, T_mult=1, eta_min=1e-2), scheduler_fn=CosineAnnealingWarmRestarts, with the Adam optimizer and patience=10.
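For reference, the schedule described above can be sketched with plain PyTorch (a toy parameter stands in for the model weights; the values mirror the settings quoted here, not a recommendation):

```python
import torch

# A toy parameter so the optimizer has something to manage;
# in practice this would be the TabNet model's parameters.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([param], lr=1e-1)

# Cosine annealing with warm restarts: the lr decays from 1e-1 down to
# eta_min over T_0 scheduler steps, then restarts (T_mult=1 keeps all
# cycles the same length).
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=100, T_mult=1, eta_min=1e-2
)

lrs = []
for _ in range(100):
    optimizer.step()
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])
```

Note that with a restart-based schedule and patience=10, early stopping may fire while the learning rate is still near its peak, which could interact badly with a fluctuating validation loss.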

Optimox commented 1 year ago

Your learning rate is probably too high. Also, start with a simple learning rate decay schedule like OneCycleLR.
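A minimal OneCycleLR setup with plain PyTorch might look like this (the toy parameter and total_steps value are placeholders, not tuned recommendations):

```python
import torch

# Placeholder parameter; in practice, the model's parameters.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([param], lr=1e-2)

# OneCycleLR warms up to max_lr and then decays for the rest of the run.
# total_steps is a stand-in for epochs * batches_per_epoch.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, total_steps=1000
)

lrs = []
for _ in range(999):
    optimizer.step()
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])
```

Unlike warm restarts, the learning rate here ends near zero, so the end of training is a genuine fine-tuning phase.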

979-Ryan commented 1 year ago

> Your learning rate is probably too high. Also, start with a simple learning rate decay schedule like OneCycleLR.

I have tried smaller initial learning rates like 2e-2 and 5e-2, but then the training loss didn't even decrease.

eduardocarvp commented 1 year ago

It may be worth having a look at the explanation matrices to check whether some of the features are causing the overfitting, for example some sort of index column that has not been dropped. Without more details about the data, it's probably going to be hard to diagnose exactly what might be happening.
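One quick way to spot a single dominating feature from such a matrix is to normalize and sort the column sums. A sketch with numpy, using a random matrix as a stand-in for the per-sample importances returned by the library's `explain` method:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for an explanation matrix of shape (n_samples, n_features)
# with non-negative per-sample feature importances.
explain_matrix = rng.random((1000, 30))
explain_matrix[:, 7] *= 50  # simulate one suspicious dominating feature

# Aggregate to global importances and normalize to sum to 1.
importances = explain_matrix.sum(axis=0)
importances = importances / importances.sum()

# Indices of the five most important features, most important first.
top = np.argsort(importances)[::-1][:5]
```

If one column (here, the artificially inflated feature 7) dwarfs all others, inspecting what that column actually contains is a good first step.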

Other than that, the batch size strikes me as pretty large. Maybe that's the reason you have to use such a high learning rate.

979-Ryan commented 1 year ago

> It may be worth having a look at the explanation matrices to check whether some of the features are causing the overfitting, for example some sort of index column that has not been dropped. Without more details about the data, it's probably going to be hard to diagnose exactly what might be happening.
>
> Other than that, the batch size strikes me as pretty large. Maybe that's the reason you have to use such a high learning rate.

It's indeed possible that some of the features cause the overfitting, but I've already configured lambda_sparse and gamma for regularization. Regarding the batch_size, I followed the original paper's recommendation of setting it between 1% and 10% of the training set size. Should I reduce it?

Optimox commented 1 year ago

Do you observe the same pattern with XGBoost or any other ML model? If so, this is data related, not model related.
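That sanity check can be sketched as follows, here with scikit-learn's GradientBoostingRegressor on synthetic data as a stand-in (with the real dataset you would substitute LightGBM/XGBoost and compare the same train/validation Pearson gap):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data as a placeholder for the real dataset.
X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

# Same metric as the TabNet runs: Pearson correlation on train and validation.
pearson_tr = np.corrcoef(y_tr, model.predict(X_tr))[0, 1]
pearson_val = np.corrcoef(y_val, model.predict(X_val))[0, 1]
# A large train/validation gap here too would point at the data, not TabNet.
```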

eduardocarvp commented 1 year ago

It is true that they use large batch sizes in the paper, up to 16k. The virtual batch size is always much smaller, though, at 512 max.
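The virtual batch size controls TabNet's ghost batch normalization: each batch is split into virtual batches that are normalized independently, which acts as a regularizer in its own right. A rough numpy sketch of the idea (the function name is illustrative; it omits the learned scale/shift and running statistics of the real layer):

```python
import numpy as np

def ghost_batch_norm(x, virtual_batch_size, eps=1e-5):
    """Illustrative sketch: normalize each virtual batch independently."""
    n_chunks = max(1, len(x) // virtual_batch_size)
    chunks = np.array_split(x, n_chunks)
    out = [(c - c.mean(axis=0)) / np.sqrt(c.var(axis=0) + eps) for c in chunks]
    return np.concatenate(out)

batch = np.random.default_rng(0).normal(size=(1024, 4))
# With virtual_batch_size=256, a 1024-row batch is normalized in 4 slices.
normed = ghost_batch_norm(batch, virtual_batch_size=256)
```

Smaller virtual batches mean noisier per-slice statistics, hence more regularization; this is why batch_size and virtual_batch_size are tuned somewhat independently.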

979-Ryan commented 1 year ago

> Do you observe the same pattern with XGBoost or any other ML model? If so, this is data related, not model related.

LightGBM performs much better with the same loss and evaluation metric.

979-Ryan commented 1 year ago

It is also worth mentioning that training is extremely slow, around 9-10 minutes per epoch. Any advice on this?

Optimox commented 1 year ago

Do you have a GPU?

Optimox commented 1 year ago

What happens with batch_size = 2048 and virtual_batch_size = 256?

979-Ryan commented 1 year ago

> Do you have a GPU?

Yes, I'm training on an Nvidia 3090. I haven't tried a batch_size smaller than 16384. Are training speed and the learning rate strategy related to batch size? If I use a smaller batch size, should I lower the learning rate correspondingly? Thank you!

Optimox commented 1 year ago

Training speed is directly proportional to your batch size as long as 1) your GPU is not already at 100% usage and 2) your CPU is not the bottleneck. After that, a larger batch size will make training slower.

Batch size and learning rate are related in theory, yes. lr=1e-2, batch_size=1024, virtual_batch_size=256, with nothing else specified in the parameters, has never let me down. If this does not work at all, I can't help you further unless you give me access to your dataset.
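One common heuristic behind that relationship is the linear scaling rule: scale the learning rate in proportion to the batch size. This is a rule of thumb, not something prescribed in this thread, and the helper below is purely illustrative:

```python
def scale_lr(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule of thumb: lr grows/shrinks with batch size."""
    return base_lr * new_batch_size / base_batch_size

# Starting from the suggested lr=1e-2 at batch_size=1024, moving to
# batch_size=4096 would suggest roughly a 4x larger learning rate.
lr_4096 = scale_lr(1e-2, 1024, 4096)
```

In practice the rule breaks down at very large batch sizes, so it is a starting point for tuning rather than a guarantee.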

979-Ryan commented 1 year ago

Perhaps what you said above contradicts what you mentioned in this issue: https://github.com/dreamquark-ai/tabnet/issues/391#issuecomment-1113099435. After I reduced the batch size to 4096 and the virtual batch size to 512, training became slower.

Optimox commented 1 year ago

The larger your batch size, the faster your training. Where is the contradiction here?

979-Ryan commented 1 year ago

Sorry, I misunderstood what you said.

979-Ryan commented 1 year ago

> What happens with batch_size = 2048 and virtual_batch_size = 256?

I just tried a few epochs with batch_size 4096 and virtual_batch_size 512. The performance is worse and the overfitting is heavier.

979-Ryan commented 1 year ago

> Training speed is directly proportional to your batch size as long as 1) your GPU is not already at 100% usage and 2) your CPU is not the bottleneck. After that, a larger batch size will make training slower.
>
> Batch size and learning rate are related in theory, yes. lr=1e-2, batch_size=1024, virtual_batch_size=256, with nothing else specified in the parameters, has never let me down. If this does not work at all, I can't help you further unless you give me access to your dataset.

Did you try these settings on large datasets with a large number of features?