Closed: 979-Ryan closed 11 months ago

I am testing the TabNet model on a dataset with about 3k feature columns, roughly 1500k training samples, and 300k validation samples. The model always overfits and triggers early stopping. I have set n_steps=3, n_d=n_a=16 or 32, gamma=1.5 or 1.8, lambda_sparse = 0, 1e-3, 1e-2, or 5e-2, batch_size = 1024*48, virtual_batch_size = 128*48. My loss function is MSE and my evaluation metric is the Pearson correlation coefficient. During training, as the training loss decreases, the validation loss initially declines but then fluctuates and rises. How can I resolve this? Any other advice on regularization?

My learning_rate strategy is optimizer_params = dict(lr=1e-1), scheduler_params = dict(T_0=100, T_mult=1, eta_min=1e-2), scheduler_fn=CosineAnnealingWarmRestarts, with the Adam optimizer and patience=10.
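For reference, a minimal runnable sketch of that setup; synthetic placeholder data stands in for the real dataset, with shapes shrunk so it runs quickly:

```python
import numpy as np
import torch
from pytorch_tabnet.tab_model import TabNetRegressor

# Synthetic stand-ins for the real data (~3k features, ~1500k rows); shrunk to run fast.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 100)).astype(np.float32)
y_train = rng.normal(size=(5000, 1)).astype(np.float32)  # TabNetRegressor expects 2D targets
X_valid = rng.normal(size=(1000, 100)).astype(np.float32)
y_valid = rng.normal(size=(1000, 1)).astype(np.float32)

model = TabNetRegressor(
    n_d=16, n_a=16, n_steps=3, gamma=1.5, lambda_sparse=1e-3,
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=1e-1),
    scheduler_fn=torch.optim.lr_scheduler.CosineAnnealingWarmRestarts,
    scheduler_params=dict(T_0=100, T_mult=1, eta_min=1e-2),
)
model.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    eval_metric=["mse"],  # the Pearson metric would need a custom pytorch-tabnet Metric
    patience=10,
    batch_size=1024, virtual_batch_size=128,  # the real runs used a far larger batch
)
```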
Your learning rate is probably too high; also, start with a simple learning rate schedule like OneCycleLR.
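In pytorch-tabnet, that suggestion looks roughly like the sketch below; it reuses the placeholder data from the first sketch, the epoch and step counts are illustrative, and it assumes a library version whose scheduler_params accepts the is_batch_level flag:

```python
import torch
from pytorch_tabnet.tab_model import TabNetRegressor

max_epochs = 50   # illustrative, not a value from this thread
batch_size = 1024
steps_per_epoch = (len(X_train) + batch_size - 1) // batch_size

model = TabNetRegressor(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    scheduler_fn=torch.optim.lr_scheduler.OneCycleLR,
    scheduler_params=dict(
        max_lr=2e-2,
        epochs=max_epochs,
        steps_per_epoch=steps_per_epoch,
        is_batch_level=True,  # step the scheduler every batch rather than every epoch
    ),
)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)],
          max_epochs=max_epochs, patience=10, batch_size=batch_size)
```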
I have tried smaller initial learning rates like 2e-2 and 5e-2, but the training loss didn't even decrease.
Maybe it's worth having a look at the explanation matrices to check whether some of the features are causing the overfit, for example some sort of index column that has not been dropped. Without more details about the data it's probably going to be hard to diagnose exactly what might be happening.
Other than that, the batch size strikes me as pretty large. Maybe that's the reason you have to use such a high learning rate.
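Pulling those explanation matrices out of a fitted model is short; this sketch assumes the fitted model and the X_valid placeholder from the first sketch:

```python
import numpy as np

# Per-sample feature attributions plus the per-step attention masks.
explain_matrix, masks = model.explain(X_valid)  # explain_matrix: (n_samples, n_features)

# Global importances; a leaked index-like column would stand out at the top here.
top = np.argsort(model.feature_importances_)[::-1][:20]
print(top, model.feature_importances_[top])
```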
It's indeed possible that some of the features cause the overfit, but I've already configured lambda_sparse and gamma for regularization. Regarding the batch_size, I followed the original paper's recommendation of setting it between 1% and 10% of the training set (with ~1500k training samples, that is roughly 15k to 150k rows). Should I reduce it?
Do you observe the same pattern with XGBoost or any other ML model? If so, this is data-related and not model-related.
It is true that they use large batch sizes, up to 16K in the paper. The virtual batch size is always much smaller though, at most 512.
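As a sketch, that paper-style split between batch size and virtual batch size maps onto fit() like this (assuming a dataset large enough for a 16K batch):

```python
model.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    batch_size=16384,        # up to 16K in the paper
    virtual_batch_size=512,  # ghost batch norm chunks, always much smaller
)
```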
LightGBM performs much better with the same loss and evaluation metric.
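A sketch of that baseline comparison; the LightGBM hyperparameters are illustrative assumptions since the exact settings aren't given in the thread, and the custom eval reproduces the Pearson metric:

```python
import lightgbm as lgb
from scipy.stats import pearsonr

def pearson_eval(y_true, y_pred):
    # Custom eval for LightGBM's sklearn API: (name, value, is_higher_better).
    return "pearson", pearsonr(y_true, y_pred)[0], True

reg = lgb.LGBMRegressor(objective="regression", n_estimators=2000, learning_rate=0.05)
reg.fit(
    X_train, y_train.ravel(),  # placeholders from the first sketch; LightGBM wants 1D targets
    eval_set=[(X_valid, y_valid.ravel())],
    eval_metric=pearson_eval,
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
```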
It is also worth mentioning that training is extremely slow, around 9-10 minutes per epoch. Any advice on this?
Do you have a GPU?
What happens with batch_size = 2048 and virtual_batch_size = 256?
Yes, I'm training on an Nvidia 3090. I haven't tried a batch_size smaller than 16384. Are training speed and the learning rate strategy related to batch size? If I use a smaller batch size, should I lower the learning rate correspondingly? Thank you!
Training speed is directly proportional to your batch size as long as 1) your GPU is not already reaching 100% usage and 2) your CPU is not the bottleneck. Beyond that point, a larger batch size will make training slower.
Batch size and learning rate are related in theory, yes. lr=1e-2, batch_size=1024, virtual_batch_size=256, and nothing else specified in the parameters has never let me down. If this does not work at all, I can't help you further unless you give access to your dataset.
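Spelled out, that recommendation is just the following, with everything else left at the library defaults (pytorch-tabnet's default optimizer is Adam):

```python
from pytorch_tabnet.tab_model import TabNetRegressor

# lr=1e-2 via the default Adam optimizer, nothing else specified.
model = TabNetRegressor(optimizer_params=dict(lr=1e-2))
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)],
          batch_size=1024, virtual_batch_size=256)
```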
Perhaps what you said above contradicts what you mentioned in this issue? [https://github.com/dreamquark-ai/tabnet/issues/391#issuecomment-1113099435]. After I reduced the batch size to 4096 and the virtual batch size to 512, training got slower.
The larger your batch size, the faster your training is, so where is the contradiction here?
Sorry, I misunderstood what you said.
Just tried a few epochs with batch size 4096 and virtual batch size 512. The performance is worse; the overfitting is heavier.
Did you try this setting with large datasets and a large number of features?