dreamquark-ai / tabnet

PyTorch implementation of the TabNet paper: https://arxiv.org/pdf/1908.07442.pdf
https://dreamquark-ai.github.io/tabnet/
MIT License

Running out of memory during training #431

Closed · Kayne88 closed this issue 2 years ago

Kayne88 commented 2 years ago

When training with a custom eval metric (Pearson correlation), my Colab session runs out of memory after the first evaluation.

What is the current behavior?

Training of TabNetRegressor starts fine, but after the first evaluation round I run out of memory. I am training the model on a 16 GB GPU with approx. 40 GB of free RAM. RAM consumption steadily increases during training. I am training on a pretty large dataset (11 GB).

Expected behavior

I would expect that the RAM consumption is more or less constant during training, once the model is initialized.

Screenshots

import numpy as np
import torch
from pytorch_tabnet.tab_model import TabNetRegressor
from pytorch_tabnet.metrics import Metric


def corr_score(y_true, y_pred):
    return "score", np.corrcoef(y_true, y_pred)[0, 1], True


class PearsonCorrMetric(Metric):
    def __init__(self):
        self._name = "pearson_corr"
        self._maximize = True

    def __call__(self, y_true, y_score):
        return corr_score(y_true, y_score)[1]


max_epochs = 2
batch_size = 1028

model = TabNetRegressor(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=1e-2),
)

model.fit(
    X_train=factors_train[features].to_numpy(),
    y_train=factors_train.target.to_numpy().reshape((-1, 1)),
    eval_set=[(factors_test[features].to_numpy(), factors_test.target.to_numpy().reshape((-1, 1)))],
    eval_name=['test'],
    eval_metric=[PearsonCorrMetric],
    max_epochs=max_epochs,
    patience=5,
    batch_size=batch_size,
    virtual_batch_size=128,
    num_workers=0,
    drop_last=False,
)

Other relevant information:
- poetry version: ?
- python version: 3.8
- Operating System: Ubuntu
- Additional tools:

Additional context

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    24W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Optimox commented 2 years ago

Memory consumption should peak at the end of an epoch.

Do you manage to get the Pearson correlation score for the first epoch?

Does it work if you reduce the batch size?

Kayne88 commented 2 years ago

Do you manage to get the Pearson correlation score for the first epoch? No, I can't see the evaluation of the first epoch.

Does it work if you reduce the batch size? I initially tried with a batch_size of 256 and also ran out of RAM.

Optimox commented 2 years ago

Is it GPU OOM or RAM OOM ?

Kayne88 commented 2 years ago

RAM OOM. It basically jumps from 30 GB of consumption to over 52 GB.

Could this be related to the custom metric? Might it help if I implement the metric with torch rather than np, so it can use the GPU?

Optimox commented 2 years ago

Could this be related to the custom metric?

I think it's unlikely, but you can try RMSE and see if it solves the problem.

What is the size of your train/test, in number of rows and columns?

Kayne88 commented 2 years ago

TRAIN (1914562, 1214) - TEST (476390, 1214). RMSE actually works :)

Optimox commented 2 years ago

I'd be happy to know if you get competitive results on your dataset with tabnet. Please leave a comment if you can :)

Kayne88 commented 2 years ago

With pleasure. However, I first need to make the corr metric work; RMSE is not appropriate for my problem. Also, I would really like to use a custom loss eventually.

Once that is working, I'll gladly share a comparison of tabnet against my current catboost benchmark scores.
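
What I have in mind for the custom loss is a differentiable negative Pearson correlation passed to fit via its loss_fn argument. A rough, untested sketch (neg_pearson_loss is just an illustrative name):

import torch

def neg_pearson_loss(y_pred, y_true):
    # Differentiable negative Pearson correlation between predictions and targets.
    y_pred = y_pred.flatten()
    y_true = y_true.flatten()
    pred_c = y_pred - y_pred.mean()
    true_c = y_true - y_true.mean()
    corr = (pred_c * true_c).sum() / (pred_c.norm() * true_c.norm() + 1e-8)
    # Maximizing correlation == minimizing its negative.
    return -corr

# model.fit(..., loss_fn=neg_pearson_loss)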

Optimox commented 2 years ago

Can't you use a simple Pearson correlation?

https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pearsonr.html

Kayne88 commented 2 years ago

I tried to use sklearn's r2_score, also OOM. I suspect the problem is that during the eval metric calculation the model, tensors and data are moved to the CPU. One option could be to explicitly transfer them to CUDA in the metric calculation.

What is working for me now is the implementation here: https://torchmetrics.readthedocs.io/en/stable/regression/pearson_corr_coef.html
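
Roughly, the wrapper looks like this (a sketch rather than my exact code; TorchPearsonCorr is just an illustrative name, and it assumes the Metric base class from pytorch_tabnet.metrics plus a recent torchmetrics):

import torch
from pytorch_tabnet.metrics import Metric
from torchmetrics.functional import pearson_corrcoef

class TorchPearsonCorr(Metric):
    def __init__(self):
        self._name = "pearson_corr"
        self._maximize = True

    def __call__(self, y_true, y_score):
        # TabNet hands the metric numpy arrays of shape (n_samples, 1) for regression,
        # so flatten both before computing the coefficient.
        preds = torch.from_numpy(y_score).float().flatten()
        target = torch.from_numpy(y_true).float().flatten()
        return pearson_corrcoef(preds, target).item()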

First runs look very promising: after only 15 epochs I come close to catboost performance (which is hyperparam-optimized). The real comparison will come on the full validation set (separate from train and test), which is almost as large as the whole train set.

One drawback of tabnet is that hyperparam optimization (with optuna) will take a very long time even for just 100 trials. I need to see how best to approach that topic.
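
For context, the search would run roughly a loop like this (X_train/y_train/X_valid/y_valid are placeholders for my splits and the search ranges are arbitrary), which is why 100 trials already gets expensive:

import numpy as np
import optuna
import torch
from pytorch_tabnet.tab_model import TabNetRegressor

def objective(trial):
    n_da = trial.suggest_int("n_da", 8, 64)   # shared width for n_d and n_a
    model = TabNetRegressor(
        n_d=n_da, n_a=n_da,
        n_steps=trial.suggest_int("n_steps", 3, 8),
        gamma=trial.suggest_float("gamma", 1.0, 2.0),
        lambda_sparse=trial.suggest_float("lambda_sparse", 1e-6, 1e-3, log=True),
        optimizer_fn=torch.optim.AdamW,
        optimizer_params=dict(lr=trial.suggest_float("lr", 1e-3, 3e-2, log=True)),
        verbose=0,
    )
    model.fit(
        X_train, y_train,                      # placeholder arrays, y reshaped to (-1, 1)
        eval_set=[(X_valid, y_valid)],
        max_epochs=15, patience=3,
        batch_size=1024, virtual_batch_size=128,
    )
    preds = model.predict(X_valid)
    # Maximize Pearson correlation on the held-out set.
    return np.corrcoef(y_valid.ravel(), preds.ravel())[0, 1]

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)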

I'll keep you updated.

PS: What I observe during training with a fixed LR is that for 1-2 epochs the eval metric oscillates and then makes a significant improvement. I am not very experienced with LR schedulers but decided to give OneCycleLR a try. Maybe it smooths the training.

Optimox commented 2 years ago

Yes, I would advise decaying the LR with OneCycleLR. This will make the model converge in fewer epochs.
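
Something along these lines should work, adapting your earlier snippet (untested here; batch_size, max_epochs and X_train are whatever you already use):

import torch
from pytorch_tabnet.tab_model import TabNetRegressor

steps_per_epoch = len(X_train) // batch_size + 1

model = TabNetRegressor(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    scheduler_fn=torch.optim.lr_scheduler.OneCycleLR,
    scheduler_params=dict(
        is_batch_level=True,               # step the scheduler every batch, not every epoch
        max_lr=2e-2,
        steps_per_epoch=steps_per_epoch,
        epochs=max_epochs,
    ),
)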

Thanks for the updates!

Kayne88 commented 2 years ago

Here are some intermediate results and a comparison with the catboost benchmark. I've applied shallow hyperparam optimization to tabnet. Things to note: the dataset has a very low signal-to-noise ratio; it's from a financial context, where the target is some performance measure of an asset to be predicted. Adequate basic metrics for such a problem are different kinds of correlations. The comparisons are done on a large validation set, which is almost the size of the train set. The task is regression.

CATBOOST

PREDS
- pearson correlation: 0.031141676801666244
- feature neutral correlation: 0.02642358893294882

PREDS NEUTRALIZED
- pearson correlation: 0.028844560064221897
- feature neutral correlation: 0.026562891162170366

TABNET

PREDS
- pearson correlation: 0.02533170902252626
- spearman corr: 0.02516739791397788
- fnc: 0.021378226358012287

PREDS NEUTRALIZED
- pearson correlation: 0.021450115592071817
- spearman corr: 0.020884041941114838
- fnc: 0.020596744887857364

We can see that the metrics fall off by quite some margin; however, tabnet achieves the best performance among the deep learning architectures I tried (tabtransformer, resnet). Another thing to note is that the pearson correlation between the catboost predictions and the tabnet predictions is roughly 0.66, which is not tremendously high. So it seems that tabnet learns a different signal than catboost.

Current flaws:

Current hyperparam grid:

param_grid = {
      "optimizer_fn": torch.optim.AdamW,
      "optimizer_params": dict(lr=0.017),
      "scheduler_fn": torch.optim.lr_scheduler.CosineAnnealingWarmRestarts,
      "scheduler_params": dict(T_0=200, T_mult=1, eta_min=1e-4, last_epoch=-1, verbose=False),
      "n_d": 8,
      "n_a": 8,
      "n_steps": 7,
      "gamma": 2.0,
      "n_independent": 4,
      "n_shared": 3,
      "momentum": 0.17,
      "lambda_sparse": 0,
      "verbose": 1,
      "mask_type": "entmax"
  }
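
These are all constructor arguments, so the dict is simply unpacked into the model before calling fit; roughly (data names and fit settings are placeholders):

model = TabNetRegressor(**param_grid)
model.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    eval_metric=[TorchPearsonCorr],   # the torchmetrics-based metric from above
    max_epochs=100, patience=10,
    batch_size=1024, virtual_batch_size=128,
)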

@Optimox

Optimox commented 2 years ago

@Kayne88 thank you very much for sharing your results.

The model learns to pay attention to specific features in order to minimize the loss function. Some features might end up masked out if they correlate too much with a better feature; however, you'll have no guarantee that this is the case. You could simply remove those features before training.

However you can play with hyperparameters to get closer to what you want:

All these recommendations come with no guarantee of working. This is just my general understanding, but you should experiment with them and see how it goes.

Good luck!