rahul-tripathi closed this issue 3 years ago
Hello,
Is this related to the Kaggle Optiver competition?
The number of samples in the dataset does not affect GPU memory usage; only the batch size and the input size matter.
Pretraining usually needs much more memory than regular training because the decoder is bigger and the input dimension is usually larger than the output dimension.
Maybe reducing the batch size can help you avoid the memory error; see the sketch below.
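For example, something along these lines (just a minimal sketch with placeholder data and parameter values, assuming the usual TabNetPretrainer.fit arguments):

```python
# Minimal sketch (placeholder data and values): pass a smaller batch_size
# (and a matching virtual_batch_size) to the pretrainer's fit call.
import numpy as np
from pytorch_tabnet.pretraining import TabNetPretrainer

X_train = np.random.rand(100_000, 375).astype(np.float32)  # stand-in for your data

pretrainer = TabNetPretrainer()
pretrainer.fit(
    X_train=X_train,
    pretraining_ratio=0.8,
    batch_size=256,          # smaller batches -> less memory per forward/backward pass
    virtual_batch_size=128,  # ghost batch norm chunk size, keep <= batch_size
)
```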
No, this is not related to Kaggle; it's a private dataset. I tried reducing the batch size all the way from 1024 down to 128, but the same GPU memory error persists, so reducing the batch size does not help.
I'm closing this as we have no way to reproduce it, nor can we be sure there is a real problem beyond the pretrainer being too big for your GPU.
Feel free to reopen with reproducible code so that I can take a closer look.
Thank you
**This comment does not re-raise the issue, it's just an FYI.** First of all, thank you for creating this neat and wonderful implementation!
I ran into the same strange GPU OOM issue, which depends neither on batch_size nor on model size, but on the size of my entire dataset. I traced the problem to predict_epoch() in pretraining.py: around line 347, the metrics are calculated on the stacked output of all batches at once instead of per batch, so embedded_x and obf_vars have shape (#entire_dataset, #features), which can require too much GPU memory.
If one has a version before v4.0 (v4.0 does not seem to be on conda-forge yet), I would suggest moving embedded_x and obf_vars back to the CPU before calling metric_fn, inside the loop around line 340.
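Concretely, the idea is something like this inside the prediction loop (a rough sketch from memory, not the exact library source; _predict_batch and the list names are assumed):

```python
# Rough sketch of the pre-v4.0 workaround idea (not the actual library code):
# detach and move each batch's tensors to the CPU before accumulating them,
# so the stacked (n_samples x n_features) tensors live in host RAM, not on the GPU.
list_output, list_embedded_x, list_obf_vars = [], [], []
for X in loader:
    output, embedded_x, obf_vars = self._predict_batch(X)  # assumed helper
    list_output.append(output.detach().cpu())
    list_embedded_x.append(embedded_x.detach().cpu())
    list_obf_vars.append(obf_vars.detach().cpu())
# the stacked results stay torch.Tensors, so UnsupervisedLoss can still consume them
```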
This change is partially reflected in v4.0. However, I'm a bit confused by it: in v4.0, embedded_x is not only moved to the CPU but also converted to NumPy, while UnsupervisedLoss is only compatible with torch.Tensor, so I'm not sure it wouldn't throw an error there.
Suggestion: maybe we could evaluate the metrics per batch inside the loop and then aggregate, keeping everything on the GPU? A rough sketch is below.
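Something like this, for illustration only (not the actual pytorch-tabnet code; the network is assumed to return (output, embedded_x, obf_vars) from its forward pass):

```python
# Illustrative sketch: compute the unsupervised metric batch by batch and
# aggregate, so no full-dataset tensor is ever materialised on the GPU.
import torch
from pytorch_tabnet.metrics import UnsupervisedLoss

def unsup_loss_per_batch(network, loader, device="cuda"):
    network.eval()
    total_loss, total_samples = 0.0, 0
    with torch.no_grad():
        for X in loader:
            X = X.to(device).float()
            output, embedded_x, obf_vars = network(X)  # assumed return signature
            batch_loss = UnsupervisedLoss(output, embedded_x, obf_vars)
            total_loss += batch_loss.item() * X.shape[0]
            total_samples += X.shape[0]
    return total_loss / total_samples  # batch-size-weighted average
```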
P.S. I installed pytorch==1.12.1 and pytorch-tabnet==3.1.1 with -c conda-forge
Thanks @harry-fuyu, I need to have a look at what you are pointing at.
When pre-training TabNet on a training dataset of size 1.2M x 375 (after taking embedding dimensions into account), it fails with the GPU out-of-memory error below:
```
...pytorch_tabnet/sparsemax.py in forward(ctx, input, dim)
    125         input = input / 2  # divide by 2 to solve actual Entmax
    126
--> 127         tau_star, _ = Entmax15Function._threshold_and_support(input, dim)
    128         output = torch.clamp(input - tau_star, min=0) ** 2
    129         ctx.save_for_backward(output)

...pytorch_tabnet/sparsemax.py in _threshold_and_support(input, dim)
    148         mean_sq = (Xsrt ** 2).cumsum(dim) / rho
--> 149         ss = rho * (mean_sq - mean ** 2)
    151         delta = (1 - ss) / rho
    152

RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 5; 11.17 GiB total capacity; 10.63 GiB already allocated; 192.00 KiB free; 10.65 GiB reserved in total by PyTorch)
```
The training data is not that large, so it is surprising that it fails with this GPU memory issue. I also benchmarked the GPU memory usage on the "pretraining_example" Jupyter notebook that comes with the repository: there it uses 0.5 GB of GPU memory on a training dataset of size 26K x 14, which again seems too high for such a small dataset. All tests were done on a Tesla K80 GPU with 12 GB of memory.
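For reference, peak allocation can be read with PyTorch's built-in counters; a generic snippet, independent of pytorch-tabnet:

```python
# Generic way to read peak GPU memory around a run (standard PyTorch API).
import torch

torch.cuda.reset_peak_memory_stats()
# ... run pretrainer.fit(...) here ...
peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
print(f"Peak allocated GPU memory: {peak_gb:.2f} GB")
```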
It would be interesting to know the root cause of this GPU memory issue, given that the datasets are not that large and should fit on a single GPU.