dreamquark-ai / tabnet

PyTorch implementation of TabNet paper : https://arxiv.org/pdf/1908.07442.pdf
https://dreamquark-ai.github.io/tabnet/
MIT License

CUDA out of memory when performing TabNetPretrainer on large dataset #435

Closed zjgbz closed 1 year ago

zjgbz commented 1 year ago

Hi TabNet Team,

I tried to run TabNetPretrainer on a tabular dataset (10581339 x 79). I first read it into CPU memory and then converted it to a numpy array, which is the format TabNetPretrainer expects, using the code below.

import torch
from pytorch_tabnet.pretraining import TabNetPretrainer

unsupervised_model = TabNetPretrainer(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    mask_type='entmax', # "sparsemax"
)

max_epochs = 1000
unsupervised_model.fit(
    X_train=X_train,
    eval_set=[X_valid],
    max_epochs=max_epochs, patience=10,
    batch_size=1024, virtual_batch_size=128,
    num_workers=0,
    drop_last=False,
    pretraining_ratio=0.5,
)

The dimension of X_train is 10581339 x 79, and the dimension of X_valid is 2645335 x 79. The error message is:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-15-bba5ed14a698> in <module>
      7     num_workers=0,
      8     drop_last=False,
----> 9     pretraining_ratio=0.5,
     10 )

~/miniconda3/envs/pytorch_tabnet/lib/python3.7/site-packages/pytorch_tabnet/pretraining.py in fit(self, X_train, eval_set, eval_name, loss_fn, pretraining_ratio, weights, max_epochs, patience, batch_size, virtual_batch_size, num_workers, drop_last, callbacks, pin_memory)
    147             self._callback_container.on_epoch_begin(epoch_idx)
    148 
--> 149             self._train_epoch(train_dataloader)
    150 
    151             # Apply predict epoch to all eval sets

~/miniconda3/envs/pytorch_tabnet/lib/python3.7/site-packages/pytorch_tabnet/pretraining.py in _train_epoch(self, train_loader)
    273             self._callback_container.on_batch_begin(batch_idx)
    274 
--> 275             batch_logs = self._train_batch(X)
    276 
    277             self._callback_container.on_batch_end(batch_idx, batch_logs)

~/miniconda3/envs/pytorch_tabnet/lib/python3.7/site-packages/pytorch_tabnet/pretraining.py in _train_batch(self, X)
    300         batch_logs = {"batch_size": X.shape[0]}
    301 
--> 302         X = X.to(self.device).float()
    303 
    304         for param in self.network.parameters():

RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 15.78 GiB total capacity; 14.58 GiB already allocated; 2.00 MiB free; 14.75 GiB reserved in total by PyTorch)

I tried decreasing the batch_size to 512, but the same error occurred. When I reduced X_train and X_valid to 1/10 of their size (roughly 1M x 79 and 0.3M x 79), I could pretrain the model successfully, even with batch_size = 2048. Does this mean that TabNetPretrainer first loads all the data onto the GPU and then processes it batch by batch? Could you help me with this issue?

Thank you very much.

Optimox commented 1 year ago

Hello, the data is only moved to the GPU batch by batch. However, the pretrainer model itself is sent to the GPU, and it is a much bigger model than a classifier, for example, because it also contains a decoding part. So you might want to reduce the number of steps or the size of n_d and n_a and see if it fits into your GPU.
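
A minimal sketch of what that configuration could look like (the values below are only placeholders showing where the n_d, n_a, and n_steps parameters go, not a tested recommendation for this dataset):

import torch
from pytorch_tabnet.pretraining import TabNetPretrainer

# A smaller architecture: fewer decision steps and narrower n_d / n_a
# layers shrink both the encoder and the decoder, reducing GPU memory use.
unsupervised_model = TabNetPretrainer(
    n_d=8,        # width of the decision prediction layer
    n_a=8,        # width of the attention embedding
    n_steps=3,    # number of sequential decision steps
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    mask_type='entmax',
)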

Optimox commented 1 year ago

https://github.com/dreamquark-ai/tabnet/pull/348

New release is out, so your problem might be solved by installing tabnet 4.0. Please let me know if things work with the latest release.
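
For reference, upgrading from PyPI should simply be `pip install --upgrade pytorch-tabnet` (assuming the package was originally installed with pip).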

zjgbz commented 1 year ago

I will try it now. Thank you very much!

zjgbz commented 1 year ago

Hi @Optimox,

It works for my task with tabnet 4.0! I used the default values of n_a, n_d, and n_steps, and ran TabNetPretrainer on the training dataset of dimension 10581339 x 79 and the validation dataset of dimension 2645335 x 79. I am just wondering how you achieved this, because as a next step I need to double the number of observations in the training and validation datasets, and I would like to know whether a larger dataset will only require more running time, or whether I might run into the CUDA out of memory error again. Thank you very much!

Optimox commented 1 year ago

@zjgbz I did nothing recently; the problem you raised had already been fixed, but we were waiting for a release. The fix was done a long time ago, so I did not remember it.

We used to stack the predictions on the GPU, which makes things faster but is problematic for large datasets, especially for the pretrainer, since the stacked prediction has the same size as the original dataset. Now this is fixed.
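
A minimal sketch of the idea (not the library's actual code; `network`, the dataloader, and the forward signature are placeholders): each batch's output is moved to the CPU before being stacked, so the GPU only ever holds one batch of reconstructions rather than a tensor the size of the whole dataset.

import torch

def predict_on_cpu(network, dataloader, device):
    # Illustrative only: accumulate per-batch outputs on the CPU instead of
    # concatenating them on the GPU, so peak GPU memory stays at one batch.
    network.eval()
    outputs = []
    with torch.no_grad():
        for X in dataloader:
            X = X.to(device).float()
            out = network(X)           # placeholder forward pass returning a tensor
            outputs.append(out.cpu())  # move off the GPU before stacking
    return torch.cat(outputs, dim=0).numpy()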

Closing the issue,

Best,

Seb