Closed: zjgbz closed this issue 1 year ago
Hello,
The data is only put on the GPU batch by batch. However, the pretrainer model itself is sent to the GPU, and it is a much bigger model than a classifier, for example, because it has a decoding part. So you might want to try reducing the number of steps or the size of `n_d` and `n_a` and see if it fits into your GPU.
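For instance, a minimal sketch of that suggestion (the smaller values below are illustrative assumptions, not tuned recommendations):

```python
# Minimal sketch: shrink the model so the pretrainer's encoder + decoder
# fit into GPU memory. Library defaults are n_d=8, n_a=8, n_steps=3;
# the values below are illustrative, not recommendations.
from pytorch_tabnet.pretraining import TabNetPretrainer

pretrainer = TabNetPretrainer(
    n_d=4,       # decision-layer width (default 8)
    n_a=4,       # attention-layer width (default 8)
    n_steps=2,   # number of sequential steps (default 3)
)
```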
https://github.com/dreamquark-ai/tabnet/pull/348
New release is out, so your problem might be solved by installing tabnet 4.0. Please let me know if things work with the latest release.
I will try it now. Thank you very much!
Hi @Optimox,
It works for my task with tabnet 4.0! I use the default values of `n_a`, `n_d`, and `n_steps`, and run `TabNetPretrainer` on a training dataset of dimension 10581339 x 79 and a validation dataset of dimension 2645335 x 79. I am just wondering how you fixed this, because in the next step I need to double the number of observations in the training and validation datasets, and I am wondering whether, with the larger dataset, I will just need more running time or whether I might hit a CUDA out-of-memory error again. Thank you very much!
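For reference, the call with library defaults looks roughly like this (a minimal sketch; `X_train` and `X_valid` are assumed to be the numpy arrays from the original post):

```python
# Minimal sketch using the library defaults mentioned above
# (n_d=8, n_a=8, n_steps=3); X_train / X_valid are assumed to be
# the numpy arrays described in the original post.
from pytorch_tabnet.pretraining import TabNetPretrainer

pretrainer = TabNetPretrainer()   # all defaults
pretrainer.fit(X_train=X_train, eval_set=[X_valid])
```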
@zjgbz I did nothing recently; the problem you raised was already fixed, but we were waiting for a release. The fix was done a long time ago, so I did not remember it.
We used to stack the predictions on the GPU, which makes things faster but is problematic for large datasets, and especially for the pretrainer, since its prediction has the same size as the original dataset. Now this is fixed.
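Roughly, the difference looks like this (an illustrative sketch, not the library's actual code):

```python
# Illustrative sketch of why stacking per-batch outputs on the GPU
# exhausts memory for a pretrainer, and the usual fix of moving each
# batch to CPU before concatenating.
import torch

@torch.no_grad()
def predict_gpu_stack(model, loader, device="cuda"):
    # Old pattern: each batch's output stays on the GPU, so the stacked
    # result needs GPU memory for the whole dataset. For a pretrainer the
    # reconstruction has the size of the original data.
    outs = [model(x.to(device)) for x in loader]
    return torch.cat(outs, dim=0)   # entire result lives on the GPU

@torch.no_grad()
def predict_cpu_stack(model, loader, device="cuda"):
    # Fixed pattern: move each batch back to CPU immediately, so GPU
    # memory stays bounded by one batch no matter how large the dataset.
    outs = [model(x.to(device)).cpu() for x in loader]
    return torch.cat(outs, dim=0)   # result lives in CPU RAM
```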
Closing the issue,
Best,
Seb
Hi TabNet Team,
I tried to run `TabNetPretrainer` on a tabular dataset (10581339 x 79). I first read it into CPU memory and then converted it to a numpy array, which is the input format supported by `TabNetPretrainer`, using the code below.
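(The original snippet was not preserved; the following is a hedged reconstruction of the described setup. The file paths, `pretraining_ratio`, and device choice are assumptions, not values from the post.)

```python
# Hedged reconstruction of the described setup; the original snippet was
# not preserved. File paths are hypothetical placeholders.
import numpy as np
from pytorch_tabnet.pretraining import TabNetPretrainer

X_train = np.load("X_train.npy")   # ~10,581,339 x 79, hypothetical path
X_valid = np.load("X_valid.npy")   # ~2,645,335 x 79, hypothetical path

pretrainer = TabNetPretrainer(device_name="cuda")
pretrainer.fit(
    X_train=X_train,
    eval_set=[X_valid],
    batch_size=2048,               # also tried 512; same OOM error
    pretraining_ratio=0.8,         # assumed value, not stated in the post
)
```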
The dimension of `X_train` is 10581339 x 79, and the dimension of `X_valid` is 2645335 x 79. The error message was a CUDA out-of-memory error. I tried decreasing the `batch_size` to 512, but the same error happened. When I reduced `X_train` and `X_valid` to 1/10 of their size (roughly 1M x 79 and 0.3M x 79), I could pretrain the model successfully even with `batch_size = 2048`. Does this mean that `TabNetPretrainer` first loads all the data into the GPU and then processes it batch by batch? May I have your help with this issue? Thank you very much.