dreamquark-ai / tabnet

PyTorch implementation of TabNet paper : https://arxiv.org/pdf/1908.07442.pdf
https://dreamquark-ai.github.io/tabnet/
MIT License

Large dataset is not training #488

Closed amos-coder closed 11 months ago

amos-coder commented 1 year ago

Describe the bug When I am training with a large dataset, the model trains completely for the given number of epochs, but after that it keeps processing for some time and then the process is killed.

What is the current behavior? I tried with a small dataset and it works completely fine, but when I train with a large dataset this problem occurs.

Expected behavior I need the fit function to finish and move on to the next step.

Screenshots: Screenshot from 2023-06-12 10-57-29; Screenshot from 2023-06-12 10-48-05

Other relevant information: Python version: 3.8.8, Operating System: Ubuntu

Additional context

Optimox commented 1 year ago

Can you share the rest of the error message?

The killed process is due to an out-of-memory error.

So I would suggest to:

  • read your training data by chunks and free your memory -> here X_train[start:end] will simply be a copy of a chunk of a large dataset, you are not freeing memory from your computer but adding extra consumption
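A minimal way to confirm that memory is the culprit, assuming psutil is installed (the synthetic data shapes below are only placeholders for the real dataset):

```python
import numpy as np
import psutil
from pytorch_tabnet.tab_model import TabNetClassifier

def log_ram(tag: str) -> None:
    # Print system RAM usage in GB to spot the growth that precedes the OOM kill.
    mem = psutil.virtual_memory()
    print(f"[{tag}] used {mem.used / 1e9:.1f} GB of {mem.total / 1e9:.1f} GB")

# Synthetic stand-in for the real (much larger) dataset.
X_train = np.random.rand(10_000, 997).astype(np.float32)
y_train = np.random.randint(0, 2, size=10_000)

clf = TabNetClassifier()
log_ram("before fit")
clf.fit(X_train, y_train, max_epochs=1, batch_size=1024)
log_ram("after fit")  # a large jump here points at the post-training step, not the epochs
```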

amos-coder commented 1 year ago
> Can you share the rest of the error message?

That's the complete error message.

> read your training data by chunks and free your memory -> here X_train[start:end] will simply be a copy of a chunk of a large dataset, you are not freeing memory from your computer but adding extra consumption

How do I achieve that? Do I need to write my own dataloader and try it?

Optimox commented 1 year ago

No need for a custom loader, just never load your entire dataset:

  • if you can do this X_train[start:end] it means that you have your entire X_train in memory
  • if your data is saved in a csv file, just read some lines for every chunk and never load the entire dataset. If it's another format, you can certainly load only some chunks rather than the entire dataset
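A minimal sketch of that idea, assuming the data sits in a CSV file with a label column and a pytorch-tabnet version recent enough to expose warm_start in fit (the file name, column name, and chunk size are illustrative):

```python
import pandas as pd
from pytorch_tabnet.tab_model import TabNetClassifier

clf = TabNetClassifier()

# Read the CSV by chunks so the full dataset is never held in memory at once.
for i, chunk in enumerate(pd.read_csv("train.csv", chunksize=200_000)):
    y_chunk = chunk["target"].values
    X_chunk = chunk.drop(columns=["target"]).values.astype("float32")

    clf.fit(
        X_chunk,
        y_chunk,
        max_epochs=1,
        warm_start=(i > 0),  # keep the weights learned on the previous chunks
    )
    del X_chunk, y_chunk, chunk  # release the chunk before reading the next one
```

With this approach the class-balance concern raised below still applies: ideally every chunk, and in particular the first one, should contain both classes.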

amos-coder commented 1 year ago

> No need for a custom loader, just never load your entire dataset:
>
> • if you can do this X_train[start:end] it means that you have your entire X_train in memory
> • if your data is saved in a csv file, just read some lines for every chunk and never load the entire dataset. If it's another format, you can certainly load only some chunks rather than the entire dataset

Thanks for your reply! But the problem is that I am training on binary classification data: if I load and train by chunks, some chunks will be full of 0-labeled data and other chunks will be full of 1-labeled data, and the training will not be efficient. Please correct me if I am wrong.

amos-coder commented 1 year ago

> No need for a custom loader, just never load your entire dataset:
>
> • if you can do this X_train[start:end] it means that you have your entire X_train in memory
> • if your data is saved in a csv file, just read some lines for every chunk and never load the entire dataset. If it's another format, you can certainly load only some chunks rather than the entire dataset

Also, loading and training on the data is not the problem here. After training the desired number of epochs, some process keeps running; I don't know what it is, and it takes a lot of time and memory. I would be happy if you could explain what it is! Thanks

Optimox commented 1 year ago

If your code runs fine with a smaller dataset, it means the issue comes from memory, so smaller chunks should help, and not loading your entire dataset should help as well.

I have no means to reproduce your error, so I can't help you more than that.

You could preprocess your data by chunk beforehand and make sure that all your chunks have the same positive/negative ratio.
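For instance, a sketch of that preprocessing with scikit-learn's StratifiedKFold (the file name, label column, and number of chunks are assumptions): each fold keeps roughly the global positive/negative ratio and is written out as its own chunk file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Only the label column is loaded to decide which rows belong to which chunk.
labels = pd.read_csv("train.csv", usecols=["target"])["target"].values
n_chunks = 10

skf = StratifiedKFold(n_splits=n_chunks, shuffle=True, random_state=0)
for chunk_id, (_, rows) in enumerate(skf.split(np.zeros((len(labels), 1)), labels)):
    keep = set(rows)  # row indices of this chunk, stratified on the label
    chunk = pd.read_csv(
        "train.csv",
        skiprows=lambda i: i != 0 and (i - 1) not in keep,  # keep header + selected rows
    )
    chunk.to_csv(f"chunk_{chunk_id}.csv", index=False)
```

The chunk files written this way can then be fed to fit one at a time, as in the earlier sketch.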

amos-coder commented 1 year ago

Screenshot from 2023-06-13 10-59-02; Screenshot from 2023-06-13 10-57-34

I tried using fewer parameters and 60% of the total dataset, and I still see the processing run for hours after training the desired number of epochs (1 epoch). Thanks for your help in advance.

amos-coder commented 1 year ago

![Screenshot from 2023-06-13 12-29-14](https://github.com/dreamquark-ai/tabnet/assets/78432329/fcaea85a-b0ed-432e-a8ea-d5fbd3d40a79)

I found that line 271 is the reason for the large processing time; after I commented it out, it works perfectly fine.

Optimox commented 1 year ago

The feature importance is computed by default during training. How many columns do you have?

amos-coder commented 1 year ago

> The feature importance is computed by default during training. How many columns do you have?

997 columns

Optimox commented 11 months ago

This #493 should make things easier in the future
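For example, a sketch assuming an installed pytorch-tabnet version that already includes the compute_importance flag from #493; the global importance pass is skipped during fit and, if needed, importances are computed afterwards on a small subsample via explain (the sizes are illustrative):

```python
import numpy as np
from pytorch_tabnet.tab_model import TabNetClassifier

# Synthetic stand-in for a wide dataset (997 columns, as in this issue).
X_train = np.random.rand(50_000, 997).astype(np.float32)
y_train = np.random.randint(0, 2, size=50_000)

clf = TabNetClassifier()
# Skip the feature-importance pass that otherwise runs over the whole training set.
clf.fit(X_train, y_train, max_epochs=1, compute_importance=False)

# If importances are still wanted, estimate them on a small subsample instead.
sample = X_train[np.random.choice(len(X_train), 2_000, replace=False)]
explain_matrix, masks = clf.explain(sample)
importances = explain_matrix.sum(axis=0)
importances = importances / importances.sum()  # normalize so they sum to 1
```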