ThilinaRajapakse / simpletransformers

Transformers for Information Retrieval, Text Classification, NER, QA, Language Modelling, Language Generation, T5, Multi-Modal, and Conversational AI
https://simpletransformers.ai/
Apache License 2.0

[Beginner] ClassificationModel Running out of Memory, long training Epochs #920

Closed: NiklasHoltmeyer closed this issue 3 years ago

NiklasHoltmeyer commented 3 years ago

Disclaimer: I don't know where to post this question. If this is not the right place for SimpleTransformers beginner questions, I would appreciate a pointer to the right place.


Hi guys, I am new to deep learning and wanted to train a binary (sentiment) classification model using SimpleTransformers. As a dataset I took Sentiment140 (1.6 million tweets: 800k positive, 800k negative). The training itself works, but depending on the size of the dataset, Google Colab crashes. If I split the 1.6 million tweets into 1.28 million training and 0.32 million test samples, the run crashes after:

[2020-12-28 16:55:15,023] {classification_model.py:1147} INFO -  Converting to features started. Cache is not used.
100% 1278719/1278719 [09:25<00:00, 2260.76it/s]

(1) Is this normal? If I reduce the split to 800k training and 160k test samples, Google Colab does not crash, but one epoch takes 4 hours. (This usually works, though sometimes 800k training samples also crashes as described above. I don't even know whether the training itself completes; since an epoch lasts 4 hours, I have never run it to the end.) I don't know how comparable the two are, but in TensorFlow I trained a CNN/BiLSTM network on the entire dataset and an epoch took only 5 minutes. (2) Does 4 hours make sense, or have I made a gross error?

[2020-12-28 17:45:10,844] {classification_model.py:1147} INFO -  Converting to features started. Cache is not used.
100% 800000/800000 [05:44<00:00, 2638.77it/s]
Epoch 1 of 1: 0% 0/1 [00:00<?, ?it/s]
Epochs 0/1. Running Loss: 0.6640: 0% 375/100000 [01:04<3:50:03, 7.19it/s]

import torch
torch.cuda.is_available()
True
model_type, model_name = 'roberta', 'roberta-base'

model_args = {
    'output_dir': 'outputs/',
    'cache_dir': 'cache/',

    'max_seq_length': 144,
    'num_train_epochs': 1,  # 50
    'learning_rate': 1e-3,
    'adam_epsilon': 1e-8,
    'early_stopping_delta': 1e-3,
    'early_stopping_patience': 5,  # 5
    'overwrite_output_dir': True,
    'manual_seed': True,
    'silent': False,
}

from simpletransformers.classification import ClassificationModel

model = ClassificationModel(model_type=model_type, model_name=model_name, args=model_args,
                            use_cuda=True,
                            num_labels=2)

I also tried adding 'eval_accumulation_steps': 20 to my model_args, but it still crashed before training started.

Thanks in advance!

ThilinaRajapakse commented 3 years ago

I think the full dataset of 1.6 million tweets is stretching the available RAM on Colab to the limit. However, it looks like the preprocessing completed before it crashed and it then crashed when trying to cache the features. If that's the case, you might be able to use the full dataset by setting "no_cache": True in model_args.
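
For reference, a minimal sketch of that change, reusing the model_args dict from the first comment:

model_args['no_cache'] = True  # don't cache the converted features; saves RAM/disk at the cost of re-converting each run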

The time taken for training depends a lot on the hardware and 4 hours per epoch seems reasonable for such a large dataset on a Colab GPU. Transformer models are generally much larger and more resource-intensive compared to the typical LSTM or CNN model, so it makes sense for the training to take longer. However, you will rarely need to do more than 1 or 2 epochs with a Transformer model. You'll probably be fine with just 1 epoch considering the dataset is quite large.

If you want to speed things up, consider using distilroberta-base instead of roberta-base. It's a smaller model with close to the same performance. Also, you should make your train_batch_size as large as your GPU can handle, i.e. the largest value which doesn't throw a CUDA memory error. Generally, this will be around 4-16.
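
A sketch combining both suggestions (the batch size here is only an illustrative starting point, not a tuned value):

from simpletransformers.classification import ClassificationModel

model_args['train_batch_size'] = 16  # increase until you hit a CUDA out-of-memory error, then back off
model = ClassificationModel('roberta', 'distilroberta-base', args=model_args,
                            use_cuda=True, num_labels=2)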

NiklasHoltmeyer commented 3 years ago

0) I still got an out-of-memory error while using "no_cache": True with the full dataset. With this setting and train_batch_size = 40 on 800k training samples, one epoch now takes 2 hours, which is a great improvement :) I'll just run 2 epochs and should be fine, maybe?

1) Regarding train_batch_size: I wanted to start low and tried 'train_batch_size': 125, and I already ran out of memory :D

RuntimeError: CUDA out of memory. Tried to allocate 106.00 MiB (GPU 0; 14.73 GiB total capacity; 13.45 GiB already allocated; 69.88 MiB free; 13.73 GiB reserved in total by PyTorch)

I tried it with a "clean" runtime. 'train_batch_size': 50 gave me the same error; 'train_batch_size': 40 worked.
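
One way to automate this trial and error (a hypothetical sketch, not from the library docs; train_df stands in for the training dataframe and model_args is the dict from above):

import torch
from simpletransformers.classification import ClassificationModel

# Probe batch sizes from large to small until one fits in GPU memory.
for bs in (64, 50, 40, 32):
    try:
        model_args['train_batch_size'] = bs
        model = ClassificationModel('roberta', 'distilroberta-base', args=model_args,
                                    use_cuda=True, num_labels=2)
        model.train_model(train_df.head(5000))  # short trial run on a slice
        print(f"train_batch_size={bs} fits")
        break
    except RuntimeError as err:
        if "out of memory" not in str(err):
            raise
        torch.cuda.empty_cache()  # release cached blocks before trying a smaller batch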

2) As I said before, I am new to this deep learning area. Is one epoch enough in this case because this is a type of transfer learning?

3) model_type, model_name = "distilbert", "distilroberta-base" gave me:

Can't set hidden_size with value 768 for DistilBertConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "model_type": "distilbert",
  "pad_token_id": 1
}

ThilinaRajapakse commented 3 years ago

In that case, you can also try using lazy loading. This will preprocess the data on the fly, so it doesn't need to keep the full dataset in memory.
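
A minimal sketch of the lazy loading setup, using the arg names from the simpletransformers docs (check them against your installed version; the column indices and delimiter shown are the documented defaults):

model_args.update({
    'lazy_loading': True,
    'lazy_text_column': 0,    # column index of the text in the data file
    'lazy_labels_column': 1,  # column index of the label
    'lazy_delimiter': '\t',
})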

A batch size of about 40 seems reasonable.

Yes, it's because fine-tuning a model is transfer learning.

distilroberta-base is a RoBERTa model, so you need to do: model_type, model_name = "roberta", "distilroberta-base"

NiklasHoltmeyer commented 3 years ago

How could I use lazy loading with a dataframe? Should I save my cleaned dataframe as a CSV (without a header, with the text first and then the label) and just change the separator/delimiter, so that I have something like:

text a text a text a; 1
blalsdlasdlasld asdlasldalsd asldasld; 2

Btw, thanks for your great work!

ThilinaRajapakse commented 3 years ago

Yes, you need to save your dataframe as a CSV (technically a TSV) file. The delimiter should be a tab character.

E.g:

df.to_csv("my_data_file.tsv", sep="\t", index=False)  # index=False keeps the text and label in columns 0 and 1
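
With lazy loading enabled as sketched above, you would then pass the file path instead of the dataframe (assumed usage per the docs, where train_model accepts a path when lazy_loading is set):

model.train_model("my_data_file.tsv")  # file is read line by line instead of loaded into memory
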
NiklasHoltmeyer commented 3 years ago

Thanks, I'll try that :)

model_type, model_name = "roberta", "distilroberta-base" didnt work for me, but model_type, model_name = "distilbert", "distilbert-base-uncased" did :)

It's training. I hope the training will finish before Google Colab kicks me off :D


"bertweet", "vinai/bertweet-covid19-base-uncased" -> takes approx. 8 1/2 hours "bert", "bert-base-uncased" -> approx 4 1/2 hrs "distilbert", "distilbert-base-uncased" -> approx 4 hours

The second and third should work, but the first might be problematic.


Thanks for all your work :) and the great help :)