ShuyangCao / cliff_summ

Code for EMNLP 2021 paper "CLIFF: Contrastive Learning for Improving Faithfulness and Factuality in Abstractive Summarization"
Apache License 2.0

CUDA out of memory when training pegasus with constructed data even when the batch size is set to 1 #8

Closed: Zhang-Henry closed this 2 years ago

Zhang-Henry commented 2 years ago
  1. My GPUs are two NVIDIA GeForce RTX 3080 cards with 10 GB of memory each. When training the pegasus model, the dataset loads successfully, but CUDA then runs out of memory even when the batch size is set to 1. Is this because the GPU memory is not large enough?
  2. By the way, there is a warning after the datasets are loaded:

    Some weights of PegasusForContrastive were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['classification_head.dense.weight', 'classification_head.dense.bias', 'classification_head.out_proj.weight', 'classification_head.out_proj.bias']

Is that normal or abnormal?

How can I solve these problems?

The main error message is shown below:

    Dataset Loaded.
    Dataset Loaded.
    Some weights of PegasusForContrastive were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['classification_head.dense.weight', 'classification_head.dense.bias', 'classification_head.out_proj.weight', 'classification_head.out_proj.bias']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    Some weights of PegasusForContrastive were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['classification_head.dense.weight', 'classification_head.dense.bias', 'classification_head.out_proj.weight', 'classification_head.out_proj.bias']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
      0%| | 0/10000 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "contrastive_train.py", line 55, in <module>
        main()
      File "contrastive_train.py", line 51, in main
        trainer.train()
      File "/usr/local/anaconda3/envs/cliff/lib/python3.7/site-packages/transformers/trainer.py", line 1120, in train
        tr_loss += self.training_step(model, inputs)
      File "/usr/local/anaconda3/envs/cliff/lib/python3.7/site-packages/transformers/trainer.py", line 1524, in training_step
        loss = self.compute_loss(model, inputs)
      File "/home/hqh/Desktop/cliff_summ-main/models/pegasus/contrastive_trainer.py", line 84, in compute_loss
        loss, _ = self._compute_loss(model, inputs)
      File "/home/hqh/Desktop/cliff_summ-main/models/pegasus/contrastive_trainer.py", line 51, in _compute_loss
        model_output = model(**inputs, use_cache=False)
      File "/usr/local/anaconda3/envs/cliff/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/local/anaconda3/envs/cliff/lib/python3.7/site-packages/fairscale/nn/data_parallel/sharded_ddp.py", line 230, in forward
        return self.module(*inputs, **kwargs)
      File "/usr/local/anaconda3/envs/cliff/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/hqh/Desktop/cliff_summ-main/models/pegasus/contrastive_model.py", line 174, in forward
        return_dict=return_dict,
      File "/usr/local/anaconda3/envs/cliff/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/local/anaconda3/envs/cliff/lib/python3.7/site-packages/transformers/models/pegasus/modeling_pegasus.py", line 1163, in forward
        return_dict=return_dict,
      File "/usr/local/anaconda3/envs/cliff/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/local/anaconda3/envs/cliff/lib/python3.7/site-packages/transformers/models/pegasus/modeling_pegasus.py", line 1024, in forward
        use_cache=use_cache,
      File "/usr/local/anaconda3/envs/cliff/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/local/anaconda3/envs/cliff/lib/python3.7/site-packages/transformers/models/pegasus/modeling_pegasus.py", line 443, in forward
        output_attentions=output_attentions,
      File "/usr/local/anaconda3/envs/cliff/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/local/anaconda3/envs/cliff/lib/python3.7/site-packages/transformers/models/pegasus/modeling_pegasus.py", line 199, in forward
        value_states = self._shape(self.v_proj(key_value_states), -1, bsz)
      File "/usr/local/anaconda3/envs/cliff/lib/python3.7/site-packages/transformers/models/pegasus/modeling_pegasus.py", line 171, in _shape
        return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
    RuntimeError: CUDA out of memory. Tried to allocate 12.00 MiB (GPU 1; 9.78 GiB total capacity; 7.07 GiB already allocated; 24.81 MiB free; 7.09 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

ShuyangCao commented 2 years ago

Hi, fine-tuning large models takes at least 16 GB of GPU memory, so 10 GB is not enough. If you really want to fine-tune with 10 GB of memory, you can try reducing max_input_length, though this will truncate the input documents. As for question 2, that warning is normal: the classification head is newly added on top of the pretrained checkpoint, so its weights are expected to be randomly initialized.
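For reference, a minimal sketch of how shorter inputs reduce memory, assuming the standard Hugging Face tokenizer API (the exact option that controls max_input_length in the CLIFF training scripts may be named differently, and 512 is an illustrative value, not a tuned setting):

    from transformers import PegasusTokenizer

    tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")

    # Truncate source documents to 512 tokens instead of the model's
    # 1024-token maximum.
    batch = tokenizer(
        ["a long source document ..."],
        max_length=512,
        truncation=True,
        padding=True,
        return_tensors="pt",
    )

    # Self-attention activation memory grows roughly quadratically with
    # sequence length, so halving max_length frees a large share of it.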