DC-research / TEMPO

The official code for "TEMPO: Prompt-based Generative Pre-trained Transformer for Time Series Forecasting" (ICLR 2024). TEMPO (v1.0) is one of the first open-source time series foundation models for forecasting.

CUDA out of memory #2

Closed: Aditya-iitdh closed this issue 3 months ago

Aditya-iitdh commented 3 months ago

Dear authors,

Thanks for this amazing work. I am interested in reproducing the results of your paper, but I am getting torch.cuda.OutOfMemoryError.

Currently I am trying to run it on Google Colab with a T4 GPU. Do you think using an additional GPU would solve this problem? If not, what else can I try? Below is the complete output message:

/bin/bash: line 2: fg: no job control
mkdir: cannot create directory ‘logs/TEMPO/loar_revin_100_percent_1_prompt_equal_1/’: File exists
mkdir: cannot create directory ‘logs/TEMPO/loar_revin_100_percent_1_prompt_equal_1/ettm2_pmt1_no_pool_TEMPO_6’: File exists
logs/TEMPO/loar_revin_100_percent_1_prompt_equal_1/ettm2_pmt1_no_pool_TEMPO_6/test_336_96_lr0.001.log

Namespace(model_id='etth1_TEMPO_6_prompt_learn_336_96_100', checkpoints='./lora_revin_6domain_checkpoints_1/', task_name='long_term_forecast', prompt=1, num_nodes=1, seq_len=336, pred_len=96, label_len=168, decay_fac=0.5, learning_rate=0.001, batch_size=256, num_workers=0, train_epochs=10, lradj='type3', patience=5, gpt_layers=6, is_gpt=1, e_layers=3, d_model=768, n_heads=4, d_ff=768, dropout=0.3, enc_in=7, c_out=1, patch_size=16, kernel_size=25, loss_func='mse', pretrain=1, freeze=1, model='TEMPO', stride=8, max_len=-1, hid_dim=16, tmax=20, itr=1, cos=1, equal=1, pool=False, no_stl_loss=False, stl_weight=0.001, config_path='./configs/multiple_datasets.yml', datasets='ETTm1,ETTh2,ETTm2,electricity,traffic,weather', target_data='ETTh1', use_token=0, electri_multiplier=1, traffic_multiplier=1, embed='timeF')

['ETTm1', 'ETTh2', 'ETTm2', 'electricity', 'traffic', 'weather']
ETTm1 dataset: ett_m  train 238903  val 79975
ETTh2 dataset: ett_h  self.enc_in = 7  self.data_x = (8640, 7)  train 57463  self.enc_in = 7  self.data_x = (3216, 7)  val 19495
ETTm2 dataset: ett_m  train 238903  val 79975
electricity dataset: custom  train 5771901  val 814377
traffic dataset: custom  train 10213838  val 1431782
weather dataset: custom  train 765576  val 108675
ETTm1 dataset: ett_m  train 238903
ETTh2 dataset: ett_h  self.enc_in = 7  self.data_x = (8640, 7)  train 57463
ETTm2 dataset: ett_m  train 238903
electricity dataset: custom  train 5771901
traffic dataset: custom  train 10213838
weather dataset: custom  train 765576
Way1 1251978
self.enc_in = 7  self.data_x = (3216, 7)  test 19495
trainable params: 308736 || all params: 82207488 || trainable%: 0.38
0%  0/4891 [00:01<?, ?it/s]

Traceback (most recent call last):
  File "/content/TEMPO/main_multi_6domain_release.py", line 292
    outputs, loss_local = model(batch_x, ii, seq_trend, seq_seasonal, seq_resid) #+ model(seq_seasonal, ii) + model(seq_resid, ii)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/TEMPO/models/TEMPO.py", line 446, in forward
    x = self.gpt2_trend(inputs_embeds=x_all).last_hidden_state
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 642, in forward
    return self.get_base_model()(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 1116, in forward
    outputs = block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 651, in forward
    feed_forward_hidden_states = self.mlp(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 572, in forward
    hidden_states = self.act(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/activations.py", line 56, in forward
    return 0.5 * input * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (input + 0.044715 * torch.pow(input, 3.0))))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 460.00 MiB. GPU
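(One quick sanity check before retrying is to see how much memory the Colab GPU actually provides and how much is already in use. A minimal sketch using the standard nvidia-smi query interface, run from a notebook cell prefixed with "!"; this is generic and not part of the TEMPO codebase:)

```bash
# Generic check of the Colab GPU's total, used, and free memory (not TEMPO-specific).
nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv
```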

idevede commented 3 months ago

Hi Aditya,

Thanks for your interest in our work! For the OOM error, could you try reducing the batch size to see if that helps?

Best
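For reference, the Namespace dump in the log above shows the failing run used batch_size=256, which is large for a 16 GB T4. Below is a minimal sketch of a relaunch with a smaller batch size; the flag names are assumed to mirror the Namespace fields printed in the log and may differ from the actual scripts shipped in the repo, so treat everything other than --batch_size as values copied from the failing run rather than recommendations.

```bash
# Hypothetical relaunch with a reduced batch size; flag names are assumed to
# match the Namespace fields printed in the failing run's log.
python main_multi_6domain_release.py \
    --model TEMPO \
    --config_path ./configs/multiple_datasets.yml \
    --datasets ETTm1,ETTh2,ETTm2,electricity,traffic,weather \
    --target_data ETTh1 \
    --seq_len 336 --label_len 168 --pred_len 96 \
    --batch_size 64   # was 256 in the OOM run; halve again if memory is still exhausted
```

Activation memory in the GPT-2 blocks scales roughly linearly with batch size, so dropping from 256 to 64 cuts the per-step footprint to about a quarter.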

Aditya-iitdh commented 3 months ago

Thanks, it worked!

idevede commented 3 months ago

Great! We will close this issue accordingly. 👍