EagleW / PaperRobot

Code for PaperRobot: Incremental Draft Generation of Scientific Ideas
https://aclanthology.org/P19-1191
MIT License
472 stars 134 forks

CUDA out of memory #5

Closed rzhangpku closed 5 years ago

rzhangpku commented 5 years ago

When I run python train.py --data_path data/pubmed_abstract --model_dp abstract_model/ --gpu 1 I get this error:

----------
Epoch 0/99
0 batches processed. current batch loss: 11.326438
1 batches processed. current batch loss: 11.006483
2 batches processed. current batch loss: 10.861076
3 batches processed. current batch loss: 10.887144
4 batches processed. current batch loss: 11.033303
Traceback (most recent call last):
  File "train.py", line 236, in <module>
    batch_o_t, teacher_forcing_ratio=1)
  File "/home/rongz/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rongz/PaperRobot/New paper writing/memory_generator/seq2seq.py", line 18, in forward
    stopwords, sflag)
  File "/home/rongz/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rongz/PaperRobot/New paper writing/memory_generator/Decoder.py", line 134, in forward
    max_source_oov, term_output, term_id, term_mask)
  File "/home/rongz/PaperRobot/New paper writing/memory_generator/Decoder.py", line 68, in decode_step
    term_context, term_attn = self.memory(_h.unsqueeze(0), term_output, term_mask, cov_mem)
  File "/home/rongz/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rongz/PaperRobot/New paper writing/memory_generator/utils.py", line 32, in forward
    e_t = self.vt_layers[i](torch.tanh(enc_proj + dec_proj).view(batch_size * max_enc_len, -1))
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.91 GiB total capacity; 10.37 GiB already allocated; 5.06 MiB free; 1019.61 MiB cached)

Here is my GPU information:

➜  New paper writing git:(master) ✗ nvidia-smi
Sat Jun 15 20:48:37 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:04:00.0 Off |                  N/A |
| 25%   42C    P0    58W / 250W |      0MiB / 12196MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

And before I run python train.py --data_path data/pubmed_abstract --model_dp abstract_model/ --gpu 1, all 12196MiB of GPU memory is free. Can you help me? Thank you very much!
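
For completeness, a quick check like the one below from inside PyTorch (a minimal sketch; device index 0 is assumed here) can confirm what the process itself sees before training starts:

```python
# Minimal sketch: report total GPU capacity plus what PyTorch has currently
# allocated and cached on device 0 (assumed to be the TITAN Xp above).
import torch

device = torch.device("cuda:0")
props = torch.cuda.get_device_properties(device)
print(f"total:     {props.total_memory / 1024 ** 2:.0f} MiB")
print(f"allocated: {torch.cuda.memory_allocated(device) / 1024 ** 2:.0f} MiB")
print(f"cached:    {torch.cuda.memory_cached(device) / 1024 ** 2:.0f} MiB")
```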

EagleW commented 5 years ago

@rzhangpku Thank you for your interest in our work. I think you can reduce the batch size to 40 or smaller by appending --batch_size 40, and also uncomment line 254 in train.py.
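
In other words, the flag is simply appended to the original command (40 is just a starting point; go lower if it still runs out of memory):

```
python train.py --data_path data/pubmed_abstract --model_dp abstract_model/ --gpu 1 --batch_size 40
```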

EagleW commented 5 years ago

@rzhangpku You can drop me an email if you still have this problem

rzhangpku commented 5 years ago

As you suggested, I changed the batch size to 30 by appending --batch_size 30 and uncommenting line 254 in train.py, and the problem is solved. Thanks a lot for your quick and kind reply.

EagleW commented 5 years ago

Hope you enjoy the experiments and have a good weekend!

rzhangpku commented 5 years ago

Your work on PaperRobot is excellent and impressive! Have a nice weekend!

ClaireZTH commented 4 years ago

I have the same problem, but I cannot solve it by decreasing the batch size.

EagleW commented 4 years ago

@ClaireZTH Maybe you can decrease the batch size further or use a server with larger GPU memory.
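
If it is unclear how far the batch size has to drop, a rough sketch like the following can probe for a value that fits (run_training is a hypothetical wrapper around one training run, and the starting value is arbitrary; this is not part of the PaperRobot code):

```python
# Hedged sketch: halve the batch size on every CUDA out-of-memory error
# until a training run fits in GPU memory.
import torch

def find_fitting_batch_size(run_training, start=40, minimum=1):
    batch_size = start
    while batch_size >= minimum:
        try:
            run_training(batch_size)
            return batch_size                # training ran without running out of memory
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise                        # unrelated error: re-raise it
            torch.cuda.empty_cache()         # release cached CUDA blocks before retrying
            batch_size //= 2                 # try a smaller batch
    raise RuntimeError("No batch size fit in GPU memory")
```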