ChunchuanLv / AMR_AS_GRAPH_PREDICTION

cuda runtime error: out of memory #13

Closed ButteredGroove closed 5 years ago

ButteredGroove commented 5 years ago

Hi, and thanks for the AMR parser and paper! I was able to use it to train a model and get scores for LDC2017. It ran fine: my 12GB K80 GPU went up to around 10GB of memory usage, but training finished without issue.

I then grabbed another data set and tried again. src/preprocessing, src/rule_system_build.py, and src/data_build.py all completed. However, src/train.py crashed with an out-of-memory error. It gets as far as starting epoch 1, but crashes after processing a seemingly random number of batches.

I tried smaller batch sizes via train.py's -batch_size argument, but even a batch size of 1 resulted in an OOM error.

Is there another train.py setting you'd recommend to prevent the crash? Any other ideas?

Here are the details of the crash:

```
THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
Traceback (most recent call last):
  File "src/train.py", line 695, in <module>
    main()
  File "src/train.py", line 670, in main
    trainModel(AmrModel, AmrDecoder,training_data, dev_data, dicts, optim,best_f1=f1)
  File "src/train.py", line 431, in trainModel
    train_loss = trainEpoch(epoch)
  File "src/train.py", line 340, in trainEpoch
    rel_prob,roots = model((rel_batch,rel_index_batch,srcBatch, posteriors_likelihood_score[0]),rel=True)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jovyan/AMR_AS_GRAPH_PREDICTION-master/parser/models/__init__.py", line 289, in forward
    src_enc = self.rel_encoder(srcBatch,indexed_posterior)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jovyan/AMR_AS_GRAPH_PREDICTION-master/parser/models/MultiPassRelModel.py", line 234, in forward
    Outputs = self.rnn(poster_emb, hidden)[0]
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py", line 162, in forward
    output, hidden = func(input, self.all_weights, hx)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/_functions/rnn.py", line 351, in forward
    return func(input, *fargs, **fkwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/function.py", line 284, in _do_forward
    flat_output = super(NestedIOFunction, self)._do_forward(*flat_input)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/function.py", line 306, in forward
    result = self.forward_extended(*nested_tensors)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/_functions/rnn.py", line 293, in forward_extended
    cudnn.rnn.forward(self, input, hx, weight, output, hy)
  File "/usr/local/lib/python3.6/dist-packages/torch/backends/cudnn/rnn.py", line 291, in forward
    fn.reserve = torch.cuda.ByteTensor(reserve_size.value)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:66
```
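Note that the allocation fails inside cuDNN's RNN forward pass while reserving its workspace, whose size grows with input sequence length, so a single unusually long sentence in the new corpus can trigger an OOM even at batch size 1. Below is a minimal sketch of pre-filtering such outliers before training; the filter_long_sentences helper, the 100-token threshold, and the whitespace tokenization are all illustrative assumptions, not anything train.py provides:

```python
# Minimal sketch: drop unusually long sentences that can blow up RNN
# memory. MAX_TOKENS is an arbitrary cutoff, not a train.py setting;
# tune it to your corpus and GPU.
MAX_TOKENS = 100

def filter_long_sentences(sentences, max_tokens=MAX_TOKENS):
    """Keep sentences at or under max_tokens tokens; report the rest."""
    kept, dropped = [], []
    for sent in sentences:
        # Whitespace splitting stands in for the parser's real tokenizer.
        if len(sent.split()) <= max_tokens:
            kept.append(sent)
        else:
            dropped.append(sent)
    print("kept %d sentences, dropped %d over %d tokens"
          % (len(kept), len(dropped), max_tokens))
    return kept
```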

zjxs1997 commented 5 years ago

My suggestion is to try running it on the CPU first; maybe something else is going wrong, and running on the GPU doesn't report the exact error.
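One way to force a CPU run without editing the code is to hide the GPU before PyTorch initializes CUDA. A minimal sketch, assuming train.py only moves tensors to the GPU when one is visible (if it calls .cuda() unconditionally, the script itself would need changes):

```python
# Minimal sketch: hide all GPUs so PyTorch falls back to the CPU.
# Must run before torch initializes CUDA; equivalently, launch the
# script with:  CUDA_VISIBLE_DEVICES="" python src/train.py ...
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch
print(torch.cuda.is_available())  # False -> everything stays on the CPU
```

A CPU run is slow, but it surfaces the underlying Python error with a full stack trace instead of an opaque CUDA allocation failure.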

ButteredGroove commented 5 years ago

Good idea. Thank you.

I went ahead and found a GPU with more memory (24GB), and training worked. Since the increased requirement came from my own corpus rather than any clear issue with the code itself, I'll go ahead and close this.
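For anyone landing here later, one way to confirm that a new corpus has raised the memory requirement is to log peak GPU usage during training. A minimal sketch; note that torch.cuda.max_memory_allocated() was added in PyTorch 0.4, so it may not exist in the older build shown in the traceback above, and the log_peak_gpu_memory helper and its call site are assumptions:

```python
import torch

def log_peak_gpu_memory(tag=""):
    """Print the peak GPU memory allocated so far (PyTorch >= 0.4)."""
    if torch.cuda.is_available():
        peak_gb = torch.cuda.max_memory_allocated() / (1024 ** 3)
        print("[%s] peak GPU memory allocated: %.2f GB" % (tag, peak_gb))

# e.g. call once per epoch inside trainModel:
# log_peak_gpu_memory("epoch %d" % epoch)
```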