facebookresearch / SentEval

A Python tool for evaluating the quality of sentence embeddings.

Runtime memory error for SNLI dataset #28

Closed: gitathrun closed this issue 6 years ago

gitathrun commented 6 years ago

Hi, I am running my model with the SentEval framework, using SNLI as the target dataset, but I get a runtime memory error after the embedding process completes and the classifier training begins. The error message is below. For the record, the runtime environment is 2 K80 GPUs with CUDA and 112 GiB of system memory; I am not sure whether this process used both GPUs or just one, so the available GPU memory is either 11 GiB or 22 GiB.

```
2018-04-20 11:41:42,001 : Training pytorch-MLP-nhid0-adam-bs64 with standard validation..
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1518241554738/work/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "snm_senteval_dsvm_gpu.py", line 62, in <module>
    results = se.eval(transfer_tasks)
  File "../senteval/engine.py", line 56, in eval
    self.results = {x: self.eval(x) for x in name}
  File "../senteval/engine.py", line 56, in <dictcomp>
    self.results = {x: self.eval(x) for x in name}
  File "../senteval/engine.py", line 94, in eval
    self.results = self.evaluation.run(self.params, self.batcher)
  File "../senteval/snli.py", line 108, in run
    devacc, testacc = clf.run()
  File "../senteval/tools/validation.py", line 218, in run
    validation_data=(self.X['valid'], self.y['valid']))
  File "../senteval/tools/classifier.py", line 79, in fit
    accuracy = self.score(devX, devy)
  File "../senteval/tools/classifier.py", line 120, in score
    devX = torch.FloatTensor(devX).cuda()
  File "/anaconda/envs/py35/lib/python3.5/site-packages/torch/_utils.py", line 69, in _cuda
    return new_type(self.size()).copy_(self, async)
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1518241554738/work/torch/lib/THC/generic/THCStorage.cu:58
```

I also switched the classifier option to sklearn logistic regression (`usepytorch = False`) on a machine with 10 GiB of RAM; an error still appears, but with a different description (it is not a memory error).
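For reference, a minimal sketch of that configuration, with parameter names taken from the SentEval README (the data path is a placeholder; the `classifier` values simply mirror the `pytorch-MLP-nhid0-adam-bs64` log line above):

```python
# Sketch only: SentEval params dict for the sklearn fallback.
params = {'task_path': 'data/senteval_data',  # placeholder path
          'usepytorch': False}                # sklearn LogisticRegression instead of the PyTorch MLP
# Classifier settings matching the log line above; used when
# usepytorch=True, ignored by the sklearn path.
params['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
                        'tenacity': 5, 'epoch_size': 4}
```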

I am just wondering: since SNLI is a 110K-example dataset, how much memory does the classifier need to process the embedded sentences?

  1. Can you share your development experience from when you tested the SNLI dataset?
  2. How much memory were you using at that time?

Many Thanks

kzinmr commented 6 years ago

I faced the same problem in a similar environment and avoided it by editing `classifier.PyTorchClassifier.score()`:

  1. Remove the lines `devX = torch.FloatTensor(devX).cuda()` and `devy = torch.LongTensor(devy).cuda()`.
  2. Replace `devX[i:i + self.batch_size]` with `torch.FloatTensor(devX[i:i + self.batch_size]).cuda()`, and likewise `devy[i:i + self.batch_size]` with `torch.LongTensor(devy[i:i + self.batch_size]).cuda()`.

This worked for me with large datasets like SNLI; a sketch of the patched method follows.
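A sketch of the patched `score()` (the method body is reconstructed from the traceback above, so the surrounding lines are approximate; it uses the PyTorch 0.3-era `Variable`/`volatile` API that this issue predates):

```python
import torch
from torch.autograd import Variable  # PyTorch 0.3-era API

def score(self, devX, devy):
    self.model.eval()
    correct = 0
    # (1) Removed: the eager copy of the *entire* dev set to the GPU
    #     devX = torch.FloatTensor(devX).cuda()
    #     devy = torch.LongTensor(devy).cuda()
    for i in range(0, len(devX), self.batch_size):
        # (2) Convert and move only the current batch, so GPU memory
        # holds batch_size rows at a time instead of the full dev set.
        Xbatch = Variable(torch.FloatTensor(devX[i:i + self.batch_size]).cuda(),
                          volatile=True)
        ybatch = Variable(torch.LongTensor(devy[i:i + self.batch_size]).cuda(),
                          volatile=True)
        output = self.model(Xbatch)
        pred = output.data.max(1)[1]
        correct += pred.long().eq(ybatch.data.long()).sum()
    return 1.0 * correct / len(devX)
```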

aconneau commented 6 years ago

Could you please try to replace this line: https://github.com/facebookresearch/SentEval/blob/master/senteval/tools/classifier.py#L119

with

`if not isinstance(devX, torch.cuda.FloatTensor) and not self.cudaEfficient:`

Thanks
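For context, a sketch of where that guard would sit at the top of `score()` (the surrounding lines are taken from the traceback above; `cudaEfficient` is an existing flag on the classifier, per the suggestion):

```python
# Sketch only: skip the full-tensor GPU copy when cudaEfficient is
# set, so a batch-wise transfer (as in the patch above) is used instead.
if not isinstance(devX, torch.cuda.FloatTensor) and not self.cudaEfficient:
    devX = torch.FloatTensor(devX).cuda()
    devy = torch.LongTensor(devy).cuda()
```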

aconneau commented 6 years ago

Hi, were you able to fix the problem? Thanks, Alexis

aconneau commented 6 years ago

Please re-open the task if not.