UKPLab / elmo-bilstm-cnn-crf

BiLSTM-CNN-CRF architecture for sequence tagging using ELMo representations.
Apache License 2.0

GPU Running Out of Memory for Train_Chunking.py #2

Closed damitkwr closed 6 years ago

damitkwr commented 6 years ago

Hi,

I tried running Train_Chunking.py on a GPU, namely a 1080Ti and a Tesla V100 AMI, by setting elmo_cuda_device = 0 on line 52. It runs out of memory on both the 1080Ti and the V100, so I am wondering whether there is something I need to change.

Expected Behaviour: For it to basically run.

Actual Behaviour:

```
Initializing ELMo.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "Train_Chunking.py", line 58, in <module>
    pickleFile = perpareDataset(datasets, embLookup)
  File "/home/ubuntu/elmo-bilstm-cnn-crf/util/preprocessing.py", line 40, in perpareDataset
    addEmbeddings(pklObjects['data'][datasetName][datasplit], embeddingsFct, padOneTokenSentence)
  File "/home/ubuntu/elmo-bilstm-cnn-crf/util/preprocessing.py", line 57, in addEmbeddings
    sentence['word_embeddings'] = embeddingsFct(sentence['tokens'])
  File "/home/ubuntu/elmo-bilstm-cnn-crf/neuralnets/ELMoWordEmbeddings.py", line 30, in sentenceLookup
    elmo_vector = self.getElmoEmbedding(sentence)
  File "/home/ubuntu/elmo-bilstm-cnn-crf/neuralnets/ELMoWordEmbeddings.py", line 80, in getElmoEmbedding
    self.elmo = ElmoEmbedder(self.elmo_options_file, self.elmo_weight_file, self.elmo_cuda_device)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/allennlp/commands/elmo.py", line 142, in __init__
    self.elmo_bilm = self.elmo_bilm.cuda(device=cuda_device)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 249, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 176, in _apply
    module._apply(fn)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 176, in _apply
    module._apply(fn)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 176, in _apply
    module._apply(fn)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 182, in _apply
    param.data = fn(param.data)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 249, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/generic/THCStorage.cu:58
```

It runs fine on CPU but it takes a ton of time just to embed.
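In case it is useful for context, this is roughly how the device gets picked when the embedder is created; a minimal sketch only, with placeholder file paths, assuming the variable name from the issue text rather than the exact repository code:

```python
from allennlp.commands.elmo import ElmoEmbedder

# Placeholder paths for the ELMo model files (not the repository's actual config).
options_file = "elmo_options.json"
weight_file = "elmo_weights.hdf5"

elmo_cuda_device = 0    # GPU 0: fast, but this is where the OOM above happens
# elmo_cuda_device = -1   # CPU: works, but embedding the dataset takes much longer

elmo = ElmoEmbedder(options_file, weight_file, cuda_device=elmo_cuda_device)
```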

I also tried creating a cache first with Create_ELMo_Cache.py, which does work on the GPU, BUT the perpareDataset() call on line 58 then raises a different error:

```
FileNotFoundError                         Traceback (most recent call last)
/mnt/sda1/Dropbox/Dropbox/Univeristy Files/Carnegie Mellon/Summer Research/elmo-bilstm-cnn-crf/Train_Chunking.py in <module>()
     56 embLookup.loadCache('embeddings/elmo_cache_deid.pkl')
     57
---> 58 pickleFile = perpareDataset(datasets, embLookup)
     59
     60

/mnt/sda1/Dropbox/Dropbox/Univeristy Files/Carnegie Mellon/Summer Research/elmo-bilstm-cnn-crf/util/preprocessing.py in perpareDataset(datasets, embeddingsClass, padOneTokenSentence)
     41
     42
---> 43 f = open(outputPath, 'wb')
     44 pkl.dump(pklObjects, f, -1)
     45 f.close()

FileNotFoundError: [Errno 2] No such file or directory: 'pkl/deid_ELMoWordEmbeddings_bioEmbeddings_average.pkl'
```

Should I try changing the ELMo mode, since it is looking for an _average.pkl file?

Thanks for your help.

damitkwr commented 6 years ago

It seems the second problem can be solved by simply creating a directory called pkl in the root directory.

damitkwr commented 6 years ago

The second issue is resolved by creating the pkl directory; everything runs smoothly afterwards.
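For anyone hitting the same FileNotFoundError, a one-liner that creates the expected output folder (assuming the pickle is written to a pkl/ directory relative to the working directory, as the error message suggests):

```python
import os

# Create the folder the preprocessing step writes its pickle into; no-op if it already exists.
os.makedirs("pkl", exist_ok=True)
```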

For the first issue, ELMo is simply a memory hog, so I resolved it by running on a Tesla V100 instance on AWS.
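For anyone deciding whether their card has enough headroom, a quick way to check the visible GPU's memory with PyTorch (a minimal sketch; a 1080Ti has roughly 11 GB, while the AWS V100 instances have 16 GB):

```python
import torch

# Print the name and total memory of GPU 0, to judge whether the ELMo biLM
# plus the embedded batches are likely to fit.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, round(props.total_memory / 1024**3, 1), "GB")
else:
    print("No CUDA device visible; embeddings will be computed on the CPU.")
```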

nreimers commented 6 years ago

Hi @damitkwr, yes, the pkl folder was missing from the git repository; I have added it. Thank you for pointing this out.

With ELMo embeddings you face a classic memory/compute trade-off:

You can pre-compute all embeddings for your train/dev/test data. This requires storing 1024 floats (4 bytes each) per token in your dataset, i.e. about 400 MB of RAM per 100,000 tokens. But it saves a lot of time when training the network, because the embeddings for each sentence can simply be loaded from RAM.
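As a quick sanity check of that figure (simple arithmetic, assuming float32 vectors of size 1024):

```python
# 1024-dimensional ELMo vectors stored as float32 (4 bytes per value)
bytes_per_token = 1024 * 4                # 4096 bytes per token
tokens = 100_000
print(bytes_per_token * tokens / 10**6)   # ~409.6 MB, i.e. roughly 400 MB per 100,000 tokens
```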

Otherwise you have to compute them on the fly: you save some RAM, but every epoch has to recompute all ELMo embeddings, adding a lot of computation time per epoch.
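To make the trade-off concrete, here is a minimal sketch of the pre-compute option using allennlp's ElmoEmbedder; the file paths, cache layout, and layer averaging are illustrative assumptions, not the repository's Create_ELMo_Cache.py logic:

```python
import pickle
from allennlp.commands.elmo import ElmoEmbedder

# Placeholder model files and an example corpus.
elmo = ElmoEmbedder("elmo_options.json", "elmo_weights.hdf5", cuda_device=0)
sentences = [["This", "is", "a", "sentence"], ["Another", "one"]]

# Pre-compute once: costs ~4 KB of RAM per token, but later epochs just read from the cache.
cache = {}
for tokens in sentences:
    # embed_sentence returns a (3, num_tokens, 1024) array, one row per biLM layer;
    # averaging the layers is one common way to get a single 1024-dim vector per token.
    layers = elmo.embed_sentence(tokens)
    cache[tuple(tokens)] = layers.mean(axis=0)

with open("elmo_cache.pkl", "wb") as f:
    pickle.dump(cache, f, -1)
```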