facebookresearch / MUSE

A library for Multilingual Unsupervised or Supervised word Embeddings

CUDA error: out of memory #87

Closed Avmb closed 5 years ago

Avmb commented 5 years ago

I get an out-of-memory error when I run the supervised training example as described in the README:

CUDA_VISIBLE_DEVICES=2 python supervised.py --src_lang en --tgt_lang es --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec --n_refinement 5 --dico_train default

...
INFO - 11/15/18 01:15:45 - 0:00:47 - Cross-lingual word similarity score average: 0.71701
INFO - 11/15/18 01:15:45 - 0:00:47 - Found 2975 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
Traceback (most recent call last):
  File "supervised.py", line 101, in <module>
    evaluator.all_eval(to_log)
  File "/raid0/amiceli/MUSE/src/evaluation/evaluator.py", line 217, in all_eval
    self.word_translation(to_log)
  File "/raid0/amiceli/MUSE/src/evaluation/evaluator.py", line 120, in word_translation
    dico_eval=self.params.dico_eval
  File "/raid0/amiceli/MUSE/src/evaluation/word_translation.py", line 105, in get_word_translation_accuracy
    scores = query.mm(emb2.transpose(0, 1))
RuntimeError: CUDA error: out of memory

The GPU is a GeForce GTX TITAN X with 12212 MiB of memory, and there is nothing else running on it.

I'm using CUDA 9.2.148, Python 2.7.12, PyTorch 0.4.1, and Faiss just downloaded from the repository and compiled with GPU support.
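For context, the failing line multiplies all query embeddings against the full target embedding matrix in a single call. A rough, hypothetical sketch (not MUSE's actual code; names only mirror the traceback) of doing the same computation in chunks, which keeps the peak GPU allocation small:

import torch

def chunked_topk(query, emb2, k=10, chunk_size=512):
    # query: (n_queries, dim), emb2: (vocab_size, dim)
    # Computes the same nearest-neighbour scores as
    # query.mm(emb2.transpose(0, 1)), but chunk by chunk, keeping only the
    # top-k indices so the full (n_queries x vocab_size) score matrix never
    # has to live on the GPU at once.
    emb2_t = emb2.transpose(0, 1)
    top_indices = []
    for start in range(0, query.size(0), chunk_size):
        scores = query[start:start + chunk_size].mm(emb2_t)
        top_indices.append(scores.topk(k, dim=1)[1])
    return torch.cat(top_indices, 0)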

glample commented 5 years ago

Hi,

Are you using the default dictionaries for evaluation?

Avmb commented 5 years ago

Yes, I just ran the commands in the README.md to get the evaluation files and the English and Spanish embeddings.

Avmb commented 5 years ago

It's the same even if I use CPU Faiss or no Faiss at all.

glample commented 5 years ago

Can you check whether adding the parameter --max_vocab 1000 helps? If so, try --max_vocab 10000 and see at which vocabulary size it breaks.

12 GB should be more than enough for the default --max_vocab 200000, so I'm not sure what is going on.
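As a back-of-the-envelope check (assuming 300-dimensional fastText wiki vectors stored as float32, and the ~3k-pair evaluation dictionary reported in the log above), the tensors involved are nowhere near 12 GB:

vocab = 200000          # default --max_vocab
dim = 300               # fastText wiki vector dimension (assumed)
dico_pairs = 2975       # evaluation pairs reported in the log
bytes_per_float = 4     # float32

emb_mib = vocab * dim * bytes_per_float / (1024.0 ** 2)            # one embedding matrix
scores_mib = dico_pairs * vocab * bytes_per_float / (1024.0 ** 2)  # query.mm(emb2.T) output

print("embedding matrix: ~%d MiB" % emb_mib)    # ~229 MiB
print("score matrix:     ~%d MiB" % scores_mib) # ~2270 MiB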

Avmb commented 5 years ago

Both settings crash, each with a different error:

For --max_vocab 1000 I got:

INFO - 11/16/18 00:43:42 - 0:00:09 - Cross-lingual word similarity score average: 0.33936
INFO - 11/16/18 00:43:42 - 0:00:09 - Found 0 pairs of words in the dictionary (0 unique). 2975 other pairs contained at least one unknown word (2975 in lang1, 2928 in lang2)
Traceback (most recent call last):
  File "supervised.py", line 101, in <module>
    evaluator.all_eval(to_log)
  File "/raid0/amiceli/MUSE/src/evaluation/evaluator.py", line 217, in all_eval
    self.word_translation(to_log)
  File "/raid0/amiceli/MUSE/src/evaluation/evaluator.py", line 120, in word_translation
    dico_eval=self.params.dico_eval
  File "/raid0/amiceli/MUSE/src/evaluation/word_translation.py", line 95, in get_word_translation_accuracy
    assert dico[:, 0].max() < emb1.size(0)
IndexError: too many indices for tensor of dimension 1

For --max_vocab 10000 I got:

INFO - 11/16/18 00:44:01 - 0:00:05 - Cross-lingual word similarity score average: 0.73309
INFO - 11/16/18 00:44:01 - 0:00:05 - Found 1232 pairs of words in the dictionary (989 unique). 1743 other pairs contained at least one unknown word (0 in lang1, 1743 in lang2)
INFO - 11/16/18 00:44:01 - 0:00:06 - 989 source words - nn - Precision at k = 1: 79.069767
INFO - 11/16/18 00:44:01 - 0:00:06 - 989 source words - nn - Precision at k = 5: 88.068756
INFO - 11/16/18 00:44:01 - 0:00:06 - 989 source words - nn - Precision at k = 10: 90.293225
INFO - 11/16/18 00:44:01 - 0:00:06 - Found 1232 pairs of words in the dictionary (989 unique). 1743 other pairs contained at least one unknown word (0 in lang1, 1743 in lang2)
*** stack smashing detected ***: python terminated
Aborted (core dumped)

I suspect the first one might be a bug in your code, but the second one, and possibly the out-of-memory error, look like CUDA problems. If you haven't seen them before, it might be an issue with how CUDA is installed on the machine I'm using.

glample commented 5 years ago

The first one is actually expected. We should probably add a warning, but if the vocabulary size is too small, the evaluation dictionary will simply be empty.
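For illustration, a minimal guard along these lines (hypothetical, not MUSE's actual code; names only mirror the traceback) would turn the IndexError into a warning:

import logging

logger = logging.getLogger(__name__)

def word_translation_accuracy_guarded(dico, emb1, emb2):
    # Skip the evaluation when the filtered dictionary is empty instead of
    # failing on dico[:, 0] with "too many indices for tensor of dimension 1".
    if dico.dim() != 2 or dico.size(0) == 0:
        logger.warning("Evaluation dictionary is empty (vocabulary too small?); "
                       "skipping word translation evaluation.")
        return None
    assert dico[:, 0].max() < emb1.size(0)
    # ... rest of the evaluation would go here ...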

Regarding the second one, I really don't know; I have never seen it before. Do you have another machine you can try on? Also, can you check whether it works on CPU only (to see whether the problem is really CUDA)?

Avmb commented 5 years ago

It turns out the problem was caused by a version mismatch between the CUDA runtime and nvcc on my system. Solved now.
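For anyone landing here with the same symptoms, a quick illustrative way to spot this kind of mismatch is to compare the CUDA version PyTorch was built against with the nvcc found on the PATH:

import subprocess
import torch

# CUDA runtime PyTorch was compiled against vs. the toolkit on this machine.
print("PyTorch built with CUDA: %s" % torch.version.cuda)
print("CUDA available: %s" % torch.cuda.is_available())
print(subprocess.check_output(["nvcc", "--version"]).decode())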