alasdairtran / transform-and-tell

[CVPR 2020] Transform and Tell: Entity-Aware News Image Captioning
https://transform-and-tell.ml/

CUDA out of memory #40

Open YujiaHu0819 opened 2 years ago

YujiaHu0819 commented 2 years ago

Hello Alasdair,

When I try to reproduce the test, I get the following error:

RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 1.96 GiB total capacity; 1.09 GiB already allocated; 9.00 MiB free; 1.12 GiB reserved in total by PyTorch)

I have also tried adjusting the batch size in config.yaml, but it did not help. Have you ever encountered this situation, or do you have any advice?

Thank you in advance!

alasdairtran commented 2 years ago

Hi Yujia,

What GPU do you have? From the error, it looks like your GPU only has 2GB of memory. Most of my models, for example, were trained on a Titan V with 12GB of memory.

If reducing the batch size to 1 still gives you an OOM error, then you might want to make the model smaller (e.g. reduce the number of layers or the hidden size), but the performance might also suffer.

alasdairtran commented 2 years ago

Alternatively, if you just want to test my pretrained model (without training), you could also do the evaluation on the CPU, which will take longer to run. To turn off the GPU, you can set CUDA_VISIBLE_DEVICES to nothing, e.g.

CUDA_VISIBLE_DEVICES= tell evaluate ....
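
The same effect can be seen from inside Python; with the variable set to an empty string before PyTorch initialises CUDA, no devices are visible (a minimal sketch, independent of the repo's code):

import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''  # must be set before CUDA is initialised

import torch
print(torch.cuda.is_available())  # False: PyTorch sees no GPUs and falls back to the CPU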
YujiaHu0819 commented 2 years ago

Hello Alasdair, thanks for the info. I realized the problem was that my GPU is too small. After that I tried to run it on the CPU (just for testing) by setting CUDA_VISIBLE_DEVICES to nothing as you suggested, but I still got the following error:

RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

I'm not sure what's wrong, do you have any suggestions?

Thank you so much for your help!

alasdairtran commented 2 years ago

I see. I just pushed two commits (here and here) to check if we're using CPU or GPU and handle the CPU case properly. See if it works.
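
For context, the underlying issue is that the checkpoint was saved on a GPU, so loading it on a CPU-only machine needs an explicit map_location. A minimal sketch of the kind of change involved (the file name and variable names here are illustrative, not the repo's actual code):

import torch

# Pick whichever device is actually available, then remap the checkpoint onto it.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# 'best.th' is a hypothetical checkpoint path; the real path depends on your experiment.
state_dict = torch.load('best.th', map_location=device)  # maps CUDA tensors to CPU when no GPU is present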

BTW, I remember that evaluating the whole test set takes about 40 minutes on a GPU, so it might take many hours on the CPU. If you want a quicker test, you could manually cut the for-loop off early here.
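
The change is just an early exit from the evaluation loop; a hypothetical sketch (the variable and iterator names are illustrative, not the repo's actual code):

# Illustrative only: stop after the first few batches instead of the full test set.
max_batches = 10  # hypothetical cut-off
for i, batch in enumerate(data_iterator):
    if i >= max_batches:
        break
    # ... run the model on the batch and write out the generated captions ...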

YujiaHu0819 commented 2 years ago

Hello again, thanks for the quick reply. That seems to have solved the GPU problem, but now there is an assertion error:

Traceback (most recent call last):
  File "/Users/yujiahu/opt/anaconda3/envs/tell/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/yujiahu/opt/anaconda3/envs/tell/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/yujiahu/Documents/notebook/transform-and-tell-master1/tell/commands/__main__.py", line 103, in <module>
    main()
  File "/Users/yujiahu/Documents/notebook/transform-and-tell-master1/tell/commands/__main__.py", line 99, in main
    args['overrides'], args['eval_suffix'])
  File "/Users/yujiahu/Documents/notebook/transform-and-tell-master1/tell/commands/evaluate.py", line 74, in evaluate_from_file
    device, serialization_dir, eval_suffix, batch_weight_key='')
  File "/Users/yujiahu/Documents/notebook/transform-and-tell-master1/tell/commands/evaluate.py", line 99, in evaluate
    serialization_dir, f'generations{eval_suffix}.jsonl'))
AssertionError

Do you have any advice for this? Thanks a lot!

alasdairtran commented 2 years ago

Yeah if you trace that error (line 99 in evaluate.py), you'll see that we raise an error if generations.jsonl already exists. So you first need to remove that file inside the serialization directory, e.g.

rm expt/nytimes/9_transformer_objects/serialization/generations.jsonl

or whichever experiment you'd like to run. The evaluate script will create generations.jsonl again and put the generated captions in there.
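
For reference, the check that triggers the AssertionError is essentially a guard against overwriting previous results; roughly (paraphrased, not the exact source):

import os

# Paraphrase of the guard around line 99 of evaluate.py: refuse to run if a
# previous generations file would be overwritten.
generations_path = os.path.join(serialization_dir, f'generations{eval_suffix}.jsonl')
assert not os.path.exists(generations_path)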