allenai / unifiedqa

UnifiedQA: Crossing Format Boundaries With a Single QA System
https://arxiv.org/abs/2005.00700
Apache License 2.0

Cuda Out Of Memory #20

Closed NacirB closed 3 years ago

NacirB commented 3 years ago

Hi,

Thank you for the great repo.

You stated in your BART fine-tuning section that with one 16 GB GPU (a Tesla V100 in my case) we can fit a batch size of up to 64. But I'm still getting CUDA OOM even with bs=4.

shmsw25 commented 3 years ago

Hi @NacirB,

That seems unexpected. Would you check what is the minimum BS you could fit in with one 16GB GPU? And could you check the GPU memory usage (i.e. how much does it take out of 16GB) when you were able to fit the minimum BS?

NacirB commented 3 years ago

Hey @shmsw25 , thanks for your response.

Here is a screenshot of my GPU before running the fine-tuning.

image

And here's a screenshot after fitting bs=4:

image

That's the maximum I can fit in this GPU.

shmsw25 commented 3 years ago

Hi @NacirB, thanks for the screenshot. Would you share the command line you are running as well?

NacirB commented 3 years ago

@shmsw25 here is the command:

python cli.py --do_train --output_dir out/msmarco_unifiedqa --checkpoint model/unifiedQA-uncased/best-model.pt --train_file ../data/msmarco/train.tsv --predict_file ../data/msmarco/dev.tsv --train_batch_size 4 --predict_batch_size 4 --append_another_bos --do_lowercase --verbose --eval_period 10000

shmsw25 commented 3 years ago

Oh, I think I know why it causes OOM --- it's probably related to the fact that you are experimenting with MS MARCO.

Memory usage grows quadratically as the output length increases. Although we set the maximum output length to 100, outputs are usually shorter than 5 in the datasets we experimented with in the paper. In MS MARCO, however, the output is paragraph-level, which is much longer and close to a length of 100. So it essentially requires much more memory.
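To make the scaling concrete, here is a rough back-of-envelope sketch. It counts only the decoder self-attention score matrices, which are the part that grows with the square of the output length; the rest of the model's memory is ignored. The function name, the layer and head counts, and the 4-byte float size are illustrative assumptions, not values taken from the UnifiedQA code.

```python
def decoder_self_attention_bytes(batch_size, output_len,
                                 num_layers=12, num_heads=16,
                                 bytes_per_value=4):
    """Rough size of the decoder self-attention score matrices.

    Hypothetical helper: the defaults (12 layers, 16 heads, 4-byte floats)
    are assumptions for illustration only.
    """
    # One (output_len x output_len) score matrix per head, per example.
    scores_per_layer = batch_size * num_heads * output_len * output_len
    return scores_per_layer * num_layers * bytes_per_value


# Short answers (length 5) at batch size 64 versus
# paragraph-level answers (length 100) at batch size 4.
short_outputs = decoder_self_attention_bytes(batch_size=64, output_len=5)
long_outputs = decoder_self_attention_bytes(batch_size=4, output_len=100)

print(short_outputs)                 # 1228800
print(long_outputs)                  # 30720000
print(long_outputs / short_outputs)  # 25.0
```

Under these assumptions, a batch of 4 with outputs of length 100 needs about 25 times more memory for this term than a batch of 64 with outputs of length 5, even though the batch is 16 times smaller. Actual GPU usage also depends on the input length, the other activations, and the optimizer state, so this is only an illustration of the trend.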

I think you can try running the code with one of the datasets we used, to test whether this hypothesis is correct. If it is, I think there isn't an easy way to avoid OOM for MS MARCO -- and the same should be true for BART or T5 without UnifiedQA as well. Perhaps you have to use more GPUs or reduce the batch size...? Sorry that this isn't more helpful.
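A general workaround when only a small batch fits in memory, not discussed in this thread, is gradient accumulation: run several small micro-batches and average their gradients before each optimizer step. The sketch below uses a toy one-parameter model in plain Python to show that, for a mean loss and equal-sized micro-batches, the averaged gradient matches the full-batch gradient. All names here are hypothetical, and whether `cli.py` exposes an option for this is not confirmed here.

```python
import math


def mse_grad(w, xs, ys):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Average the gradients of consecutive micro-batches.

    Assumes len(xs) is a multiple of micro_batch_size, so every
    micro-batch has the same size.
    """
    starts = range(0, len(xs), micro_batch_size)
    grads = [
        mse_grad(w, xs[i:i + micro_batch_size], ys[i:i + micro_batch_size])
        for i in starts
    ]
    return sum(grads) / len(grads)


# Toy data: 16 examples, processed either at once or as 4 micro-batches of 4.
xs = [float(i) for i in range(1, 17)]
ys = [3.0 * x + 1.0 for x in xs]

full = mse_grad(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, micro_batch_size=4)
print(math.isclose(full, accum))  # True
```

The trade-off is time: each optimizer step takes as many forward and backward passes as there are micro-batches, but peak memory stays at the micro-batch level.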

NacirB commented 3 years ago

Yes, I guess that's what's causing the issue. I think I will continue running the code with bs=4. Thanks for your help, I really appreciate it.