Hi @NacirB,
That seems unexpected. Would you check what the minimum batch size you can fit on one 16GB GPU is? And could you check the GPU memory usage (i.e., how much of the 16GB it takes) when you are able to fit that minimum batch size?
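(If catching the right moment in a screenshot is awkward, something like the following, assuming PyTorch which this repo uses, printed right after a training step would also tell us the numbers:)

```python
import torch

# Report how much GPU memory this process is actually using (in GB).
# Call right after a forward/backward pass with the smallest batch size that fits.
print(f"allocated:      {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
print(f"reserved:       {torch.cuda.memory_reserved() / 1e9:.2f} GB")
```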
Hey @shmsw25, thanks for your response.
Here is a screenshot of my GPU before running the fine-tuning.
And here's a screenshot after fitting bs=4.
That's the maximum I can fit on this GPU.
Hi @NacirB, thanks for the screenshot. Would you share the command line you are running as well?
@shmsw25 here is the command:
python cli.py --do_train --output_dir out/msmarco_unifiedqa --checkpoint model/unifiedQA-uncased/best-model.pt --train_file ../data/msmarco/train.tsv --predict_file ../data/msmarco/dev.tsv --train_batch_size 4 --predict_batch_size 4 --append_another_bos --do_lowercase --verbose --eval_period 10000
Oh, I think I know why it causes OOM --- it's probably related to the fact that you are experimenting with MS MARCO.
The memory usage grows quadratically as the output length increases. Although we set the maximum output length to 100, it is usually less than 5 in the datasets we experimented with in the paper. In MS MARCO, however, the output is paragraph-level, which is much longer and close to the 100-token limit, so it essentially requires much more memory.
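(As a rough illustration, here is a back-of-the-envelope sketch of how just the decoder self-attention score matrices scale with output length; the layer/head counts below are hypothetical, roughly BART-large sized, so treat the absolute numbers loosely and look at the ratio:)

```python
# Rough estimate of decoder self-attention score memory, ignoring everything else
# (activations, cross-attention, gradients), just to show the quadratic growth.
def attention_score_bytes(batch_size, out_len, n_layers=12, n_heads=16, bytes_per_float=4):
    # Each layer and head keeps an (out_len x out_len) score matrix per example.
    return batch_size * n_layers * n_heads * out_len * out_len * bytes_per_float

for out_len in (5, 100):
    mb = attention_score_bytes(batch_size=4, out_len=out_len) / 1e6
    print(f"output length {out_len:>3}: ~{mb:.1f} MB for attention scores alone")
# Going from length 5 to length 100 multiplies this term by (100 / 5) ** 2 = 400x.
```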
I think you can test running the code with one of the datasets we used, to see if this hypothesis is correct. If it is, I don't think there is an easy way to avoid OOM for MS MARCO -- and the same should be true for BART or T5 without UnifiedQA as well. Perhaps you have to use more GPUs or reduce the batch size...? Sorry that this isn't more helpful.
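(If the goal is just to keep the effective batch size while staying within memory, one workaround might be gradient accumulation. A minimal sketch of a generic PyTorch loop follows; I'm not sure whether cli.py exposes a flag for this, so treat it as an illustration rather than repo code:)

```python
# Hypothetical gradient-accumulation loop: run a small per-step batch (e.g. 4)
# but only update weights every `accum_steps` batches, giving an effective
# batch size of 4 * accum_steps. `model`, `optimizer`, and `train_dataloader`
# are assumed to already exist (HF-style model returning an object with .loss).
accum_steps = 16
optimizer.zero_grad()
for step, batch in enumerate(train_dataloader):
    loss = model(**batch).loss / accum_steps  # scale so accumulated gradients average correctly
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```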
Yes, I guess that's what's causing the issue. I think I will continue running the code with bs=4. Thanks for your help, I really appreciate it.
Hi,
Thank you for the great repo.
You stated in your BART fine-tuning section that with one 16GB GPU (a Tesla V100 in my case) we can fit a batch size of up to 64, but I'm still getting CUDA OOM even with bs=4.