Closed mapingshuo closed 5 years ago
I think you should adjust the batch_size for full GPU usage. The -b
or --batch_size
option lets you change the batch_size. :) It seems 2742MiB is the dataset, and 10296 - 2742 = 7554 MiB would be the main model size. So... I guess you can make your batch_size about 3 times larger.
Wow. 8 V100...
I think the problem is with this line:
device = torch.device("cuda:0" if cuda_condition else "cpu")
Only one GPU is used in that case.
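For context, `"cuda:0"` is only the primary device; wrapping the model in `nn.DataParallel` still scatters each batch across all visible GPUs. A minimal sketch (using a stand-in module, not the repo's actual BERT class):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 2)  # stand-in for the BERT model
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)

if torch.cuda.device_count() > 1:
    # replicates the module on every visible GPU and splits the batch dimension
    model = nn.DataParallel(model)

x = torch.randn(64, 128).to(device)
out = model(x)  # forward runs on all GPUs; outputs are gathered back on cuda:0
print(out.shape)  # → torch.Size([64, 2])
```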
@mapingshuo BTW... what a beautiful experimental device.. V100 x8...
@jiqiujia well, DataParallel automatically distributes the batch to each GPU, as you can see from the memory usage on GPUs 1-7.
@codertimo I changed the batch_size from 64 to 128, which helps a little. When I tried a bigger batch_size (192), I got an Out Of Memory error. I'm trying to figure out the OOM problem now :)
hah @mapingshuo sounds like you're adding a new GPU device
1 GPU is quicker than 8 GPU in parallel, funny.
Oh, I found my vocab was too big (300,000), which gave me 160 million parameters. So I tokenized my original text and set min_freq to 5 when generating the vocabulary. Now the vocab_size is 30,000 and the parameter count is 22 million. Training is much quicker, because I could increase the batch_size from 128 to 1024.
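The min_freq pruning described above can be sketched like this (assumption: whitespace tokenization stands in for whatever tokenizer is actually used):

```python
# Build a vocabulary, keeping only tokens that appear at least min_freq times.
from collections import Counter

corpus = ["the cat sat", "the cat ran", "a dog ran"]
min_freq = 2

counts = Counter(tok for line in corpus for tok in line.split())
vocab = [tok for tok, c in counts.items() if c >= min_freq]
print(sorted(vocab))  # → ['cat', 'ran', 'the']
```

Rare tokens ('sat', 'a', 'dog') fall below the cutoff, which is what shrinks both the vocab and the embedding/output layers.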
Maybe you should warn people not to generate the vocabulary from raw, untokenized text, haha.
Still, only 10% ~ 20% of the GPU is in use. Confused.
@mapingshuo maybe it's time to increase the batch_size as far as possible to fill GPU memory
Try moving the criterion into the forward function in bert.py and returning only the masked language model loss (the encoder output is not used afterwards). The speed will more than double, and GPU utilization will reach 70%-90%.
I learned the trick from a newbie colleague.
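A hedged sketch of that trick: computing the criterion inside forward means each DataParallel replica returns a small per-GPU loss instead of gathering the full encoder output onto cuda:0. The class and layer names below are illustrative stand-ins, not the repo's actual code:

```python
import torch
import torch.nn as nn

class MaskedLMWithLoss(nn.Module):
    """Stand-in model that returns the MLM loss directly from forward()."""

    def __init__(self, vocab_size=100, hidden=32):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, hidden)  # stand-in encoder
        self.head = nn.Linear(hidden, vocab_size)
        self.criterion = nn.NLLLoss(ignore_index=0)

    def forward(self, tokens, labels):
        logits = self.head(self.encoder(tokens))
        log_probs = torch.log_softmax(logits, dim=-1)
        # Loss is computed on each replica's own GPU; only a scalar per replica
        # is gathered back, instead of the full (batch, seq, vocab) tensor.
        return self.criterion(log_probs.transpose(1, 2), labels)

model = MaskedLMWithLoss()
tokens = torch.randint(1, 100, (8, 16))
labels = torch.randint(0, 100, (8, 16))
loss = model(tokens, labels)  # with nn.DataParallel, average the per-GPU losses
```

When the model is wrapped in `nn.DataParallel`, the returned losses are gathered as a small tensor, so you would call `loss.mean().backward()`.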
1 GPU is quicker than 8 GPU in parallel, funny.
hi, can you please reply on how to overcome that? It is a problem I have been facing as well.
Sorry, I forgot...
Try moving the criterion into the forward function in bert.py and returning only the masked language model loss (the encoder output is not used afterwards). The speed will more than double, and GPU utilization will reach 70%-90%.
I learned the trick from a newbie colleague.
Sorry, I'm hitting an OOM problem, and I don't understand exactly what you meant. Is there something different between your screenshot and the author's original code?
Hi, I am currently pretraining BERT on my own data. I use the alpha0.0.1a5 branch (newest version).
I found that only 20% of the GPU is in use.
I am not familiar with pytorch. Anyone know why?